Pandas: Data Manipulation and Analysis

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It is particularly well-suited for working with tabular data (like spreadsheets or database tables) and time series data.

Key Features:

  • DataFrame Object: A fast and efficient two-dimensional, labeled table for data manipulation with integrated indexing.
  • Series Object: A one-dimensional labeled array capable of holding any data type.
  • Data Alignment: Handles missing data (represented as NaN) gracefully and automatically aligns data based on labels.
  • Group By: Powerful groupby functionality for splitting, applying, and combining data sets.
  • Time Series Functionality: Robust tools for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
  • Flexible I/O: Easy reading and writing of data from various file formats like CSV, Excel, SQL databases, HDF5, etc.
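
As a rough illustration of three of the features listed above (group by, time series tools, and I/O), here is a minimal sketch. The sales records, column names, and dates are invented for this example and are not taken from any real data set.

```python
import io

import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
})

# Group By: split the rows by region, sum each group, combine the results.
units_by_region = sales.groupby("region")["units"].sum()
print(units_by_region)

# Time series: build a daily date range and take a 2-day moving average.
days = pd.date_range("2023-01-01", periods=4, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0], index=days)
moving_avg = daily.rolling(window=2).mean()
print(moving_avg)

# Flexible I/O: write the table as CSV text in memory, then read it back.
csv_text = sales.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

The first value of the moving average is NaN because a 2-day window has only one observation on the first day. An in-memory buffer stands in for a file here so the sketch runs without touching disk; passing a file path to to_csv and read_csv works the same way.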

Getting Started: Installation

You can install Pandas using pip or conda.

Using pip:

pip install pandas

Using conda:

conda install pandas

Basic Concepts: Series and DataFrame

Series

A Series is a one-dimensional array-like object containing a sequence of values (with data types similar to NumPy's) and an associated array of data labels, called its index.

import pandas as pd
import numpy as np

# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Creating a Series with a custom index
s_indexed = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s_indexed)
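
The index is more than a display label: lookups and arithmetic both go through it. The sketch below, using two small Series made up for this example, shows label-based access and how adding two Series aligns values by label rather than by position.

```python
import pandas as pd

# Two illustrative Series whose indexes only partly overlap.
left = pd.Series([10, 20, 30], index=["a", "b", "c"])
right = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Label-based lookup returns the value stored under that label.
value_b = left["b"]
print(value_b)

# Arithmetic aligns on labels; labels present in only one Series give NaN.
total = left + right
print(total)

# Count the missing results, or treat absent labels as 0 while adding.
missing_count = int(total.isna().sum())
filled = left.add(right, fill_value=0)
print(missing_count)
print(filled)
```

Labels "a" and "d" each appear in only one of the two Series, so the plain sum is NaN there, while add with fill_value=0 keeps the value from the Series that has it.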

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects that share one index. It is the most commonly used Pandas object.

import pandas as pd
import numpy as np

# Creating a DataFrame from a dictionary of lists
data = {
    'col1': [1, 2, 3, 4],
    'col2': ['A', 'B', 'C', 'D'],
    'col3': [True, False, True, False]
}
df = pd.DataFrame(data)
print(df)

# Creating a DataFrame with a date index
dates = pd.date_range('20230101', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df2)

# Viewing data
print(df.head(2)) # First 2 rows
print(df.tail(1)) # Last 1 row
print(df.index)
print(df.columns)
print(df.describe()) # Statistical summary (numeric columns only by default)
print(df.T) # Transpose
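
To make the "dictionary of Series" view of a DataFrame concrete, here is a small sketch. The column names and values are hypothetical, chosen only to show that each column is a Series and that columns are aligned on a shared index when the DataFrame is built.

```python
import pandas as pd

# Two illustrative Series with different index labels.
price = pd.Series([1.5, 2.5], index=["x", "y"])
stock = pd.Series([7, 8, 9], index=["x", "y", "z"])

# Building a DataFrame from a dict of Series aligns them on the union of labels.
frame = pd.DataFrame({"price": price, "stock": stock})
print(frame)

# Each column can have its own data type.
print(frame.dtypes)

# Selecting a single column gives back a Series.
column = frame["stock"]
print(type(column).__name__)
```

The row labeled "z" has no price, so that cell is filled with NaN instead of raising an error.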

Further Topics:

  • Data Loading and Saving (CSV, Excel, etc.)
  • Selection, Indexing, and Slicing Data
  • Handling Missing Data
  • Group By Operations
  • Merging, Joining, and Concatenating DataFrames
  • Time Series Analysis
  • Performance Optimization

This document provides a basic introduction to Pandas. More detailed topics, advanced techniques, and practical examples will be covered in subsequent files.