Pandas: Data Structures and Basic Operations
Pandas introduces two primary data structures: Series and DataFrame. Understanding these is fundamental to effectively using Pandas for data manipulation and analysis.
1. Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It is very similar to a column in a spreadsheet or a SQL table. Each value in a Series has an associated label, called its index.
Creating a Series
import pandas as pd
import numpy as np
# From a list (index will be 0, 1, 2, ...)
s1 = pd.Series([1, 2, 3, 4])
print("Series from list:\n", s1)
# From a list with a custom index
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("\nSeries with custom index:\n", s2)
# From a dictionary (keys become index, values become data)
data = {'apple': 10, 'banana': 20, 'cherry': 15}
s3 = pd.Series(data)
print("\nSeries from dictionary:\n", s3)
# From a scalar value (index must be provided)
s4 = pd.Series(5, index=['x', 'y', 'z'])
print("\nSeries from scalar:\n", s4)
# With missing data: np.nan marks the missing value and makes the dtype float64
s5 = pd.Series([1, 3, np.nan, 6, 8])
print("\nSeries with NaN:\n", s5)
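Two details of the constructors above are easy to miss, so here is a minimal sketch of them. It shows that a single NaN changes the dtype of an otherwise integer Series, and that passing both a dictionary and an index matches labels against the dictionary keys. The label 'kiwi' is a hypothetical one added for illustration and is not part of the examples above.

```python
import numpy as np
import pandas as pd

# A list of integers produces an integer-typed Series...
ints = pd.Series([1, 3, 6, 8])
# ...but one NaN forces the whole Series to float64.
with_nan = pd.Series([1, 3, np.nan, 6, 8])
print(ints.dtype, with_nan.dtype)

# Dictionary plus explicit index: each index label is looked up in the keys.
# 'kiwi' (hypothetical) has no matching key, so its value is NaN;
# 'apple' is not requested in the index, so it is left out.
prices = {'apple': 10, 'banana': 20, 'cherry': 15}
subset = pd.Series(prices, index=['cherry', 'banana', 'kiwi'])
print(subset)
```

The result follows the order of the index argument, not the order of the dictionary.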
Series Attributes and Methods
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], name='My_Series')
print("Values:", s.values)
print("Index:", s.index)
print("Data type:", s.dtype)
print("Name:", s.name)
print("Size:", s.size)
print("Is empty:", s.empty)
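The listing above covers attributes only. As a complement, the following sketch calls a few common Series methods on the same data. The variable names (total, average, top_label, doubled) are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], name='My_Series')

total = s.sum()         # sum of all values
average = s.mean()      # arithmetic mean
top_label = s.idxmax()  # index label of the largest value
doubled = s * 2         # element-wise arithmetic keeps the index and name

print("Sum:", total)
print("Mean:", average)
print("Label of max:", top_label)
print("Doubled:\n", doubled)
print("Summary:\n", s.describe())
```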
Accessing Series Elements
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# By label
print("s['b']:", s['b'])
# By position (use .iloc; integer keys in s[...] on a labeled index are deprecated since pandas 2.1)
print("s.iloc[1]:", s.iloc[1])
# Slicing by label (inclusive of end label)
print("s['a':'c']:\n", s['a':'c'])
# Slicing by position (exclusive of end position)
print("s.iloc[0:2]:\n", s.iloc[0:2])
# Boolean indexing
print("s[s > 25]:\n", s[s > 25])
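Plain square brackets can be ambiguous when the index itself contains integers. The sketch below uses the explicit accessors .loc (labels) and .iloc (positions) to make the intent unambiguous. The second Series, t, is a hypothetical example with integer labels in reverse order.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']         # label-based lookup
by_position = s.iloc[1]       # position-based lookup
label_slice = s.loc['a':'c']  # end label included: a, b, c
pos_slice = s.iloc[0:2]       # end position excluded: positions 0 and 1
print(by_label, by_position, len(label_slice), len(pos_slice))

# Hypothetical Series whose labels are integers in reverse order.
t = pd.Series(['x', 'y', 'z'], index=[2, 1, 0])
print(t.loc[0])   # label 0 is the last element
print(t.iloc[0])  # position 0 is the first element
```

With t, the label 0 and the position 0 refer to different elements, which is why spelling out .loc or .iloc is the safer habit.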
2. DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used Pandas object.
Creating a DataFrame
import pandas as pd
import numpy as np
# From a dictionary of lists (keys become column names)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Berlin']
}
df1 = pd.DataFrame(data)
print("DataFrame from dictionary of lists:\n", df1)
# From a dictionary of Series
data_series = {
    'col1': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'col2': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
}
df2 = pd.DataFrame(data_series) # NaN will be introduced where indices don't match
print("\nDataFrame from dictionary of Series:\n", df2)
# From a list of dictionaries (each dictionary is a row)
records = [
    {'Name': 'Eve', 'Age': 22, 'City': 'Rome'},
    {'Name': 'Frank', 'Age': 40, 'City': 'Madrid'}
]
df3 = pd.DataFrame(records)
print("\nDataFrame from list of dictionaries:\n", df3)
# From a NumPy array (columns and index are optional; both default to 0, 1, 2, ...)
arr_data = np.array([[10, 20, 30], [40, 50, 60]])
df4 = pd.DataFrame(arr_data, columns=['ColA', 'ColB', 'ColC'], index=['Row1', 'Row2'])
print("\nDataFrame from NumPy array:\n", df4)
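To make the defaults and the alignment behavior concrete, here is a small sketch. It builds a DataFrame from a NumPy array without any labels, then rebuilds the dictionary-of-Series case to show where the NaN lands and how it affects the column dtype. The names df_default and aligned are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# No columns= or index= given: both default to integer ranges.
arr = np.array([[10, 20, 30], [40, 50, 60]])
df_default = pd.DataFrame(arr)
print(df_default)

# Series with different indexes are aligned on the union of their labels.
col1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
col2 = pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
aligned = pd.DataFrame({'col1': col1, 'col2': col2})

print(aligned)
print(aligned.dtypes)        # col1 becomes float64 because it now holds a NaN
print(aligned.isna().sum())  # number of missing values per column
```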
Viewing Data
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Rome', 'Madrid'],
    'Salary': [50000, 60000, 75000, 55000, 45000, 80000]
}
df = pd.DataFrame(data)
print("First 3 rows:\n", df.head(3))
print("\nLast 2 rows:\n", df.tail(2))
print("\nDataFrame info (data types, non-null counts):\n")
df.info()
print("\nDescriptive statistics:\n", df.describe())
print("\nDataFrame shape:", df.shape) # (rows, columns)
print("\nColumn names:", df.columns)
print("\nRow index:", df.index)
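One point worth checking by hand is which columns describe() actually summarizes. The sketch below, using the same data, suggests that the default call covers only the numeric columns, while include='all' also reports on the text columns. The variable names stats and everything are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Rome', 'Madrid'],
    'Salary': [50000, 60000, 75000, 55000, 45000, 80000],
})

# Default: only numeric columns are summarized.
stats = df.describe()
print("Summarized columns:", list(stats.columns))
print("Means:\n", stats.loc['mean'])

# include='all' adds the non-numeric columns (count, unique, top, freq).
everything = df.describe(include='all')
print(everything)

# Per-column data types, without the extra detail printed by info().
print(df.dtypes)
```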
Selecting Columns
import pandas as pd
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data)
# Selecting a single column (returns a Series)
print("Single column 'col1':\n", df['col1'])
print("Type of df['col1']:", type(df['col1']))
# Selecting multiple columns (returns a DataFrame)
print("\nMultiple columns ['col1', 'col2']:\n", df[['col1', 'col2']])
print("Type of df[['col1', 'col2']]:", type(df[['col1', 'col2']]))
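The return type depends on the brackets, not on how many columns are named. As a brief sketch, a one-element list still produces a DataFrame, and attribute access is a shorthand for single-bracket selection when the column name is a valid Python identifier.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

as_series = df['col1']   # string key: one-dimensional Series
as_frame = df[['col1']]  # list key: two-dimensional DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# Attribute access returns the same Series as df['col1'].
same = df.col1.equals(df['col1'])
print("df.col1 equals df['col1']:", same)
```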
Further Topics:
- Indexing and Slicing with loc and iloc
- Adding/Deleting Columns
- Conditional Selection
- Handling Missing Data (dropna, fillna)
- Data Type Conversion (astype)
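As a brief preview of these topics, the following sketch touches each one once. It is an illustration under assumed data, not a full treatment: the small DataFrame, its row labels (r1 to r3), and the 'Senior' column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25.0, np.nan, 35.0]},
    index=['r1', 'r2', 'r3'],
)

# loc selects by label, iloc by integer position.
first_age = df.loc['r1', 'Age']
last_name = df.iloc[2, 0]

# Adding and deleting a column (a comparison with NaN evaluates to False).
df['Senior'] = df['Age'] > 30
trimmed = df.drop(columns=['Senior'])

# Conditional selection with a boolean mask.
older = df[df['Age'] > 30]

# Handling missing data: drop the affected rows, or fill in a value.
dropped = df.dropna(subset=['Age'])
filled = df.fillna({'Age': 0})

# Data type conversion (only safe once no NaN remains).
ages_int = filled['Age'].astype(int)
print(first_age, last_name, list(older.index), len(dropped), list(ages_int))
```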
This document lays the groundwork for understanding Pandas by introducing its core data structures and basic ways to create and inspect them. The next steps involve advanced data selection, manipulation, and cleaning.