Pandas: Handling Missing Data
Missing data is a common issue in real-world datasets, and Pandas provides excellent tools for identifying, evaluating, and handling it. Missing values are typically represented as NaN (Not a Number) for numerical data and None for object-type data, both of which Pandas generally treats as equivalent to NaN.
1. Identifying Missing Data
Pandas offers isnull() (or isna()) and notnull() (or notna()) methods to detect missing values.
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 7, 8, 9, 10],
'C': [11, 12, 13, np.nan, 15],
'D': [16, 17, 18, 19, 20]
}
df = pd.DataFrame(data)
print("Original DataFrame with NaNs:\n", df)
# Check for missing values (returns boolean DataFrame)
print("\nIs null (boolean):\n", df.isnull())
# Check for non-missing values (returns boolean DataFrame)
print("\nNot null (boolean):\n", df.notnull())
# Count missing values per column
print("\nMissing values per column:\n", df.isnull().sum())
# Count missing values in the entire DataFrame
print("\nTotal missing values in DataFrame:", df.isnull().sum().sum())
# Check if any missing values exist in DataFrame
print("\nAre there any missing values?", df.isnull().any().any())
2. Dropping Missing Data (dropna())
The dropna() method allows you to remove rows or columns that contain missing values.
import pandas as pd
import numpy as np
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 7, 8, 9, 10],
'C': [11, 12, 13, np.nan, 15],
'D': [16, 17, 18, 19, 20]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Drop rows with ANY missing values (default: axis=0, how='any')
df_dropped_rows_any = df.dropna()
print("\nDataFrame after dropping rows with any NaN:\n", df_dropped_rows_any)
# Drop rows with ALL missing values (how='all')
df_all_nan_row = df.copy()
df_all_nan_row.loc[5] = [np.nan, np.nan, np.nan, 21] # Add a row with mostly NaNs
df_all_nan_row.loc[6] = [np.nan, np.nan, np.nan, np.nan] # Add a row with all NaNs
print("\nDataFrame with an all-NaN row:\n", df_all_nan_row)
df_dropped_rows_all = df_all_nan_row.dropna(how='all')
print("\nDataFrame after dropping rows with ALL NaN:\n", df_dropped_rows_all)
# Drop columns with ANY missing values (axis=1)
df_dropped_cols_any = df.dropna(axis=1)
print("\nDataFrame after dropping columns with any NaN:\n", df_dropped_cols_any)
# Drop columns with ALL missing values (axis=1, how='all')
df_all_nan_col = df.copy()
df_all_nan_col['E'] = np.nan # Add an all-NaN column
print("\nDataFrame with an all-NaN column:\n", df_all_nan_col)
df_dropped_cols_all = df_all_nan_col.dropna(axis=1, how='all')
print("\nDataFrame after dropping columns with ALL NaN:\n", df_dropped_cols_all)
# Keep rows that have at least 'thresh' non-NaN values
df_thresh = df.dropna(thresh=3) # Keep row if it has at least 3 non-NaN values
print("\nDataFrame after dropping rows with less than 3 non-NaN values:\n", df_thresh)
# `inplace=True` modifies the DataFrame directly
# df.dropna(inplace=True)
3. Filling Missing Data (fillna())
The fillna() method allows you to replace missing values with a specified value or strategy. This is often preferred over dropping data, especially if you have many missing values.
import pandas as pd
import numpy as np
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 7, 8, 9, 10],
'C': [11, 12, 13, np.nan, 15],
'D': [16, 17, 18, 19, 20]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Fill with a scalar value
df_filled_zero = df.fillna(0)
print("\nDataFrame after filling NaN with 0:\n", df_filled_zero)
# Fill with the mean of the column
df_filled_mean = df.copy() # Work on a copy
df_filled_mean['A'] = df_filled_mean['A'].fillna(df_filled_mean['A'].mean())
df_filled_mean['B'] = df_filled_mean['B'].fillna(df_filled_mean['B'].mean())
df_filled_mean['C'] = df_filled_mean['C'].fillna(df_filled_mean['C'].mean())
print("\nDataFrame after filling NaN with column mean:\n", df_filled_mean)
# Fill with the median of the column
df_filled_median = df.copy()
df_filled_median['B'] = df_filled_median['B'].fillna(df_filled_median['B'].median())
print("\nDataFrame after filling 'B' with column median:\n", df_filled_median)
# Forward fill (ffill / pad): propagate last valid observation forward to next valid
df_ffill = df.fillna(method='ffill')
print("\nDataFrame after forward fill:\n", df_ffill)
# Backward fill (bfill / backfill): propagate next valid observation backward to previous valid
df_bfill = df.fillna(method='bfill')
print("\nDataFrame after backward fill:\n", df_bfill)
# Fill specific columns with different values/strategies
df_filled_custom = df.copy()
df_filled_custom['A'].fillna(99, inplace=True)
df_filled_custom['B'].fillna(df_filled_custom['B'].mean(), inplace=True)
print("\nDataFrame after custom fills:\n", df_filled_custom)
# Using `limit` with ffill/bfill
df_limited_ffill = df.fillna(method='ffill', limit=1) # Only fill one consecutive NaN
print("\nDataFrame after forward fill with limit=1:\n", df_limited_ffill)
4. Interpolation (interpolate())
The interpolate() method can estimate missing values based on surrounding valid data points, often using various interpolation techniques. This is useful for numerical data and time series.
import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, np.nan, 4, 5, 6, np.nan, 9])
print("Original Series with NaNs:\n", s)
# Default linear interpolation
s_linear_interp = s.interpolate()
print("\nSeries after linear interpolation:\n", s_linear_interp)
# Polynomial interpolation (e.g., order=2)
# s_poly_interp = s.interpolate(method='polynomial', order=2)
# print("\nSeries after polynomial interpolation (order 2):\n", s_poly_interp)
# Time-weighted interpolation (requires datetime index)
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04', '2023-01-05',
'2023-01-07', '2023-01-08', '2023-01-09', '2023-01-11', '2023-01-12'])
s_time = pd.Series([0, 1, np.nan, np.nan, 4, 5, 6, np.nan, 9], index=idx)
s_time_interp = s_time.interpolate(method='time')
print("\nSeries (with datetime index) after time interpolation:\n", s_time_interp)
Further Topics:
- Dealing with different types of missing values (e.g., '?', '--') before converting to NaN.
- Advanced imputation techniques (e.g., using machine learning models for imputation).
- Understanding the implications of different missing data handling strategies.
Mastering missing data handling is a critical step in preparing clean and reliable datasets for analysis and machine learning models.