Pandas: Data Cleaning and Preprocessing

Data cleaning and preprocessing are crucial steps in any data analysis or machine learning workflow. Raw data is often messy, inconsistent, and contains errors. Pandas provides a rich set of tools to address these issues, transforming raw data into a clean and structured format suitable for analysis.

This document will cover common data cleaning and preprocessing tasks, building upon the previous discussions on handling missing data.

1. Removing Duplicates

Duplicate rows or specific combinations of column values can skew analysis. Pandas provides duplicated() to identify and drop_duplicates() to remove them.

import pandas as pd

data = {
    'col1': ['A', 'B', 'A', 'C', 'B'],
    'col2': [1, 2, 1, 3, 2],
    'col3': ['X', 'Y', 'X', 'Z', 'Y']
}
df = pd.DataFrame(data)
print("Original DataFrame with duplicates:\n", df)

# Check for duplicate rows (returns a boolean Series)
print("\nDuplicate rows (full row):\n", df.duplicated())

# Drop duplicate rows (keeps the first occurrence by default)
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after dropping full duplicate rows:\n", df_no_duplicates)

# Drop duplicates based on a subset of columns
df_subset_duplicates = df.drop_duplicates(subset=['col1', 'col2'])
print("\nDataFrame after dropping duplicates based on 'col1' and 'col2':\n", df_subset_duplicates)

# Keep the last occurrence of duplicates
df_last_duplicates = df.drop_duplicates(keep='last')
print("\nDataFrame after dropping duplicates (keeping last):\n", df_last_duplicates)

2. Correcting Data Types

Data might be loaded with incorrect data types (e.g., numbers as strings). astype() is used for type conversion.

import pandas as pd

df = pd.DataFrame({
    'numeric_col': ['1', '2', '3', '4'],
    'date_col': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'mixed_col': [10, 'abc', 20, 'def']
})
print("Original DataFrame dtypes:\n", df.dtypes)

# Convert 'numeric_col' to integer
df['numeric_col'] = df['numeric_col'].astype(int)
print("\nDataFrame dtypes after converting 'numeric_col' to int:\n", df.dtypes)

# Convert 'date_col' to datetime objects
df['date_col'] = pd.to_datetime(df['date_col'])
print("\nDataFrame dtypes after converting 'date_col' to datetime:\n", df.dtypes)
print("Converted date_col values:\n", df['date_col'])

# Using pd.to_numeric for more robust conversion, handling errors
df_str_num = pd.DataFrame({'value': ['10', '20', 'error', '30']})
df_str_num['value_converted'] = pd.to_numeric(df_str_num['value'], errors='coerce') # 'coerce' replaces errors with NaN
print("\nDataFrame with mixed string/numeric values and coerced conversion:\n", df_str_num)

3. Renaming Columns

Clear and descriptive column names are important.

import pandas as pd

df = pd.DataFrame({
    'old_col_name_1': [1, 2],
    'old_col_name_2': [3, 4]
})
print("Original DataFrame column names:\n", df.columns)

# Rename a single column
df_renamed = df.rename(columns={'old_col_name_1': 'new_col_1'})
print("\nDataFrame after renaming one column:\n", df_renamed)

# Rename multiple columns
df_renamed_multi = df.rename(columns={'old_col_name_1': 'new_col_1', 'old_col_name_2': 'new_col_2'})
print("\nDataFrame after renaming multiple columns:\n", df_renamed_multi)

# Using a function to rename columns (e.g. to lowercase)
df_lower_cols = df.copy()
df_lower_cols.columns = df_lower_cols.columns.str.lower()
print("\nDataFrame with lowercase column names:\n", df_lower_cols)

4. Applying Functions

Applying custom functions or built-in Python functions to Series or DataFrames is a powerful way to transform data.

apply(): Can be used on Series or DataFrame (along an axis).
map(): Only for Series, maps values from one Series to another using a dictionary or Series.
applymap(): For DataFrames, applies a function element-wise.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [10, 20, 30, 40],
    'col3': ['apple', 'banana', 'cherry', 'date']
})
print("Original DataFrame:\n", df)

# Apply a function to a Series (column)
df['col1_squared'] = df['col1'].apply(lambda x: x**2)
print("\nDataFrame with 'col1_squared':\n", df)

# Apply a function to multiple columns (axis=1 applies row-wise)
df['sum_cols_1_2'] = df[['col1', 'col2']].apply(np.sum, axis=1)
print("\nDataFrame with 'sum_cols_1_2':\n", df)

# Using map for value replacement
mapping = {'apple': 'fruit', 'banana': 'fruit', 'cherry': 'fruit', 'date': 'fruit'}
df['col3_category'] = df['col3'].map(mapping)
print("\nDataFrame with 'col3_category' using map:\n", df)

# Using applymap (element-wise for DataFrame)
df_numbers = df[['col1', 'col2']]
df_numbers_plus_one = df_numbers.applymap(lambda x: x + 1)
print("\nDataFrame with numbers + 1 using applymap:\n", df_numbers_plus_one)

5. String Manipulation

String data often requires cleaning, such as removing whitespace, changing case, or extracting patterns. Pandas Series with string data have a .str accessor for string methods.

import pandas as pd

df = pd.DataFrame({
    'text_data': ['  Hello World  ', 'PyThon ', '  pandas  lib']
})
print("Original text_data:\n", df['text_data'])

# Remove leading/trailing whitespace
df['text_data_stripped'] = df['text_data'].str.strip()
print("\ntext_data after strip:\n", df['text_data_stripped'])

# Convert to lowercase
df['text_data_lower'] = df['text_data'].str.lower()
print("\ntext_data after lowercase:\n", df['text_data_lower'])

# Replace substrings
df['text_data_replace'] = df['text_data'].str.replace(' ', '_')
print("\ntext_data after replacing spaces with underscores:\n", df['text_data_replace'])

# Check for patterns
df['contains_pandas'] = df['text_data'].str.contains('pandas', case=False)
print("\ntext_data contains 'pandas' (case-insensitive):\n", df['contains_pandas'])

6. Categorical Data

Converting object columns with a limited number of unique values to the 'category' dtype can save memory and improve performance.

import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'New York', 'London'],
    'temperature': [20, 15, 22, 21, 14]
})
print("Original city column dtype:", df['city'].dtype)

# Convert to category dtype
df['city'] = df['city'].astype('category')
print("Converted city column dtype:", df['city'].dtype)
print("Categorical city values:\n", df['city'])

Further Topics:

Binning numerical data (pd.cut, pd.qcut)
One-hot encoding and label encoding for categorical data.
Feature scaling (normalization, standardization) - often done with Scikit-learn after Pandas preprocessing.
Handling outliers.
Time series specific cleaning (e.g., resampling).

Data cleaning and preprocessing is an iterative process. These Pandas tools empower you to tackle a wide range of data quality issues, making your data ready for deeper analysis and model building.