Scikit-learn: Preprocessing and Feature Engineering

Data preprocessing and feature engineering are critical steps in any machine learning workflow. Scikit-learn provides a comprehensive set of tools for transforming raw data into a format suitable for machine learning algorithms, which often perform better with well-prepared features.

1. Scaling Features

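Many machine learning algorithms, particularly distance-based and gradient-based ones such as k-nearest neighbors, support vector machines, and regularized linear models, perform better when numerical input variables share a common scale. Scaling prevents features with larger values from dominating the learning process. Tree-based models are largely insensitive to feature scale.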

Common Scaling Techniques:

Standardization (StandardScaler): Rescales data to have a mean of 0 and a standard deviation of 1 (Z-score normalization).
Normalization (MinMaxScaler): Rescales data to a fixed range, usually 0 to 1.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = {'Feature1': [10, 20, 30, 40, 50],
        'Feature2': [0.1, 0.5, 0.9, 0.2, 0.7]}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# StandardScaler
scaler_standard = StandardScaler()
df_scaled_standard = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)
print("\nDataFrame after StandardScaler:\n", df_scaled_standard)
print(f"Mean of Feature1 (StandardScaler): {df_scaled_standard['Feature1'].mean():.2f}")
# StandardScaler divides by the population std (ddof=0); pandas .std() defaults to ddof=1
print(f"Std Dev of Feature1 (StandardScaler): {df_scaled_standard['Feature1'].std(ddof=0):.2f}")

# MinMaxScaler
scaler_minmax = MinMaxScaler()
df_scaled_minmax = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)
print("\nDataFrame after MinMaxScaler:\n", df_scaled_minmax)
print(f"Min of Feature1 (MinMaxScaler): {df_scaled_minmax['Feature1'].min():.2f}")
print(f"Max of Feature1 (MinMaxScaler): {df_scaled_minmax['Feature1'].max():.2f}")

# Visualizing the effect of scaling (example for one feature)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Feature1'], kde=True)
plt.title('Original Feature1 Distribution')

plt.subplot(1, 2, 2)
sns.histplot(df_scaled_standard['Feature1'], kde=True)
plt.title('Standard Scaled Feature1 Distribution')
plt.tight_layout()
plt.show()
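
To make the two definitions concrete, the following minimal sketch recomputes both scalings by hand with NumPy and compares them with the scikit-learn output. It reuses the Feature1 values from the example; the variable names (x, z_manual, mm_manual) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Feature1 values from the example, as a single-column 2D array
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Z-score by hand: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max by hand: map the column minimum to 0 and the maximum to 1
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Z-score matches StandardScaler:", np.allclose(z_manual, z_sklearn))
print("Min-max matches MinMaxScaler:", np.allclose(mm_manual, mm_sklearn))
```

Both comparisons should agree, which shows that the scalers only store a few per-column statistics (mean and standard deviation, or minimum and maximum) learned during fit.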

2. Encoding Categorical Features

Machine learning models typically require numerical input. Categorical features (e.g., 'red', 'green', 'blue') need to be converted into numerical representations.

Common Encoding Techniques:

One-Hot Encoding (OneHotEncoder): Creates new binary columns for each category. For example, 'color' with values 'red', 'green', 'blue' becomes three new columns: 'color_red', 'color_green', 'color_blue', with 0s and 1s.
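Label Encoding (LabelEncoder): Assigns a unique integer to each category, in sorted order of the class labels. It is intended for target variables; for ordinal input features, OrdinalEncoder with an explicit category order is the better fit.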

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Sample data
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green'],
    'Size': ['S', 'M', 'L', 'S', 'M'],
    'Target_Class': ['Yes', 'No', 'Yes', 'No', 'Yes'] # Target variable for LabelEncoder
})
print("Original DataFrame:\n", df)

# One-Hot Encoding (for 'Color' and 'Size' features)
# sparse_output=False returns a dense array (requires scikit-learn >= 1.2; older versions use sparse=False)
encoder_ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe_features = encoder_ohe.fit_transform(df[['Color', 'Size']])
df_ohe = pd.DataFrame(ohe_features, columns=encoder_ohe.get_feature_names_out(['Color', 'Size']))
print("\nDataFrame after One-Hot Encoding:\n", df_ohe)

# Label Encoding (for 'Target_Class' target variable)
encoder_label = LabelEncoder()
df['Target_Class_Encoded'] = encoder_label.fit_transform(df['Target_Class'])
print("\nDataFrame after Label Encoding 'Target_Class':\n", df[['Target_Class', 'Target_Class_Encoded']])
print("Original classes:", encoder_label.classes_)
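
For a feature whose categories have a natural order, such as Size, integer codes are only meaningful if they follow that order. The sketch below, which assumes the ordering S < M < L purely for illustration, contrasts OrdinalEncoder (explicit order) with LabelEncoder (alphabetical order of the labels).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({'Size': ['S', 'M', 'L', 'S', 'M']})

# Explicit category order (assumed for illustration): S < M < L
encoder_ordinal = OrdinalEncoder(categories=[['S', 'M', 'L']])
sizes['Size_Ordinal'] = encoder_ordinal.fit_transform(sizes[['Size']]).ravel()

# LabelEncoder sorts the labels, so 'L' < 'M' < 'S' and the size order is reversed
label_codes = LabelEncoder().fit_transform(sizes['Size'])
sizes['Size_Label'] = label_codes

print(sizes)
```

With OrdinalEncoder the codes increase with size, whereas the LabelEncoder codes depend only on how the strings sort.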

3. Imputing Missing Values

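As discussed in Pandas, handling missing data is crucial. Scikit-learn provides SimpleImputer, which learns a fill value for each column (mean, median, most frequent value, or a constant) during fit and applies that same value to any data it later transforms.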

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {'Feature1': [1, 2, np.nan, 4, 5],
        'Feature2': [10, np.nan, 30, 40, 50],
        'Feature3': ['A', 'B', 'A', np.nan, 'C']}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Impute numerical features with the column mean (one imputer, fitted once on both columns)
imputer_mean = SimpleImputer(strategy='mean')
num_imputed = imputer_mean.fit_transform(df[['Feature1', 'Feature2']])
df['Feature1_imputed'] = num_imputed[:, 0]
df['Feature2_imputed'] = num_imputed[:, 1]

# Impute the categorical feature with its most frequent value
imputer_mf = SimpleImputer(strategy='most_frequent')
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it to one column
df['Feature3_imputed'] = imputer_mf.fit_transform(df[['Feature3']]).ravel()

print("\nDataFrame after imputation:\n", df)
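
In practice the imputer is usually fitted on training data and then reused on new data, so that the fill value does not depend on the data being predicted. A minimal sketch of that pattern follows; the train and new DataFrames are made-up illustration data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data with the same column
train = pd.DataFrame({'Feature1': [1.0, 2.0, np.nan, 4.0, 5.0]})
new = pd.DataFrame({'Feature1': [np.nan, 10.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)  # learns the mean of the observed training values

# statistics_ holds the learned fill value for each column
print("Learned fill values:", imputer.statistics_)

# The missing value in the new data is filled with the training mean
new_filled = imputer.transform(new)
print(new_filled)
```

The fill value comes from the four observed training values only; the new data does not influence it.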

4. Dimensionality Reduction

Reducing the number of features can help combat the curse of dimensionality, reduce noise, and speed up training.

Principal Component Analysis (PCA):

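Projects high-dimensional data onto the orthogonal directions of greatest variance, so that a small number of components retains most of the variance. PCA is sensitive to feature scale, so standardize the features first when they are measured in different units.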

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris # A classic dataset

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a PCA model to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_pca.shape}")

# Visualize the 2D projected data
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.title('PCA of Iris Dataset (2 Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, ticks=np.unique(y), label='Species')
plt.show()
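
How much variance the projection retains can be checked rather than assumed. The sketch below reads the explained_variance_ratio_ attribute of a fitted PCA and also shows the option of passing a float to n_components, which asks scikit-learn to keep as many components as are needed to reach that fraction of variance. The 0.95 threshold is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fraction of the total variance captured by each of the two components
pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print("Explained variance ratio per component:", ratios)
print(f"Total variance retained: {ratios.sum():.3f}")

# A float in (0, 1) selects the smallest number of components reaching that fraction
pca_95 = PCA(n_components=0.95).fit(X)
print("Components needed for 95% of the variance:", pca_95.n_components_)
```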

5. Feature Construction/Generation

Creating new features from existing ones can improve model performance, for example by letting a linear model capture non-linear relationships or interactions between features.

Polynomial Features (PolynomialFeatures): Generates polynomial and interaction features (e.g., given features A and B, it creates A^2, B^2, A*B).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Sample data
X = np.array([[1, 2], [3, 4]])
df_poly = pd.DataFrame(X, columns=['Feature_A', 'Feature_B'])
print("Original data:\n", df_poly)

# Generate polynomial features up to degree 2 (A, B, A^2, A*B, B^2); include_bias=False omits the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
df_poly_features = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(['Feature_A', 'Feature_B']))
print("\nPolynomial Features (degree 2):\n", df_poly_features)

6. Pipelines

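Scikit-learn Pipeline objects allow you to chain multiple preprocessing steps and a final estimator into a single object. This simplifies code and helps prevent data leakage: calling fit on the pipeline fits every transformer only on the data passed in, so a scaler, for example, never learns its statistics from test data. It also makes the workflow more robust, because the same steps are applied in the same order at training and prediction time.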

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create a pipeline
# Steps: Polynomial Features -> Scaling -> Linear Regression
pipeline = Pipeline([
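    # include_bias=False: LinearRegression fits its own intercept, so a constant column is redundant
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),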
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"\nPipeline MSE: {mse:.2f}")
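
The leakage benefit is clearest under cross-validation, where the whole pipeline is refitted on the training part of every fold. The following sketch, which assumes 5 folds and the same synthetic data as above, passes the pipeline to cross_val_score.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])

# For each fold, the transformers and the regressor are fitted on the training
# part only and then applied to the held-out part
scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_squared_error')

# scikit-learn reports negated MSE so that higher is better; flip the sign back
mse_per_fold = -scores
print("MSE per fold:", mse_per_fold)
print(f"Mean cross-validated MSE: {mse_per_fold.mean():.2f}")
```

Scaling the full dataset before cross-validation would instead let statistics from the held-out folds influence the training folds.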

Further Topics:

ColumnTransformer for applying different transformations to different columns.
Feature selection methods (e.g., SelectKBest, RFE).
Text feature extraction (e.g., CountVectorizer, TfidfVectorizer).
Time series feature engineering.

Mastering data preprocessing and feature engineering with Scikit-learn is essential for building high-performing and robust machine learning models.