Scikit-learn: Regression Algorithms

Regression is a supervised machine learning task where the goal is to predict a continuous numerical value. Scikit-learn offers a wide array of regression algorithms, suited to different types of data and problem complexity. This document introduces some common regression algorithms and shows how to apply them with Scikit-learn.

1. Linear Regression

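Linear Regression is one of the simplest and most fundamental regression algorithms. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. Scikit-learn's LinearRegression fits that equation by ordinary least squares: it chooses the coefficients and intercept that minimize the sum of squared differences between the observed and predicted targets.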

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression # For synthetic data

# 1. Generate a synthetic regression dataset
#    (make_regression already returns a 1D y for a single target, so no reshaping is needed)
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create a Linear Regression model
model = LinearRegression()

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression MSE: {mse:.2f}")
print(f"Linear Regression R2 Score: {r2:.2f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# 7. Visualize the results
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual values')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted line')
plt.title('Linear Regression')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
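
To make the least-squares idea concrete, the following sketch refits the same synthetic data with NumPy's generic least-squares solver and compares the result with LinearRegression. The names X_design, weights, slope, and intercept are introduced here for illustration only; the point is simply that both routes should arrive at the same line.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example above
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

model = LinearRegression().fit(X, y)

# Solve the same least-squares problem directly: append a column of ones
# so that the last fitted weight plays the role of the intercept.
X_design = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
slope, intercept = weights

print(f"scikit-learn: slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
print(f"lstsq:        slope={slope:.4f}, intercept={intercept:.4f}")
```

The two printed lines should agree up to floating-point rounding.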

2. Ridge and Lasso Regression (Regularized Linear Models)

Ridge and Lasso Regression are linear regression techniques that add a regularization penalty on the coefficients to reduce overfitting and improve generalization.

Ridge Regression (L2 regularization): Adds a penalty proportional to the sum of the squared coefficients, scaled by alpha. This shrinks coefficients towards zero but generally does not set them exactly to zero.
Lasso Regression (L1 regularization): Adds a penalty proportional to the sum of the absolute values of the coefficients, scaled by alpha. This can produce sparse models in which some coefficients are exactly zero, effectively performing feature selection.
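
As a rough illustration of what the L2 penalty does to the fit, the sketch below compares Ridge with the closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2. It assumes fit_intercept=False so that no centering step is involved; w_closed, l2_penalty, and l1_penalty are names chosen for this example. Lasso has no comparable closed form, so only its penalty term is evaluated here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=15, random_state=42)
alpha = 1.0

# fit_intercept=False keeps the algebra simple (assumption made for this sketch)
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

# Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2
n_features = X.shape[1]
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
print("Max abs difference from Ridge:", np.max(np.abs(ridge.coef_ - w_closed)))

# The two penalty terms, evaluated on the fitted coefficients
l2_penalty = alpha * np.sum(ridge.coef_ ** 2)     # Ridge-style penalty
l1_penalty = alpha * np.sum(np.abs(ridge.coef_))  # Lasso-style penalty
print(f"L2 penalty: {l2_penalty:.2f}, L1 penalty: {l1_penalty:.2f}")
```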

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Ridge Regression
ridge_model = Ridge(alpha=1.0) # alpha is the regularization strength
ridge_model.fit(X_train, y_train)
ridge_pred = ridge_model.predict(X_test)
print(f"Ridge Regression MSE: {mean_squared_error(y_test, ridge_pred):.2f}")
print(f"Ridge Coefficients (first 5): {ridge_model.coef_[:5]}")

# Lasso Regression
lasso_model = Lasso(alpha=1.0) # alpha is the regularization strength
lasso_model.fit(X_train, y_train)
lasso_pred = lasso_model.predict(X_test)
print(f"Lasso Regression MSE: {mean_squared_error(y_test, lasso_pred):.2f}")
print(f"Lasso Coefficients (first 5): {lasso_model.coef_[:5]} (some might be zero)")
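
The difference between shrinking and zeroing coefficients is easiest to see by varying alpha. The sketch below counts exactly-zero coefficients for both models over a small grid; the alpha values and the ridge_zeros and lasso_zeros dictionaries are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=15, random_state=42)

ridge_zeros = {}
lasso_zeros = {}
for alpha in [0.1, 1.0, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # Count coefficients that are exactly zero
    ridge_zeros[alpha] = int(np.sum(ridge.coef_ == 0))
    lasso_zeros[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha:>7}: Ridge zeros={ridge_zeros[alpha]}, Lasso zeros={lasso_zeros[alpha]}")
```

Ridge coefficients get smaller as alpha grows but stay nonzero, while a sufficiently large alpha drives every Lasso coefficient to zero.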

3. K-Nearest Neighbors (KNN) Regressor

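K-Nearest Neighbors, often introduced as a classifier, can also be used for regression. The predicted value for a new data point is the average (or weighted average) of the target values of its K nearest neighbors in the training set. In KNeighborsRegressor, the weights parameter selects between a plain average ("uniform", the default) and an inverse-distance-weighted one ("distance").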

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn_regressor = KNeighborsRegressor(n_neighbors=5) # n_neighbors is K
knn_regressor.fit(X_train, y_train)
knn_pred = knn_regressor.predict(X_test)

print(f"KNN Regressor MSE: {mean_squared_error(y_test, knn_pred):.2f}")
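
The averaging step can be reproduced by hand. This sketch looks up the neighbors of a single query point with kneighbors and averages their targets, first uniformly and then with inverse-distance weights. The query value 0.5 is arbitrary, and manual_prediction and manual_weighted are names made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

query = np.array([[0.5]])  # arbitrary query point, chosen for illustration

# kneighbors returns the distances to, and indices of, the 5 closest training points
distances, indices = knn.kneighbors(query)
manual_prediction = y[indices[0]].mean()
print(f"predict():           {knn.predict(query)[0]:.4f}")
print(f"Mean of 5 neighbors: {manual_prediction:.4f}")

# weights="distance": closer neighbors count more (weights are 1 / distance)
knn_weighted = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
inverse_distance = 1.0 / distances[0]
manual_weighted = np.sum(inverse_distance * y[indices[0]]) / np.sum(inverse_distance)
print(f"Weighted predict():  {knn_weighted.predict(query)[0]:.4f}")
print(f"Manual weighted:     {manual_weighted:.4f}")
```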

4. Decision Tree Regressor

Decision trees can also be adapted for regression tasks. Instead of predicting a class label, they predict a numerical value. With the default squared-error criterion, the predicted value for a leaf node is the mean of the target values of the training samples that fall into that leaf, so the model's output is piecewise constant.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree_regressor = DecisionTreeRegressor(max_depth=5, random_state=0)
tree_regressor.fit(X_train, y_train)
tree_pred = tree_regressor.predict(X_test)

print(f"Decision Tree Regressor MSE: {mean_squared_error(y_test, tree_pred):.2f}")
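
To see the leaf-averaging behavior directly, the sketch below groups the training samples by the leaf they land in and compares each leaf's mean target with the tree's prediction. It uses a shallower tree (max_depth=3, an arbitrary choice) to keep the output short, and it relies on the default squared-error criterion.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# apply() returns the index of the leaf each sample ends up in
leaf_ids = tree.apply(X)
predictions = tree.predict(X)

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean = y[in_leaf].mean()
    print(f"leaf {leaf:2d}: {in_leaf.sum():3d} samples, "
          f"mean target={leaf_mean:9.2f}, prediction={predictions[in_leaf][0]:9.2f}")
```

Every sample in a leaf receives the same prediction, which is why a depth-3 tree can output at most 8 distinct values.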

5. Random Forest Regressor

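Like its classification counterpart, the Random Forest Regressor builds an ensemble of decision trees, each trained by default on a bootstrap sample of the training data. For regression, the final prediction is the average of the predictions from all individual trees. Because averaging reduces variance, a forest usually generalizes better than a single decision tree, at the cost of interpretability and training time.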

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest_regressor = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
forest_regressor.fit(X_train, y_train)
forest_pred = forest_regressor.predict(X_test)

print(f"Random Forest Regressor MSE: {mean_squared_error(y_test, forest_pred):.2f}")
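
The averaging can be checked against the fitted trees, which scikit-learn exposes through the estimators_ attribute. In this sketch, per_tree and manual_average are illustrative names; the spread across trees is shown only as a rough indication of how much the individual trees disagree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X_train, y_train)

# Stack each tree's predictions into an array of shape (n_trees, n_test_samples)
per_tree = np.array([tree.predict(X_test) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Forest predict() (first 3):   ", forest.predict(X_test)[:3])
print("Average over trees (first 3): ", manual_average[:3])
print("Std across trees (first 3):   ", per_tree.std(axis=0)[:3])
```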

Further Topics:

Gradient Boosting Regressors (e.g., Scikit-learn's GradientBoostingRegressor, or the separate XGBoost and LightGBM libraries)
Support Vector Regression (SVR)
Polynomial Regression
Evaluation metrics for regression (R-squared, MAE, RMSE)
Hyperparameter tuning for regression models
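
As a starting point for the last two topics, here is a minimal sketch that tunes Ridge's alpha with a cross-validated grid search and then reports MAE, RMSE, and R-squared on the held-out test set. The alpha grid, the 5-fold setting, and the scoring choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold cross-validated search over an illustrative grid of alpha values
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(f"Best alpha: {search.best_params_['alpha']}")

# Evaluate the refitted best model on the test set
y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.2f}")
```

MAE and RMSE are in the units of the target, which makes them easier to interpret than MSE; RMSE penalizes large errors more heavily than MAE.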

This document provides an overview of several key regression algorithms available in Scikit-learn, demonstrating their basic usage. Choosing the right algorithm often depends on the nature of your data, the complexity of the underlying relationship, and the need for interpretability.