Scikit-learn: Model Selection and Evaluation

After building a machine learning model, you need to choose among candidate models and estimate their performance reliably. Scikit-learn provides tools for model selection (e.g., hyperparameter tuning, cross-validation) and a wide range of metrics for evaluating different types of models.

1. Cross-Validation

Cross-validation is a technique for assessing how the results of a statistical analysis (e.g., a machine learning model) generalize to an independent data set. It is mainly used in settings where the goal is to predict and one wants to estimate how accurately a predictive model will perform in practice.

K-Fold Cross-Validation:

Splits the dataset into k folds of (approximately) equal size. The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once.

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

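# Instantiate a model (the default lbfgs solver supports multiclass targets;
# max_iter is raised so the solver converges on the unscaled features)
model = LogisticRegression(max_iter=1000)

# Perform 5-fold cross-validation
# For a classifier, an integer cv uses StratifiedKFold (without shuffling)
scores = cross_val_score(model, X, y, cv=5)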
print(f"Cross-validation scores: {scores}")
print(f"Mean CV Accuracy: {scores.mean():.4f}")
print(f"Std Dev of CV Accuracy: {scores.std():.4f}")

# Example with a specific KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(model, X, y, cv=kf)
print(f"\nMean CV Accuracy (with KFold object): {scores_kf.mean():.4f}")
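To make the procedure described above concrete, the following is a minimal sketch of the loop that cross_val_score runs internally: split, fit a fresh copy of the model on k-1 folds, and score on the held-out fold. The variable names (fold_model, fold_scores) are illustrative, and the model settings are assumptions chosen so the example converges on Iris.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fit an unfitted copy of the model on the k-1 training folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # Score on the held-out fold (accuracy is the default for classifiers)
    score = fold_model.score(X[val_idx], y[val_idx])
    fold_scores.append(score)
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation, "
          f"accuracy = {score:.4f}")

fold_scores = np.array(fold_scores)
print(f"Mean accuracy over folds: {fold_scores.mean():.4f}")
```

With the same KFold object, this loop should reproduce the per-fold scores that cross_val_score returns; the helper simply hides the bookkeeping.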

2. Hyperparameter Tuning

Hyperparameters are parameters whose values are set before the learning process begins (e.g., n_neighbors in KNN, C in SVM, max_depth in Decision Trees). Tuning them is crucial for optimal model performance.

Grid Search (`GridSearchCV`):

Exhaustively searches over a specified parameter grid for the best combination of hyperparameters.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define the parameter grid to search
param_grid = {
    'C': [0.1, 1, 10, 100],          # Regularization parameter
    'kernel': ['linear', 'rbf'],     # Kernel type
    'gamma': ['scale', 'auto', 0.1, 1] # Kernel coefficient for 'rbf' (ignored by 'linear')
}

# Create a GridSearchCV object
# estimator: the model to tune
# param_grid: dictionary of hyperparameters to search
# cv: number of cross-validation folds
# scoring: metric to optimize
# verbose: verbosity level (0 = silent; higher values print more detail)
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=3, scoring='accuracy', verbose=0)

# Fit GridSearchCV to the data
grid_search.fit(X, y)

# Print the best parameters and best score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Access the best model
best_svm_model = grid_search.best_estimator_
print(f"Best SVM model: {best_svm_model}")
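Because gamma has no effect on the linear kernel, a single dictionary grid refits identical linear models once per gamma value. One possible refinement, sketched here, is to pass a list of dictionaries so that each kernel is searched only over the parameters it uses. The train/test split is an added assumption: it keeps a held-out set for a final check, since best_score_ is computed on the same folds that were used to pick the parameters.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# One sub-grid per kernel: gamma is only searched for 'rbf'
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
     'gamma': ['scale', 'auto', 0.1, 1]},
]

grid_search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 4 linear candidates + 16 rbf candidates
n_candidates = len(grid_search.cv_results_['params'])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best parameters: {grid_search.best_params_}")

# score() uses the refit best estimator and the 'accuracy' scorer
test_accuracy = grid_search.score(X_test, y_test)
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```

This evaluates 20 candidates instead of the 32 produced by the full Cartesian grid.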

Randomized Search (`RandomizedSearchCV`):

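Randomly samples a fixed number of parameter settings (n_iter) from the specified distributions or lists. It is often more efficient than grid search when the search space is large, because the number of fits is set by n_iter rather than by the size of the grid.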

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from scipy.stats import randint

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Define the parameter distributions to sample from
param_dist = {
    'n_estimators': randint(50, 200),   # Number of trees (randint's upper bound is exclusive)
    'max_depth': randint(3, 10),        # Maximum depth of trees
    'min_samples_leaf': randint(1, 20), # Minimum samples required at a leaf node
    'criterion': ['gini', 'entropy']
}

# Create a RandomizedSearchCV object
# n_iter: number of parameter settings that are sampled
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20, # Number of random combinations to try
                                   cv=3,
                                   scoring='accuracy',
                                   random_state=42,
                                   verbose=0)

# Fit RandomizedSearchCV to the data
random_search.fit(X, y)

# Print the best parameters and best score
print(f"\nBest parameters found (Randomized Search): {random_search.best_params_}")
print(f"Best cross-validation accuracy (Randomized Search): {random_search.best_score_:.4f}")
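Randomized search also accepts continuous distributions, which is useful for hyperparameters that span several orders of magnitude. The sketch below is an illustration rather than part of the example above: it samples the regularization strength C of a logistic regression from a log-uniform distribution. The pipeline, the range 1e-3 to 1e2, and n_iter=10 are assumptions chosen for the demonstration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale features inside the pipeline so scaling is refit on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_dist = {'logisticregression__C': loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(pipeline,
                            param_distributions=param_dist,
                            n_iter=10,
                            cv=3,
                            scoring='accuracy',
                            random_state=42)
search.fit(X, y)

best_C = search.best_params_['logisticregression__C']
print(f"Best C: {best_C:.4g}")
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
```

Sampling on a log scale gives values such as 0.01 and 10 the same chance of being drawn, which a plain uniform distribution over the same range would not.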

3. Evaluation Metrics

Choosing the right evaluation metric is crucial for understanding your model's performance in the context of your problem.

Classification Metrics:

Accuracy: Proportion of all instances that are classified correctly.
Precision: Proportion of predicted positives that are actually positive, TP / (TP + FP), where TP and FP are the counts of true and false positives.
Recall (Sensitivity): Proportion of actual positives that are correctly identified, TP / (TP + FN), where FN is the count of false negatives.
F1-score: Harmonic mean of precision and recall.
Confusion Matrix: A table of counts of actual versus predicted classes.
ROC AUC Score: Area under the Receiver Operating Characteristic curve, computed from predicted scores or probabilities rather than hard labels. Useful for assessing binary classification models.
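As a small worked example of the definitions above, the sketch below derives accuracy, precision, recall, and F1 by hand from the four confusion-matrix counts. The ten labels are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for ten samples (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=2, FN=1, TP=3

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 7 / 10 = 0.70
precision = tp / (tp + fp)                            # 3 / 5  = 0.60
recall = tp / (tp + fn)                               # 3 / 4  = 0.75
f1 = 2 * precision * recall / (precision + recall)    # about 0.667

print(f"Accuracy:  {accuracy:.3f}  (sklearn: {accuracy_score(y_true, y_pred):.3f})")
print(f"Precision: {precision:.3f}  (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"Recall:    {recall:.3f}  (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"F1-score:  {f1:.3f}  (sklearn: {f1_score(y_true, y_pred):.3f})")
```

The hand-computed values agree with the scikit-learn functions, which shows that each metric is just a different ratio of the same four counts.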

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # Probabilities of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc_score(y_test, y_proba):.2f})')
plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Regression Metrics:

Mean Squared Error (MSE): Average of the squared differences between predicted and actual values.
Root Mean Squared Error (RMSE): Square root of MSE.
Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values.
R-squared (Coefficient of Determination): Proportion of the variance in the dependent variable that is predictable from the independent variables.
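The definitions above can be checked by hand on a tiny example. The sketch below computes each metric directly with NumPy; the four target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred                       # [ 1.,  0., -1., -1.]

mse = np.mean(residuals ** 2)                     # (1 + 0 + 1 + 1) / 4 = 0.75
rmse = np.sqrt(mse)                               # about 0.866
mae = np.mean(np.abs(residuals))                  # (1 + 0 + 1 + 1) / 4 = 0.75

ss_res = np.sum(residuals ** 2)                   # residual sum of squares = 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                          # 1 - 3/20 = 0.85

print(f"MSE:  {mse:.3f}  (sklearn: {mean_squared_error(y_true, y_pred):.3f})")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}  (sklearn: {mean_absolute_error(y_true, y_pred):.3f})")
print(f"R^2:  {r2:.3f}  (sklearn: {r2_score(y_true, y_pred):.3f})")
```

RMSE and MAE are in the same units as the target, which usually makes them easier to interpret than MSE; R-squared is unitless and can be negative when a model fits worse than always predicting the mean.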

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\nMSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

Further Topics:

Nested Cross-Validation
Learning Curves
Validation Curves
Different types of cross-validation (e.g., StratifiedKFold, GroupKFold, TimeSeriesSplit)
Model persistence (saving and loading models using joblib or pickle)
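As a pointer for the first topic in this list, here is a minimal sketch of nested cross-validation, assuming a small SVC grid on Iris. An inner loop (GridSearchCV) tunes the hyperparameters, and an outer loop (cross_val_score) estimates how well the whole tuning procedure generalizes. The grid values and fold counts are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner folds choose hyperparameters; outer folds score the tuned model
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1]}
tuned_svc = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Each outer training split runs its own full grid search
nested_scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} "
      f"(+/- {nested_scores.std():.4f})")
```

Because the outer validation folds are never seen during tuning, this estimate is typically less optimistic than the best_score_ reported by a single grid search.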

Model selection and evaluation are iterative processes. A thorough understanding of these techniques is essential for developing reliable and high-performing machine learning solutions.