Scikit-learn: Model Selection and Evaluation
Once you've built a machine learning model, it's crucial to select the best model for your task and evaluate its performance reliably. Scikit-learn provides extensive tools for model selection (e.g., hyperparameter tuning, cross-validation) and a wide array of metrics for evaluating different types of models.
1. Cross-Validation
Cross-validation is a technique for assessing how the results of a statistical analysis (e.g., a machine learning model) generalize to an independent data set. It is mainly used in settings where the goal is to predict and one wants to estimate how accurately a predictive model will perform in practice.
K-Fold Cross-Validation:
Splits the dataset into k folds of (nearly) equal size. The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once, and the k scores are then summarized (typically by their mean and standard deviation).
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Instantiate a model
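# The default lbfgs solver handles the three Iris classes directly;
# max_iter is raised to avoid convergence warnings
model = LogisticRegression(max_iter=1000, random_state=42)
# Perform 5-fold cross-validation
# cv=5 means 5 folds; for a classifier, an integer cv uses StratifiedKFold (no shuffling)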
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean CV Accuracy: {scores.mean():.4f}")
print(f"Std Dev of CV Accuracy: {scores.std():.4f}")
# Example with a specific KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(model, X, y, cv=kf)
print(f"\nMean CV Accuracy (with KFold object): {scores_kf.mean():.4f}")
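To make the fold mechanics described above concrete, here is a minimal sketch that iterates over the splits by hand. The toy array X_demo and the names kf_demo and validation_indices are illustrative assumptions, not part of the example above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each (illustrative only)
X_demo = np.arange(20).reshape(10, 2)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_demo)):
    # Each fold trains on 8 samples and validates on the remaining 2
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
    validation_indices.extend(val_idx.tolist())

# Every sample index appears in a validation fold exactly once
covered_once = sorted(validation_indices) == list(range(10))
print(f"Each sample validated exactly once: {covered_once}")
```

cross_val_score performs essentially this loop internally, fitting a fresh clone of the model on each training split and scoring it on the matching validation split.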
2. Hyperparameter Tuning
Hyperparameters are parameters whose values are set before the learning process begins (e.g., n_neighbors in KNN, C in SVM, max_depth in Decision Trees). Tuning them is crucial for optimal model performance.
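As a rough illustration of the distinction, the sketch below reads hyperparameters back with get_params() before any fitting and then inspects a learned attribute after fit(). The specific values (n_neighbors=3, C=10.0, max_depth=2) are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters are constructor arguments, available before any training
knn = KNeighborsClassifier(n_neighbors=3)
svm = SVC(C=10.0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)

knn_k = knn.get_params()["n_neighbors"]
svm_c = svm.get_params()["C"]
tree_depth_limit = tree.get_params()["max_depth"]
print(knn_k, svm_c, tree_depth_limit)

# Learned attributes (trailing underscore) only exist after fit()
had_tree_before_fit = hasattr(tree, "tree_")
tree.fit(X, y)
learned_depth = tree.get_depth()
print(f"Fitted before fit(): {had_tree_before_fit}, depth after fit(): {learned_depth}")
```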
Grid Search (GridSearchCV):
Exhaustively searches over a specified parameter grid for the best combination of hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Define the parameter grid to search
param_grid = {
    'C': [0.1, 1, 10, 100],              # Regularization parameter
    'kernel': ['linear', 'rbf'],         # Kernel type
    'gamma': ['scale', 'auto', 0.1, 1]   # Kernel coefficient for 'rbf' (ignored by 'linear')
}
# Create a GridSearchCV object
# estimator: the model to tune
# param_grid: dictionary of hyperparameters to search
# cv: number of cross-validation folds
# scoring: metric to optimize
# verbose: level of verbosity (0=silent, 1=message, 2=all)
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=3, scoring='accuracy', verbose=0)
# Fit GridSearchCV to the data
grid_search.fit(X, y)
# Print the best parameters and best score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")
# Access the best model
best_svm_model = grid_search.best_estimator_
print(f"Best SVM model: {best_svm_model}")
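Because gamma has no effect on the linear kernel, a single flat grid evaluates some redundant combinations. One possible way to avoid that, sketched here under the assumption that the same candidate values are wanted, is to pass a list of grids (one per kernel). ParameterGrid is used only to count the candidates; the names flat_grid and split_grid are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Single grid: every kernel is paired with every gamma value
flat_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto", 0.1, 1],
}

# One grid per kernel: gamma is only searched where it matters
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", "auto", 0.1, 1]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(f"Flat grid candidates: {n_flat}")    # 4 * 2 * 4 = 32
print(f"Split grid candidates: {n_split}")  # 4 + 4 * 4 = 20
```

GridSearchCV accepts such a list of dictionaries as its param_grid. With cv=3, the split version would fit 60 models instead of 96, plus the final refit.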
Randomized Search (RandomizedSearchCV):
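Randomly samples a fixed number (n_iter) of parameter settings from the specified distributions or lists. Because the cost is set by n_iter rather than by the size of the grid, it is usually cheaper than Grid Search when there are many hyperparameters or when some of them span wide or continuous ranges.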
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from scipy.stats import randint
# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Define the parameter distributions to sample from
param_dist = {
    'n_estimators': randint(50, 200),     # Number of trees
    'max_depth': randint(3, 10),          # Maximum depth of trees
    'min_samples_leaf': randint(1, 20),   # Minimum samples required at a leaf node
    'criterion': ['gini', 'entropy']
}
# Create a RandomizedSearchCV object
# n_iter: number of parameter settings that are sampled
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # Number of random combinations to try
    cv=3,
    scoring='accuracy',
    random_state=42,
    verbose=0,
)
# Fit RandomizedSearchCV to the data
random_search.fit(X, y)
# Print the best parameters and best score
print(f"\nBest parameters found (Randomized Search): {random_search.best_params_}")
print(f"Best cross-validation accuracy (Randomized Search): {random_search.best_score_:.4f}")
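To see what "sampling from distributions" means in practice, the following sketch draws a few candidate settings with ParameterSampler, without fitting any model. The reduced dictionary param_dist_demo and the choice of five draws are assumptions made for brevity.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# randint(low, high) draws integers from low up to high - 1
param_dist_demo = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 10),
    "criterion": ["gini", "entropy"],  # lists are sampled uniformly
}

# Draw 5 candidate settings; random_state makes the draws reproducible
samples = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))
for candidate in samples:
    print(candidate)
```

Each printed dictionary is one candidate of the kind RandomizedSearchCV would evaluate with cross-validation.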
3. Evaluation Metrics
Choosing the right evaluation metric is crucial for understanding your model's performance in the context of your problem.
Classification Metrics:
- Accuracy: Proportion of correctly classified instances.
- Precision: Proportion of positive identifications that were actually correct.
- Recall (Sensitivity): Proportion of actual positives that were correctly identified.
- F1-score: Harmonic mean of precision and recall.
- Confusion Matrix: A table of actual versus predicted class counts (true negatives, false positives, false negatives, and true positives in the binary case).
- ROC AUC Score: Area Under the Receiver Operating Characteristic curve. Useful for assessing binary classification models.
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve, classification_report,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # Probabilities of the positive class
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc_score(y_test, y_proba):.2f})')
plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
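The definitions of the classification metrics listed above can be checked by hand on a small example. The ten labels below are made up for illustration; the sketch derives each metric from the confusion-matrix counts and compares it with the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,
)

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 8 / 10
precision = tp / (tp + fp)                            # 3 / 4
recall = tp / (tp + fn)                               # 3 / 4
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")

# The hand-computed values agree with scikit-learn's implementations
matches_sklearn = bool(
    np.isclose(accuracy, accuracy_score(y_true, y_hat))
    and np.isclose(precision, precision_score(y_true, y_hat))
    and np.isclose(recall, recall_score(y_true, y_hat))
    and np.isclose(f1, f1_score(y_true, y_hat))
)
print(f"Matches scikit-learn: {matches_sklearn}")
```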
Regression Metrics:
- Mean Squared Error (MSE): Average of the squared differences between predicted and actual values.
- Root Mean Squared Error (RMSE): Square root of MSE.
- Mean Absolute Error (MAE): Average of the absolute differences.
- R-squared (Coefficient of Determination): Proportion of the variance in the dependent variable that is predictable from the independent variables.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("\nMSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
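The regression metrics listed above follow directly from the residuals. As a small worked example (the four values are invented for illustration), the sketch below computes each metric with NumPy and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat                         # [1, 0, -1, -2]
mse = np.mean(residuals ** 2)                      # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)
mae = np.mean(np.abs(residuals))                   # (1 + 0 + 1 + 2) / 4

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)                    # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R-squared={r2:.3f}")

# The hand-computed values agree with scikit-learn's implementations
matches_sklearn = bool(
    np.isclose(mse, mean_squared_error(y_true, y_hat))
    and np.isclose(mae, mean_absolute_error(y_true, y_hat))
    and np.isclose(r2, r2_score(y_true, y_hat))
)
print(f"Matches scikit-learn: {matches_sklearn}")
```

Note that R-squared can be negative on held-out data when the model predicts worse than simply using the mean of the targets.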
Further Topics:
- Nested Cross-Validation
- Learning Curves
- Validation Curves
- Different types of cross-validation (e.g., StratifiedKFold, GroupKFold, TimeSeriesSplit)
- Model persistence (saving and loading models using joblib or pickle)
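As a pointer for the first topic in this list, nested cross-validation can be sketched by passing a GridSearchCV object to cross_val_score, so that tuning happens inside each outer training split. The grid, the fold counts, and the variable names below are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop selects hyperparameters; outer loop estimates generalization
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Small illustrative grid over C only
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Each outer fold runs a full grid search on its own training portion
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} "
      f"(+/- {nested_scores.std():.4f})")
```

Unlike best_score_ from a single grid search, the outer scores come from data that played no part in choosing the hyperparameters, so they tend to give a less optimistic estimate.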
Model selection and evaluation are iterative processes. A thorough understanding of these techniques is essential for developing reliable and high-performing machine learning solutions.