Scikit-learn: Classification Algorithms

Classification is a supervised machine learning task where the goal is to predict a categorical label (class) for a given input. Scikit-learn provides a wide range of classification algorithms, each with its strengths and weaknesses. This document will introduce some common classification algorithms and their usage in Scikit-learn.

1. Logistic Regression

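Despite its name, Logistic Regression is a classification algorithm, most often used for binary problems. It models the probability that an input belongs to a class by passing a linear combination of the features through the logistic (sigmoid) function. It also extends to multi-class classification, either with a "one-vs-rest" strategy or with a multinomial (softmax) formulation, depending on the solver.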

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import make_classification # For synthetic data

# 1. Generate synthetic dataset for binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create a Logistic Regression model
#    'solver' and 'penalty' are important hyperparameters
model = LogisticRegression(solver='liblinear', random_state=42)

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test) # Probability estimates

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Example of predicting probability for a single sample
print("\nProbability for first test sample (Class 0, Class 1):", y_pred_proba[0])
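
To make the "models the probability" statement concrete, the following sketch recomputes the class-1 probability by hand from the fitted coefficients and compares it with predict_proba. It assumes the same synthetic data and solver as the example above; the names clf, scores, and manual_proba are illustrative only and not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same synthetic binary dataset as in the example above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)
clf = LogisticRegression(solver='liblinear', random_state=42).fit(X, y)

# Linear score w.x + b for every sample
scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps each score to an estimated P(class 1)
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Column 1 of predict_proba holds the class-1 probability
sklearn_proba = clf.predict_proba(X)[:, 1]
print("Largest absolute difference:", np.max(np.abs(manual_proba - sklearn_proba)))
```

The two arrays should agree up to floating-point error, which suggests that for a binary problem predict_proba is the sigmoid applied to the linear decision score.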

2. K-Nearest Neighbors (KNN) Classifier

KNN is a non-parametric, lazy learning algorithm used for both classification and regression: it stores the training data and defers all computation to prediction time. In classification, a sample is assigned to the class most common among its K nearest neighbors. Because the vote depends on distances, features measured on very different scales can dominate the result.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris # A classic dataset

# 1. Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create a KNN Classifier model
#    n_neighbors is a key hyperparameter (K)
model = KNeighborsClassifier(n_neighbors=5)

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classifier Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
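
Since K is described above as the key hyperparameter, one common way to choose it is cross-validation. The sketch below is one possible approach, not the only one: it scales the features with StandardScaler inside a pipeline and compares a small, arbitrarily chosen set of K values on the Iris data. The candidate list and the names cv_scores and best_k are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate K
cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"K={k:2d}  mean CV accuracy: {cv_scores[k]:.4f}")

# Pick the candidate with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("Best K among the candidates:", best_k)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.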

3. Support Vector Machine (SVM) Classifier

SVMs are versatile machine learning models capable of linear or non-linear classification, regression, and outlier detection. For classification, they find the hyperplane that separates the classes with the largest margin; kernels such as RBF allow that boundary to be non-linear in the original feature space.

import numpy as np
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer # A binary classification dataset

# 1. Load Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create an SVM Classifier model
#    'kernel' (e.g., 'linear', 'rbf', 'poly') and 'C' are key hyperparameters
model = SVC(kernel='linear') # Linear kernel for simplicity; random_state only matters when probability=True

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Classifier Accuracy (Linear Kernel): {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=cancer.target_names))
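
The text above mentions non-linear classification, so here is a minimal sketch of an RBF-kernel SVM on the same dataset. It assumes feature scaling with StandardScaler, which is a common companion to SVMs but not part of the original example; the values C=1.0 and gamma='scale' are the library defaults, written out explicitly, and svm_pipe, rbf_accuracy, and n_sv are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features, then fit an SVM with an RBF (Gaussian) kernel
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_pipe.fit(X_train, y_train)

rbf_accuracy = svm_pipe.score(X_test, y_test)
print(f"SVM Classifier Accuracy (RBF Kernel, scaled): {rbf_accuracy:.4f}")

# Number of support vectors per class that define the decision boundary
n_sv = svm_pipe.named_steps['svc'].n_support_
print("Support vectors per class:", n_sv)
```

Only the support vectors influence the fitted boundary, so their count gives a rough sense of how complex the model is.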

4. Decision Tree Classifier

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_wine # A multi-class classification dataset

# 1. Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create a Decision Tree Classifier model
#    'max_depth', 'min_samples_leaf', 'criterion' are important hyperparameters
model = DecisionTreeClassifier(max_depth=5, random_state=42)

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Classifier Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=wine.target_names))

# Optional: Visualize the decision tree (plot_tree uses matplotlib; graphviz is not required)
# import matplotlib.pyplot as plt
# from sklearn.tree import plot_tree
# plt.figure(figsize=(20, 10))
# plot_tree(model, filled=True, feature_names=wine.feature_names, class_names=list(wine.target_names))
# plt.title("Decision Tree Visualization")
# plt.show()
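
The "simple decision rules" mentioned above can also be inspected as plain text with export_text, which needs no plotting library. This is a small sketch: it fits a deliberately shallow tree (max_depth=2, an assumption chosen to keep the output short) on the full Wine dataset, and the names tree and rules are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()

# A shallow tree keeps the printed rule set small and readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(wine.data, wine.target)

# Render the learned splits as indented if/else style rules
rules = export_text(tree, feature_names=list(wine.feature_names))
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf line reports the predicted class.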

5. Random Forest Classifier

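Random Forest is an ensemble learning method for classification and regression. It trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and combines their outputs: a predicted class for classification, or the mean prediction for regression. Scikit-learn's RandomForestClassifier combines the trees by averaging their predicted class probabilities rather than counting one vote per tree. Averaging many decorrelated trees usually reduces overfitting compared with a single decision tree.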

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_digits # A multi-class classification dataset

# 1. Load Digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create a Random Forest Classifier model
#    'n_estimators', 'max_depth' are important hyperparameters
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=[str(i) for i in range(10)]))
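
To illustrate how the individual trees are combined, the sketch below averages the per-tree class probabilities by hand and compares the result with the forest's own output. It assumes a smaller forest (n_estimators=25) than the example above to keep it quick; forest, sample, avg_proba, and manual_pred are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, max_depth=10, random_state=42)
forest.fit(X, y)

sample = X[:5]

# Class probabilities from every tree: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(sample) for t in forest.estimators_])

# Average over trees, then take the most probable class for each sample
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print("Matches forest.predict_proba:", np.allclose(avg_proba, forest.predict_proba(sample)))
print("Manual predictions:", manual_pred)
print("Forest predictions:", forest.predict(sample))
```

The agreement between the two reflects scikit-learn's choice to average probabilistic predictions across trees instead of taking a hard majority vote.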

Further Topics:

Gradient Boosting Classifiers (e.g., GradientBoostingClassifier, XGBoost, LightGBM)
Naive Bayes Classifiers
Ensemble methods (Bagging, Boosting, Stacking)
Evaluation metrics for classification (Precision, Recall, F1-score, ROC-AUC)
Handling imbalanced datasets

This document provides an overview of several key classification algorithms available in Scikit-learn, demonstrating their basic usage. Understanding these algorithms and their appropriate application is fundamental to supervised machine learning.