Phase 2: Core Machine Learning

Core Machine Learning covers the classical (non-deep-learning) algorithms that learn patterns from data. Compared with Deep Learning models, these models are generally faster to train, often easier to interpret, and tend to perform very well on structured tabular data (such as database tables and spreadsheets).

1. Supervised Learning (Regression)

Predicting a continuous numerical value (e.g., predicting house prices, temperature, stock prices).

1.1 Linear Regression

The simplest regression algorithm. It fits a straight line (or a hyperplane when there are several features) through the data points by minimizing the sum of squared errors between the predicted and actual values.
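To make "minimizing the sum of squared errors" concrete, here is a minimal NumPy sketch. The toy data (a line with slope 2 and intercept 1 plus noise) and the helper name `sse` are assumptions for illustration only. It solves the least-squares problem directly and checks that nudging the slope away from the solution makes the error larger.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.shape)

# Design matrix with a column of ones so the line can have an intercept
A = np.column_stack([x, np.ones_like(x)])

# np.linalg.lstsq returns the coefficients that minimize ||A @ w - y||^2
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(m, b):
    """Sum of squared errors for the line y = m*x + b (hypothetical helper)."""
    return float(np.sum((y - (m * x + b)) ** 2))

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"SSE at the fitted line:    {sse(slope, intercept):.2f}")
print(f"SSE with a perturbed slope: {sse(slope + 0.1, intercept):.2f}")
```

Any other slope or intercept gives a larger sum of squared errors than the fitted pair, which is what "best fit" means here.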

Example 1: Simple Linear Regression using Scikit-Learn

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. Generate synthetic housing data (1 feature: square footage)
np.random.seed(42)
sqft = np.random.normal(1500, 500, 100).reshape(-1, 1) # Features must be 2D array
price = sqft * 150 + np.random.normal(0, 50000, 100).reshape(-1, 1) # Price = 150 * sqft + noise

# 2. Initialize and Train ('fit') the model
model = LinearRegression()
model.fit(sqft, price)

# 3. Make Predictions
predicted_prices = model.predict(sqft)

# 4. Evaluate
print(f"R-Squared (Variance explained): {r2_score(price, predicted_prices):.3f}")
print(f"Learned Equation: Price = {model.coef_[0][0]:.0f} * Sqft + {model.intercept_[0]:.0f}")

# (Optional Plotting logic omitted for brevity, but you would see a line cutting through scatter plot)

1.2 Ridge and Lasso Regression (Regularization)

Standard linear regression becomes unstable when many features are correlated (multicollinearity), and it has no unique solution when there are more features than data points. Regularization adds a "penalty" on the size of the coefficients to the loss function.
- Ridge (L2 Penalty): Shrinks all coefficients towards zero (but never exactly to zero), which stabilizes the fit when features are correlated.
- Lasso (L1 Penalty): Can shrink coefficients to exactly zero, effectively performing automatic Feature Selection.
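The shrinking effect of the Ridge penalty can be sketched on a small synthetic problem. Everything about the data here is an assumption for illustration: two nearly identical features, one noise feature, and three arbitrary alpha values. The sketch records the overall size (L2 norm) of the coefficient vector at each alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data: feature 1 is almost a copy of feature 0 (multicollinearity)
x0 = rng.normal(size=200)
x1 = x0 + rng.normal(scale=0.01, size=200)
x2 = rng.normal(size=200)  # unrelated to the target
X = np.column_stack([x0, x1, x2])
y = 4.0 * x0 + rng.normal(scale=0.5, size=200)

alphas = (0.01, 1.0, 100.0)
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:>6}: coef={np.round(ridge.coef_, 3)}  norm={norms[-1]:.3f}")
```

A larger alpha means a stronger penalty, so the coefficient norm gets smaller as alpha grows. The coefficients are pulled towards zero but are not expected to land exactly on it, which is the practical difference from Lasso.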

Example 2: Lasso Regression for Feature Selection

from sklearn.linear_model import Lasso
import numpy as np

# Fix the seed so the run is reproducible
np.random.seed(42)

# 3 features: features 0 and 1 are predictive, feature 2 is pure noise
X = np.random.rand(100, 3)
# Target y depends only on features 0 and 1; feature 2 has no effect.
y = 3 * X[:, 0] + 5 * X[:, 1] + np.random.randn(100) * 0.1

# Initialize Lasso with an alpha (penalty strength) parameter
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)

print("Lasso Coefficients:", lasso_model.coef_)
# The 3rd coefficient should be driven to zero (it may print as 0. or -0.): the model learned to ignore it.
# The first two are also shrunk below their true values of 3 and 5, which is the cost of the penalty.

2. Supervised Learning (Classification)

Predicting a categorical class label (e.g., Is this email Spam or Not Spam? Is this tumor Malignant or Benign?).

2.1 Logistic Regression

Despite the name, this is a classification algorithm. It passes a weighted sum of the features through the Sigmoid function to output a probability between 0 and 1. For two classes, a probability above 0.5 is typically classified as Class A, otherwise as Class B.
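The probability-then-threshold step can be sketched in a few lines of NumPy. The weights, bias, and sample values below are hand-picked assumptions, not learned parameters. A real model would learn them during training.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen by hand purely for illustration
weights = np.array([0.8, -0.5])
bias = 0.1

samples = np.array([
    [3.0, 1.0],  # linear score: 0.8*3 - 0.5*1 + 0.1 = 2.0
    [0.0, 2.0],  # linear score: 0.8*0 - 0.5*2 + 0.1 = -0.9
])

scores = samples @ weights + bias
probabilities = sigmoid(scores)
labels = (probabilities > 0.5).astype(int)  # 1 = Class A, 0 = Class B

print("Probabilities:", np.round(probabilities, 3))
print("Predicted labels:", labels)
```

A positive linear score maps to a probability above 0.5 and a negative score to one below it, so the 0.5 threshold is equivalent to checking the sign of the score.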

Example 3: Logistic Regression with Cross-Validation

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load a real dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Build a pipeline: scale, then classify (Logistic Regression is sensitive to data scale)
# Keeping the scaler inside the pipeline means it is re-fit on the training folds only,
# so no statistics from the held-out fold leak into training.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Perform 5-Fold Cross Validation
# This splits the data into 5 chunks, trains on 4, tests on 1, and rotates.
cv_scores = cross_val_score(log_reg, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Accuracies:", cv_scores)
print(f"Mean Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
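The "train on 4 chunks, test on 1, rotate" idea can be made visible by listing the index splits directly. This is a minimal sketch on ten hypothetical sample indices using scikit-learn's `KFold`. Note that for classifiers `cross_val_score` with an integer `cv` uses a stratified variant by default, so the exact splits in the example above will differ.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical sample indices stand in for a real dataset
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kfold.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every sample is used for testing exactly once across the 5 folds
all_test = np.sort(np.concatenate([test_idx for _, test_idx in folds]))
print("All test indices:", all_test.tolist())
```

Because each sample lands in the test set exactly once, the mean of the five fold scores uses the whole dataset for evaluation without ever testing a model on data it was trained on.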

2.2 Ensemble Methods (Industry Standard)

Instead of training one complex model that might overfit, ensemble methods train many simpler models (usually decision trees) and combine their predictions.
- Random Forest: Trains hundreds of deep trees, each on a bootstrap sample of the rows and considering a random subset of features at each split, then lets them vote. Strong performance with little tuning, and more resistant to overfitting than a single deep tree.
- Gradient Boosting (XGBoost/LightGBM): Trains shallow trees sequentially. Each new tree is fit to the errors left by the trees before it. These libraries are a very common choice in winning Kaggle solutions for tabular data.
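The sequential idea behind gradient boosting can be sketched by hand for the simplest case: regression with squared error, where "the mistakes so far" are just the residuals. This is an illustrative simplification, not the algorithm XGBoost or LightGBM actually implement, and the sine-wave data, learning rate, and tree count are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical 1D regression problem: a noisy sine wave
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
initial_mse = float(np.mean((y - prediction) ** 2))

trees = []
for _ in range(n_trees):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                     # the new tree learns the remaining error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = float(np.mean((y - prediction) ** 2))
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_trees} trees: {final_mse:.4f}")
```

Each shallow tree on its own is a weak model, but adding a small fraction of every tree's correction steadily reduces the training error. The learning rate controls how large each correction is.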

Example 4: XGBoost for Classification

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_wine

# 1. Load Data (Wine classification dataset - 3 classes)
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=42)

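# 2. Initialize XGBoost
# objective 'multi:softprob' outputs one probability per class. Recent XGBoost versions
# usually infer a multi-class objective from the labels, so it is set here for explicitness.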
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    objective='multi:softprob',
    random_state=42
)

# 3. Train
model.fit(X_train, y_train)

# 4. Predict & Evaluate
predictions = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, predictions, target_names=wine.target_names))

# Print Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))

3. Unsupervised Learning

In unsupervised learning, we only have 'X' (features) and no 'y' (labels). The goal is to discover hidden structures.

3.1 Clustering

Grouping similar data points together. Useful for customer segmentation.
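One common way to form such groups is the K-Means loop: assign each point to its nearest center, then move each center to the mean of its points, and repeat. The NumPy sketch below shows that loop under simplifying assumptions: two well-separated synthetic groups and hand-picked starting centers. Library implementations add smarter initialization and multiple restarts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical, well-separated groups of 2D points
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

# Start from one point of each group to keep the sketch deterministic
centers = points[[0, 50]].copy()

for _ in range(10):
    # Assignment step: distance from every point to every center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):
        break  # centers stopped moving, so the assignments are stable
    centers = new_centers

print("Final centers:\n", np.round(centers, 2))
print("Points per cluster:", np.bincount(labels))
```

On data this clean the loop settles within a couple of iterations. On real data the result depends on the starting centers, which is why several initializations are usually tried.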

Example 5: K-Means Clustering

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

# Generate synthetic clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize K-Means asking for 4 clusters
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)

# Fit and Predict which cluster each point belongs to
cluster_labels = kmeans.fit_predict(X)

print("First 10 point cluster assignments:", cluster_labels[:10])
print("Coordinates of the 4 cluster centers:\n", kmeans.cluster_centers_)

3.2 Dimensionality Reduction (PCA)

Compressing datasets with hundreds of features down to 2 or 3 new features (principal components, each a linear combination of the original features) while retaining as much of the variance in the data as possible. Extremely useful for visualizing high-dimensional data or reducing noise before feeding the data to another ML model.
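What "retaining the maximum variance" means can be sketched with plain NumPy: center the data, take a singular value decomposition, and project onto the leading directions. The synthetic data below (5 features driven mostly by one hidden factor) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, most variance along one direction
latent = rng.normal(size=(200, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.5, 0.1]])
X = latent @ loadings + rng.normal(scale=0.2, size=(200, 5))

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, as a fraction of the total
explained_variance = S ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep the first 2 components and project the data onto them
n_components = 2
X_projected = X_centered @ Vt[:n_components].T

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", X_projected.shape)
```

Because almost all of the variation in this toy data comes from one hidden factor, the first component captures nearly all of the variance. Real datasets usually spread it across more components.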

Example 6: Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load images of handwritten digits (8x8 pixels = 64 features)
digits = load_digits()
X = digits.data

print(f"Original Data Shape: {X.shape} (64 dimensions)")

# Compress 64 dimensions down to just 2 dimensions for 2D plotting
pca = PCA(n_components=2)
X_compressed = pca.fit_transform(X)

print(f"Compressed Data Shape: {X_compressed.shape} (2 dimensions)")

# We calculate how much of the original "information" (variance) was kept
variance_ratio = pca.explained_variance_ratio_
print(f"Information kept in Component 1: {variance_ratio[0]*100:.1f}%")
print(f"Information kept in Component 2: {variance_ratio[1]*100:.1f}%")
print(f"Total variance explained with just 2 components: {sum(variance_ratio)*100:.1f}%")