PyTorch: Optimizers and Loss Functions

Once you have defined your neural network architecture using nn.Module, training requires two more components: a loss function (also known as a criterion), which quantifies the error between your model's predictions and the true targets, and an optimizer, which adjusts your model's parameters to minimize that loss using the gradients computed by autograd.

Loss Functions (`torch.nn` module)

Loss functions (or objective functions) measure how well your model performs given a set of parameters. PyTorch provides a variety of common loss functions in the torch.nn module.

Common Loss Functions:

nn.MSELoss (Mean Squared Error Loss):
- Used for regression tasks.
- Calculates the mean of the squared differences between predictions and targets.
- loss = mean((y_pred - y_true)^2)

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([0.5, 2.0, 3.5])
y_true = torch.tensor([1.0, 2.5, 3.0])

mse_loss = nn.MSELoss()
loss = mse_loss(y_pred, y_true)
print(f"MSE Loss: {loss.item()}")
# Output: 0.25
# ((0.5-1)^2 + (2-2.5)^2 + (3.5-3)^2) / 3 = (0.25 + 0.25 + 0.25) / 3 = 0.25
```

nn.L1Loss (Mean Absolute Error Loss):
- Also for regression tasks.
- Calculates the mean of the absolute differences between predictions and targets.
- loss = mean(|y_pred - y_true|)

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([0.5, 2.0, 3.5])
y_true = torch.tensor([1.0, 2.5, 3.0])

l1_loss = nn.L1Loss()
loss = l1_loss(y_pred, y_true)
print(f"L1 Loss: {loss.item()}")
# Output: 0.5
# (0.5 + 0.5 + 0.5) / 3 = 0.5
```

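
As a sanity check, both regression losses can be reproduced with plain tensor operations. The sketch below assumes the default `reduction='mean'` and reuses the example tensors from above; the names `manual_mse` and `manual_l1` are illustrative and not part of PyTorch.

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([0.5, 2.0, 3.5])
y_true = torch.tensor([1.0, 2.5, 3.0])

# Manual versions: mean of the squared / absolute element-wise differences
manual_mse = ((y_pred - y_true) ** 2).mean()
manual_l1 = (y_pred - y_true).abs().mean()

# Built-in versions (default reduction is 'mean')
builtin_mse = nn.MSELoss()(y_pred, y_true)
builtin_l1 = nn.L1Loss()(y_pred, y_true)

print(f"MSE: manual={manual_mse.item():.4f}, built-in={builtin_mse.item():.4f}")
print(f"L1:  manual={manual_l1.item():.4f}, built-in={builtin_l1.item():.4f}")
```

Both pairs should agree, which makes it easier to see what the built-in classes compute.
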
nn.CrossEntropyLoss:
- Commonly used for multi-class classification problems.
- Combines nn.LogSoftmax and nn.NLLLoss (Negative Log Likelihood Loss) in one single class.
- Expects raw, unnormalized scores (logits) from the model's last layer and integer class labels as targets.

```python
import torch
import torch.nn as nn

# Example: 3 classes, batch size 2
# Logits (raw scores) for each class for each sample in the batch
y_pred_logits = torch.tensor([[0.1, 0.9, 0.0],
                              [0.8, 0.1, 0.1]])

# True labels (class indices)
y_true_labels = torch.tensor([1, 0], dtype=torch.long)  # First sample is class 1, second is class 0

cross_entropy_loss = nn.CrossEntropyLoss()
loss = cross_entropy_loss(y_pred_logits, y_true_labels)
print(f"Cross Entropy Loss: {loss.item()}")
```

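
The claim that nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss can be checked directly. The following is a minimal sketch under default settings (mean reduction, no class weights); the variable names are illustrative.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[0.1, 0.9, 0.0],
                       [0.8, 0.1, 0.1]])
labels = torch.tensor([1, 0])

# One-step version
ce = nn.CrossEntropyLoss()(logits, labels)

# Two-step version: log-softmax over the class dimension, then NLL
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

# Fully manual: take the log-probability of the correct class for each
# sample, negate it, and average over the batch
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(ce.item(), nll.item(), manual.item())  # all three should agree
```
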
nn.BCELoss (Binary Cross Entropy Loss):
- Used for binary classification.
- Expects probabilities (values between 0 and 1) from the model's output and binary targets (0 or 1).
- Typically used after a sigmoid activation in the output layer.

```python
import torch
import torch.nn as nn

# Probabilities for class 1
y_pred_proba = torch.tensor([0.9, 0.2, 0.8, 0.1])

# True binary labels
y_true_binary = torch.tensor([1.0, 0.0, 1.0, 0.0])

bce_loss = nn.BCELoss()
loss = bce_loss(y_pred_proba, y_true_binary)
print(f"BCE Loss: {loss.item()}")
```

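
For reference, binary cross entropy averages -(y * log(p) + (1 - y) * log(1 - p)) over the batch. The sketch below reproduces nn.BCELoss by hand for the example values, assuming the default mean reduction; `manual_bce` is an illustrative name.

```python
import torch
import torch.nn as nn

p = torch.tensor([0.9, 0.2, 0.8, 0.1])   # predicted probabilities for class 1
y = torch.tensor([1.0, 0.0, 1.0, 0.0])   # true binary labels

# Per-sample loss: -log(p) where y == 1, -log(1 - p) where y == 0
manual_bce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin_bce = nn.BCELoss()(p, y)

print(manual_bce.item(), builtin_bce.item())  # the two values should agree
```
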
nn.BCEWithLogitsLoss:
- More numerically stable version of BCELoss for binary classification.
- Combines sigmoid and BCELoss into one operation.
- Expects raw, unnormalized scores (logits) from the model's output layer.

```python
import torch
import torch.nn as nn

# Logits (raw scores)
y_pred_logits = torch.tensor([5.0, -2.0, 4.0, -3.0])

# True binary labels
y_true_binary = torch.tensor([1.0, 0.0, 1.0, 0.0])

bce_logits_loss = nn.BCEWithLogitsLoss()
loss = bce_logits_loss(y_pred_logits, y_true_binary)
print(f"BCEWithLogits Loss: {loss.item()}")
```
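
On moderate logits, the fused loss should match applying a sigmoid followed by nn.BCELoss. A small sketch of that comparison, with illustrative variable names:

```python
import torch
import torch.nn as nn

logits = torch.tensor([5.0, -2.0, 4.0, -3.0])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Fused version: sigmoid and BCE in a single operation
fused = nn.BCEWithLogitsLoss()(logits, targets)

# Two-step version: explicit sigmoid, then BCELoss on the probabilities
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(fused.item(), two_step.item())  # should agree up to floating-point error
```

The two versions can diverge for very large logits, where the explicit sigmoid saturates to exactly 0 or 1; this is the case the fused version is designed to handle.
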

Optimizers (`torch.optim` module)

Optimizers are algorithms that update your network's learnable parameters (weights and biases) in order to reduce the loss. Hyperparameters such as the learning rate control the size of each update. torch.optim provides various optimization algorithms.

Common Optimizers:

optim.SGD (Stochastic Gradient Descent):
- The most basic optimizer. It updates parameters in the direction opposite to the gradient of the loss function.
- params -= learning_rate * grad
- Can be augmented with momentum for faster convergence and weight_decay for L2 regularization.

```python
import torch.optim as optim

# `model` is assumed to be an nn.Module instance defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```
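
The update rule `params -= learning_rate * grad` can be verified on a tiny example. This sketch uses plain SGD (no momentum, no weight decay) on a single hand-made tensor `w` instead of a full model; the tensor and the loss are illustrative.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = optim.SGD([w], lr=lr)  # plain SGD: no momentum, no weight decay

loss = (w ** 2).sum()  # gradient of this loss is 2 * w
loss.backward()

# What the rule predicts: params - learning_rate * grad
expected = w.detach() - lr * w.grad

optimizer.step()
print(w.detach(), expected)  # the updated parameters should match `expected`
```
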
optim.Adam (Adaptive Moment Estimation):
- An adaptive learning rate optimization algorithm that combines ideas from RMSprop and momentum.
- Generally a good default choice; it often converges faster than plain SGD with little tuning.

```python
import torch.optim as optim

# `model` is assumed to be an nn.Module instance defined elsewhere
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
```
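
One way to see what "adaptive" means: on the very first Adam step, the bias-corrected moment estimates make the update approximately `-lr * sign(grad)`, so every parameter moves by about the same amount regardless of how large its gradient is. The sketch below illustrates this with hand-picked gradient scales; the tensor names are illustrative.

```python
import torch
import torch.optim as optim

scales = torch.tensor([10.0, -0.1, 1000.0])  # very different gradient magnitudes
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
lr = 0.001
optimizer = optim.Adam([w], lr=lr)

before = w.detach().clone()
loss = (w * scales).sum()  # gradient with respect to w equals `scales`
loss.backward()
optimizer.step()

delta = w.detach() - before
print(delta)  # each entry is roughly -lr * sign(grad)
```

With plain SGD, the same gradients would produce updates that differ by four orders of magnitude.
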
optim.Adagrad, optim.RMSprop, optim.Adadelta: Other adaptive learning rate optimizers.

How to Use an Optimizer

Instantiate the optimizer: Pass the model's parameters (model.parameters()) and a learning rate.
Zero the gradients: Before each backpropagation step, clear the gradients of all optimized tensors with optimizer.zero_grad(), because loss.backward() adds to existing gradients rather than overwriting them.
Backward pass: Compute gradients using loss.backward().
Step the optimizer: Update the model's parameters using optimizer.step().
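
To illustrate why the zeroing step matters, the sketch below calls backward twice without clearing gradients and then once more after `optimizer.zero_grad()`. The tensor `w` and the quadratic loss are illustrative; no parameter update is performed.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

(w ** 2).sum().backward()
first = w.grad.clone()          # gradient of sum(w^2) is 2 * w

(w ** 2).sum().backward()       # no zeroing in between
accumulated = w.grad.clone()    # second gradient was added to the first

optimizer.zero_grad()           # clear stored gradients
(w ** 2).sum().backward()
fresh = w.grad.clone()          # back to a single gradient

print(first, accumulated, fresh)
```
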

Example: Training Loop with Optimizer and Loss

Let's combine concepts from nn.Module, autograd, loss functions, and optimizers into a basic training loop.

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model (reusing SimpleNN from pytorch_nn_module.md)
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Hyperparameters
input_size = 10    # Example: 10 features
hidden_size = 20
num_classes = 2    # Binary classification
learning_rate = 0.01
num_epochs = 100
batch_size = 16    # Defined for reference; the loop below trains on the full dataset each epoch

# Dummy data
X_train = torch.randn(100, input_size) # 100 samples, 10 features
y_train = torch.randint(0, num_classes, (100,), dtype=torch.long) # 100 labels (0 or 1)

# Instantiate model, loss, and optimizer
model = SimpleNN(input_size, hidden_size, num_classes)
criterion = nn.CrossEntropyLoss() # Expects logits; works for any number of classes, including 2
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print("Starting Training...")
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward and optimize
    optimizer.zero_grad() # Clear previous gradients
    loss.backward()       # Compute gradients
    optimizer.step()      # Update weights

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training Finished.")

# Evaluate (simple evaluation on training data)
with torch.no_grad():
    outputs = model(X_train)
    _, predicted = torch.max(outputs, 1)  # .data is unnecessary inside torch.no_grad()
    total = y_train.size(0)
    correct = (predicted == y_train).sum().item()
    print(f'Accuracy on training data: {100 * correct / total:.2f}%')
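
The loop above feeds all 100 samples to the model at once, so `batch_size` is never used. A possible mini-batch variant is sketched below using `TensorDataset` and `DataLoader` from `torch.utils.data`. The `nn.Sequential` model is a stand-in with the same layers as SimpleNN, and the seed and the epoch count of 5 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # arbitrary seed, only for repeatable dummy data

input_size, hidden_size, num_classes = 10, 20, 2
batch_size = 16

# Dummy data, same shapes as in the example above
X_train = torch.randn(100, input_size)
y_train = torch.randint(0, num_classes, (100,))

# Stand-in model with the same layers as SimpleNN
model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, num_classes),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# DataLoader splits the dataset into shuffled mini-batches
loader = DataLoader(TensorDataset(X_train, y_train),
                    batch_size=batch_size, shuffle=True)

for epoch in range(5):
    epoch_loss = 0.0
    for X_batch, y_batch in loader:
        optimizer.zero_grad()                      # clear previous gradients
        loss = criterion(model(X_batch), y_batch)  # forward pass on one batch
        loss.backward()                            # compute gradients
        optimizer.step()                           # update weights
        epoch_loss += loss.item() * X_batch.size(0)
    print(f"Epoch {epoch + 1}: mean loss {epoch_loss / len(X_train):.4f}")
```

With 100 samples and a batch size of 16, each epoch performs 7 parameter updates (the last batch holds the remaining 4 samples) instead of one.
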

Further Topics:

Learning Rate Schedulers (e.g., optim.lr_scheduler)
Custom Loss Functions
Gradient Clipping
Zeroing gradients vs. setting to None
Weight Decay and Regularization

This document explains the essential components of PyTorch for defining how your model learns: loss functions and optimizers. These are foundational for developing any deep learning application.
