PyTorch: Optimizers and Loss Functions
Once you have defined your neural network architecture using nn.Module, training requires two more components. The first is a loss function (also known as a criterion), which quantifies the error between your model's predictions and the true targets. The second is an optimizer, which adjusts your model's parameters to minimize this loss using the gradients computed by autograd.
Loss Functions (torch.nn module)
Loss functions (or objective functions) measure how well your model performs given a set of parameters. PyTorch provides a variety of common loss functions in the torch.nn module.
Common Loss Functions:
- nn.MSELoss (Mean Squared Error Loss):
  - Used for regression tasks.
  - Calculates the mean of the squared differences between predictions and targets.
  - loss = mean((y_pred - y_true)^2)

  ```python
  import torch
  import torch.nn as nn

  y_pred = torch.tensor([0.5, 2.0, 3.5])
  y_true = torch.tensor([1.0, 2.5, 3.0])
  mse_loss = nn.MSELoss()
  loss = mse_loss(y_pred, y_true)
  # ((0.5-1)^2 + (2-2.5)^2 + (3.5-3)^2) / 3 = (0.25 + 0.25 + 0.25) / 3 = 0.25
  print(f"MSE Loss: {loss.item()}")  # Output: 0.25
  ```
- nn.L1Loss (Mean Absolute Error Loss):
  - Also for regression tasks.
  - Calculates the mean of the absolute differences between predictions and targets.
  - loss = mean(|y_pred - y_true|)

  ```python
  import torch
  import torch.nn as nn

  y_pred = torch.tensor([0.5, 2.0, 3.5])
  y_true = torch.tensor([1.0, 2.5, 3.0])
  l1_loss = nn.L1Loss()
  loss = l1_loss(y_pred, y_true)
  # (0.5 + 0.5 + 0.5) / 3 = 0.5
  print(f"L1 Loss: {loss.item()}")  # Output: 0.5
  ```
- nn.CrossEntropyLoss:
  - Commonly used for multi-class classification problems.
  - Combines nn.LogSoftmax and nn.NLLLoss (Negative Log Likelihood Loss) in one single class.
  - Expects raw, unnormalized scores (logits) from the model's last layer and integer class labels as targets.

  ```python
  import torch
  import torch.nn as nn

  # Example: 3 classes, batch size 2
  # Logits (raw scores) for each class for each sample in the batch
  y_pred_logits = torch.tensor([[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]])
  # True labels (class indices)
  y_true_labels = torch.tensor([1, 0], dtype=torch.long)  # First sample is class 1, second is class 0

  cross_entropy_loss = nn.CrossEntropyLoss()
  loss = cross_entropy_loss(y_pred_logits, y_true_labels)
  print(f"Cross Entropy Loss: {loss.item()}")
  ```
- nn.BCELoss (Binary Cross Entropy Loss):
  - Used for binary classification.
  - Expects probabilities (values between 0 and 1) from the model's output and binary targets (0 or 1).
  - Typically used after a sigmoid activation in the output layer.

  ```python
  import torch
  import torch.nn as nn

  # Probabilities for class 1
  y_pred_proba = torch.tensor([0.9, 0.2, 0.8, 0.1])
  # True binary labels
  y_true_binary = torch.tensor([1.0, 0.0, 1.0, 0.0])

  bce_loss = nn.BCELoss()
  loss = bce_loss(y_pred_proba, y_true_binary)
  print(f"BCE Loss: {loss.item()}")
  ```
- nn.BCEWithLogitsLoss:
  - More numerically stable version of BCELoss for binary classification.
  - Combines sigmoid and BCELoss into one operation.
  - Expects raw, unnormalized scores (logits) from the model's output layer.

  ```python
  import torch
  import torch.nn as nn

  # Logits (raw scores)
  y_pred_logits = torch.tensor([5.0, -2.0, 4.0, -3.0])
  # True binary labels
  y_true_binary = torch.tensor([1.0, 0.0, 1.0, 0.0])

  bce_logits_loss = nn.BCEWithLogitsLoss()
  loss = bce_logits_loss(y_pred_logits, y_true_binary)
  print(f"BCEWithLogits Loss: {loss.item()}")
  ```
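The two "combined" losses described in this list can be checked numerically. The following is a minimal sketch that reuses the example tensors from the CrossEntropyLoss and BCEWithLogitsLoss items; the variable names are illustrative, and the explicit atol in the second comparison is an assumption chosen to absorb float32 rounding differences between the two code paths.

```python
import torch
import torch.nn as nn

# Same example logits and class labels as in the CrossEntropyLoss item.
logits = torch.tensor([[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]])
labels = torch.tensor([1, 0], dtype=torch.long)

# CrossEntropyLoss applied directly to the logits...
ce = nn.CrossEntropyLoss()(logits, labels)
# ...versus LogSoftmax over the class dimension followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)
print(torch.allclose(ce, nll))  # True

# Same example logits and binary targets as in the BCEWithLogitsLoss item.
binary_logits = torch.tensor([5.0, -2.0, 4.0, -3.0])
binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

# BCEWithLogitsLoss on raw logits versus BCELoss on sigmoid outputs.
bce_with_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
bce_after_sigmoid = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)
print(torch.allclose(bce_with_logits, bce_after_sigmoid, atol=1e-6))  # True
```

For moderate logits like these the two routes agree. The fused versions matter mainly when logits are large in magnitude, where computing the sigmoid or softmax separately can saturate.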
Optimizers (torch.optim module)
Optimizers are algorithms that update your neural network's parameters (its weights and biases) in order to reduce the loss. The learning rate is a hyperparameter that controls the size of each update. torch.optim provides various optimization algorithms.
Common Optimizers:
- optim.SGD (Stochastic Gradient Descent):
  - The most basic optimizer. It updates parameters in the direction opposite to the gradient of the loss function.
  - params -= learning_rate * grad
  - Can be augmented with momentum for faster convergence and weight_decay for L2 regularization.

  ```python
  import torch.nn as nn
  import torch.optim as optim

  model = nn.Linear(10, 2)  # Placeholder model so the snippet runs on its own
  optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
  ```
- optim.Adam (Adaptive Moment Estimation):
  - An adaptive learning rate optimization algorithm that combines ideas from RMSprop and momentum.
  - Generally a good default choice, and in practice it often converges faster than plain SGD.

  ```python
  import torch.nn as nn
  import torch.optim as optim

  model = nn.Linear(10, 2)  # Placeholder model so the snippet runs on its own
  optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
  ```
- optim.Adagrad, optim.RMSprop, optim.Adadelta: Other adaptive learning rate optimizers.
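To make the SGD update rule concrete, the following minimal sketch compares one optim.SGD step (without momentum or weight decay) against the rule params -= learning_rate * grad written out by hand. The parameter w, the learning rate, and the toy loss are hypothetical values chosen only for illustration.

```python
import torch
import torch.optim as optim

# Hypothetical parameter and learning rate, chosen for illustration.
w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

optimizer = optim.SGD([w], lr=lr)  # plain SGD: no momentum, no weight decay

loss = (w ** 2).sum()  # toy loss; its gradient with respect to w is 2 * w
loss.backward()

# The update rule written out by hand, stored in a separate tensor.
expected = w.detach() - lr * w.grad

optimizer.step()  # the optimizer applies the same rule to w in place
print(torch.allclose(w.detach(), expected))  # True
```

With momentum or weight_decay set, the optimizer adds extra terms to this rule, so the hand-written version above would no longer match.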
How to Use an Optimizer
- Instantiate the optimizer: Pass the model's parameters (model.parameters()) and a learning rate.
- Zero the gradients: Before each backpropagation step, clear the gradients of all optimized tensors with optimizer.zero_grad(), because gradients accumulate across backward passes by default.
- Backward pass: Compute gradients using loss.backward().
- Step the optimizer: Update the model's parameters using optimizer.step().
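The need for the zeroing step can be seen in a small experiment. The sketch below assumes a hypothetical scalar parameter w and a toy expression 2 * w, so that every backward pass should produce a gradient of exactly 2.

```python
import torch
import torch.optim as optim

# Hypothetical scalar parameter, used only to make the gradients easy to read.
w = torch.tensor(3.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

(2 * w).backward()
first_grad = w.grad.item()        # d(2w)/dw = 2

(2 * w).backward()                # no zeroing in between
accumulated_grad = w.grad.item()  # 2 + 2 = 4: the new gradient was added

optimizer.zero_grad()             # clear the stored gradient
(2 * w).backward()
fresh_grad = w.grad.item()        # back to 2

print(first_grad, accumulated_grad, fresh_grad)  # 2.0 4.0 2.0
```

Without the call to optimizer.zero_grad(), each optimizer step would use the sum of all gradients computed so far rather than the gradient of the current loss.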
Example: Training Loop with Optimizer and Loss
Let's combine concepts from nn.Module, autograd, loss functions, and optimizers into a basic training loop.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model (reusing SimpleNN from pytorch_nn_module.md)
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out  # Raw logits; CrossEntropyLoss applies log-softmax internally

# Hyperparameters
input_size = 10  # Example: 10 features
hidden_size = 20
num_classes = 2  # Binary classification
learning_rate = 0.01
num_epochs = 100
batch_size = 16  # Not used below: this loop trains on the full dataset each epoch

# Dummy data
X_train = torch.randn(100, input_size)  # 100 samples, 10 features
y_train = torch.randint(0, num_classes, (100,), dtype=torch.long)  # 100 labels (0 or 1)

# Instantiate model, loss, and optimizer
model = SimpleNN(input_size, hidden_size, num_classes)
criterion = nn.CrossEntropyLoss()  # For multi-class classification
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print("Starting Training...")
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward and optimize
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
print("Training Finished.")

# Evaluate (simple evaluation on training data)
with torch.no_grad():
    outputs = model(X_train)
    _, predicted = torch.max(outputs, 1)  # Index of the largest logit per sample
    total = y_train.size(0)
    correct = (predicted == y_train).sum().item()
    print(f'Accuracy on training data: {100 * correct / total:.2f}%')
```
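The loop above defines batch_size but feeds the whole dataset to the model in every epoch. A mini-batch variant could look like the following sketch. It assumes torch.utils.data.DataLoader and TensorDataset for batching, uses an nn.Sequential stand-in with the same layer sizes as SimpleNN so the snippet runs on its own, and fixes a seed and a short epoch count purely for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # assumption: fixed seed only to make runs repeatable

# Dummy data with the same shapes as in the example above.
X_train = torch.randn(100, 10)
y_train = torch.randint(0, 2, (100,), dtype=torch.long)

# Stand-in model with the same layer sizes as SimpleNN.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# batch_size=16 over 100 samples gives 7 batches per epoch:
# six batches of 16 samples and a final batch of 4.
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=16, shuffle=True)

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for X_batch, y_batch in loader:
        optimizer.zero_grad()                      # clear gradients from the previous batch
        loss = criterion(model(X_batch), y_batch)  # forward pass on one mini-batch
        loss.backward()                            # compute gradients
        optimizer.step()                           # update weights
        # Weight each batch's mean loss by its size so the epoch average is exact.
        running_loss += loss.item() * X_batch.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))
    print(f"Epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

The optimizer steps once per batch here, so each epoch performs seven parameter updates instead of one.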
Further Topics:
- Learning Rate Schedulers (e.g., optim.lr_scheduler)
- Custom Loss Functions
- Gradient Clipping
- Zeroing gradients vs. setting to None
- Weight Decay and Regularization
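As a brief pointer for three of these topics, the sketch below shows where a learning rate scheduler, gradient clipping, and zero_grad(set_to_none=True) would typically sit in a training loop. The model, data, and all hyperparameter values (step_size, gamma, max_norm) are placeholders chosen for illustration, not recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
# StepLR multiplies the learning rate by gamma every step_size epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

# Placeholder data: 8 samples, 10 features, 2 classes.
X = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

learning_rates = []
for epoch in range(4):
    # Reset gradients to None instead of filling them with zeros.
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(X), y)
    loss.backward()
    # Rescale gradients so that their total norm is at most max_norm.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # advance the schedule once per epoch, after optimizer.step()
    learning_rates.append(scheduler.get_last_lr()[0])

print(learning_rates)  # the learning rate is halved after every 2 epochs
```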
This document explains the essential components of PyTorch for defining how your model learns: loss functions and optimizers. These are foundational for developing any deep learning application.