PyTorch Autograd: Automatic Differentiation

PyTorch's autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework: the backward graph is built as your code runs, so it can be different on every iteration. This makes it straightforward to express models whose structure depends on the data, such as networks with data-dependent loops or branches.
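
To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides which operations are recorded. The function name f and the input values are illustrative assumptions, not part of any PyTorch API.

```python
import torch

def f(x):
    # Hypothetical function: the branch taken depends on the runtime value of x,
    # so each call records a different graph.
    if x.item() > 0:
        return x ** 2      # derivative: 2x
    return -3 * x          # derivative: -3

grads = []
for value in (2.0, -1.0):
    x = torch.tensor(value, requires_grad=True)
    out = f(x)
    out.backward()         # differentiates only the branch that actually ran
    grads.append(x.grad.item())

print(grads)  # [4.0, -3.0]
```

The graph is rebuilt on every call, so nothing has to be declared ahead of time about which branch will run.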

Key Concepts:

torch.Tensor: The central class of PyTorch. If you set its .requires_grad attribute to True, autograd tracks all operations on it. When you finish your computation, you can call .backward() on the result to compute the gradients automatically; the gradient for each tracked leaf tensor is accumulated into its .grad attribute.
torch.autograd.Function: Implements forward and backward definitions of an autograd operation.
Computational Graph: A directed acyclic graph (DAG) where nodes are operations and edges are tensors. When you call .backward(), autograd traverses this graph backward to compute gradients.
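
As a rough illustration of what "forward and backward definitions" means, the sketch below defines a custom torch.autograd.Function for squaring a tensor. The class name Square is a hypothetical example; built-in operations such as x * x already provide their own backward.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example op: y = x * x with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; it is needed to compute the gradient later.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx = grad_output * 2x
        return grad_output * 2 * x

x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)   # custom Functions are invoked through .apply
y.backward()
print(x.grad)  # tensor(6.)
```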

How Autograd Works

Every tensor has an attribute .grad_fn that references a Function that created the Tensor (except for Tensors created by the user - their grad_fn is None). This Function knows how to compute the gradients for its inputs given the gradients of its outputs.
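
One way to see these links is to follow grad_fn.next_functions from an output back toward its inputs. This is a small exploratory sketch; the helper name walk is made up for illustration, and the exact node class names may vary between PyTorch versions.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
c = (x + y) * (x * y)

def walk(fn, depth=0, names=None):
    """Hypothetical helper: print the backward graph reachable from fn."""
    names = [] if names is None else names
    if fn is None:          # inputs that do not require grad show up as None
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    # next_functions holds (node, input_index) pairs for this node's inputs
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

names = walk(c.grad_fn)
```

The nodes at the bottom of the printed tree are the ones that accumulate gradients into the leaf tensors x and y.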

Example: Basic Autograd

import torch

# Create a tensor and set requires_grad=True to track computation
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Perform some operations
a = x + y
b = x * y
c = a * b

print(f"x: {x}, y: {y}")
print(f"a = x + y = {a}")
print(f"b = x * y = {b}")
print(f"c = a * b = {c}")

# The grad_fn for a tensor holds the operation that created it
print(f"x.grad_fn: {x.grad_fn}") # None, as x is a leaf tensor created by user
print(f"a.grad_fn: {a.grad_fn}") # <AddBackward0 object at ...>
print(f"c.grad_fn: {c.grad_fn}") # <MulBackward0 object at ...>

# Compute gradients: c = (x + y) * (x * y) = x^2 * y + x * y^2
# dc/dx = 2xy + y^2
# dc/dy = x^2 + 2xy

# At x=2, y=3:
# dc/dx = 2*2*3 + 3^2 = 12 + 9 = 21
# dc/dy = 2^2 + 2*2*3 = 4 + 12 = 16

# Perform backpropagation
c.backward()

# Access the gradients
print(f"\nGradient of c with respect to x (dc/dx): {x.grad}")
print(f"Gradient of c with respect to y (dc/dy): {y.grad}")

Gradients with Vector-Jacobian Product

In general, backward() computes a vector-Jacobian product. If no gradient argument is passed to backward() (i.e., c.backward()), autograd assumes that the tensor being backpropagated is a scalar (e.g., a loss value) and uses a gradient of 1. If the tensor is non-scalar (e.g., a vector or matrix), you need to pass a gradient argument that is a tensor of matching shape. This gradient tensor acts as the vector v in the vector-Jacobian product v^T J, where J is the Jacobian of the output with respect to the input.

import torch

x = torch.randn(3, requires_grad=True)
print(f"x: {x}")

y = x * 2
while y.norm() < 1000: # Example of a dynamic graph
    y = y * 2
print(f"y: {y}")

# y is a vector, not a scalar, so backward() needs a gradient argument of the same shape.
# If y were a scalar, like y.mean(), we wouldn't need to pass an argument.
# Passing v computes the vector-Jacobian product v^T J, which is the gradient
# of the scalar (y * v).sum() with respect to x.
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float32)
y.backward(v)

print(f"Gradient of (y * v).sum() with respect to x: {x.grad}")
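
To check the vector-Jacobian interpretation, the sketch below builds the full Jacobian with torch.autograd.functional.jacobian and compares v^T J against what backward(v) leaves in x.grad. The function f and the input values are assumptions chosen only to keep the arithmetic small.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Hypothetical vector-valued function: y_i = x_i^2 + sum_j x_j
    return x ** 2 + x.sum()

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])

# Route 1: let backward() compute the vector-Jacobian product directly.
y = f(x)
y.backward(v)

# Route 2: materialize the 3x3 Jacobian and multiply by v explicitly.
J = jacobian(f, x.detach())
expected = v @ J

print(torch.allclose(x.grad, expected))  # True
```

Building the full Jacobian is only practical for small inputs; backward(v) never forms J explicitly, which is why it scales to large models.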

Disabling Gradient Tracking

Sometimes you might need to perform an operation without tracking gradients. This is useful for:

1. Freezing parts of your model: When training a pre-trained network, you might want to freeze some layers.
2. Performing inference: When you're just making predictions, you don't need to compute gradients, saving memory and computation.

You can do this using torch.no_grad() or .detach().

import torch

x = torch.tensor(5.0, requires_grad=True)
y = x * 2

print(f"y.requires_grad: {y.requires_grad}")

# Option 1: Using torch.no_grad() context manager
with torch.no_grad():
    z = x * 3
    print(f"z.requires_grad inside no_grad: {z.requires_grad}")

# Option 2: Using .detach() method
w = x.detach()
print(f"w.requires_grad (detached): {w.requires_grad}")

# z was computed under no_grad, so it has requires_grad=False and no grad_fn.
# Calling backward() on it raises a RuntimeError.
try:
    z.backward()
except RuntimeError as e:
    print(f"\nRuntimeError when calling z.backward(): {e}")

# A more direct example for inference
model = torch.nn.Linear(1, 1)
input_data = torch.randn(1, 1)

with torch.no_grad():
    output = model(input_data)
print(f"\noutput.requires_grad under no_grad: {output.requires_grad}")

output_with_grad = model(input_data)
print(f"output_with_grad.requires_grad (tracking enabled): {output_with_grad.requires_grad}")
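
For the layer-freezing use case mentioned above, a common approach is to turn off requires_grad on the parameters you want to keep fixed. The sketch below assumes a small made-up two-layer model; the layer sizes and the choice of which layer to freeze are illustrative only.

```python
import torch
from torch import nn

# Hypothetical model: treat the first Linear layer as the "pre-trained" part.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Freeze the first layer so autograd does not compute gradients for it.
for p in model[0].parameters():
    p.requires_grad_(False)

loss = model(torch.randn(2, 4)).sum()
loss.backward()

frozen_grads = [p.grad for p in model[0].parameters()]
head_grads = [p.grad for p in model[2].parameters()]

print(frozen_grads)                            # [None, None]
print(all(g is not None for g in head_grads))  # True
```

Only parameters that still require gradients receive a .grad, so an optimizer built from them updates the unfrozen layer alone.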

Further Topics:

autograd in nn.Module
Custom autograd Functions
Profiling with torch.autograd.profiler
Gradient accumulation and clipping

This document provides an introduction to PyTorch's autograd system. Understanding this mechanism is crucial for building and training neural networks efficiently. Subsequent files will delve into building neural networks, optimization, and more.