Deep Learning: Advanced Architectures and Training
Deep Learning (DL) is a subset of machine learning based on artificial neural networks with multiple layers (hence "deep"). This guide covers advanced architectures and the mechanics of training robust models.
1. Key Architectures
Convolutional Neural Networks (CNN)
- Primary Use: Computer Vision (Object detection, Image classification).
- Core Layers:
- Convolutional: Extracts features (edges, shapes) using filters (kernels).
- Pooling (Max/Average): Downsamples the feature maps, which reduces their spatial size and makes features more robust to small shifts in position.
- Fully Connected: Maps high-level features to output classes.
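To make the convolution and pooling layers concrete, here is a minimal pure-Python sketch. The function names, the toy 5x5 image, and the 2x2 vertical-edge kernel are all illustrative assumptions, not part of any library; real frameworks add padding, strides, channels, and learned kernels.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, the operation CNN layers compute."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Toy image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical filter that responds where brightness rises from left to right.
kernel = [[-1, 1],
          [-1, 1]]

features = conv2d(image, kernel)  # 4x4 feature map, non-zero only at the edge
pooled = max_pool2d(features)     # 2x2 map after downsampling

print(features)
print(pooled)
```

The feature map is non-zero only in the column where the edge sits, and pooling keeps that response while halving each spatial dimension.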
Transformers
- Primary Use: Natural Language Processing (LLMs), Computer Vision (Vision Transformers).
- Core Concept: Attention Mechanism.
- Allows the model to weigh the importance of different parts of the input data regardless of distance.
- Self-Attention: Every token attends to every other token in the same sequence. In "The cat sat on the mat," the model can learn that "sat" is strongly related to "cat".
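The attention idea above can be sketched as scaled dot-product attention in plain Python. This is a simplified illustration: the 2-dimensional embeddings are made up, and the learned query/key/value projections of a real Transformer are omitted, so queries, keys, and values are all the raw embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["cat", "sat", "mat"]
# Made-up embeddings: "cat" and "sat" point in similar directions.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

outputs, weights = self_attention(x, x, x)  # no learned projections here

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the row for "sat" puts more weight on "cat" than on "mat" because their embeddings are more similar. Position in the sequence plays no role in the score, which is why distance does not matter.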
Recurrent Neural Networks (RNN & LSTM)
- Primary Use: Time-series (Stock prediction), Sequential data.
- LSTM (Long Short-Term Memory): Mitigates the Vanishing Gradient problem by using "gates" to decide what information to keep or forget over long sequences.
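A single LSTM step with scalar state shows how the gates work. This is a sketch under assumptions: the parameter names (`wf`, `uf`, `bf`, and so on) are hypothetical, and the values are hand-picked so the forget gate stays open and the input gate stays closed. Real LSTMs use learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


# Hand-picked parameters: forget gate close to 1, input gate close to 0.
params = {name: 0.0 for name in
          ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bo", "bg")}
params["bf"] = 10.0
params["bi"] = -10.0

h, c = 0.0, 1.0  # start with a value stored in the cell state
for _ in range(50):
    h, c = lstm_step(x=0.5, h_prev=h, c_prev=c, p=params)

print(round(c, 4))  # the stored value is almost unchanged after 50 steps
```

Because the cell state is updated additively and scaled by a forget gate near 1, the stored value (and the gradient flowing through it) is not repeatedly squashed the way it is in a plain RNN.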
2. Training Mechanics: The "Secret Sauce"
Overfitting vs. Underfitting
- Overfitting: Model performs great on training data but poorly on test data (it "memorized" the noise).
- Fixes: Dropout (randomly disabling neurons), Data Augmentation, Early Stopping.
- Underfitting: Model is too simple to learn the underlying patterns.
- Fixes: Increase model complexity, train for more epochs.
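Of the overfitting fixes listed above, Early Stopping is the easiest to sketch without a framework. The helper below is hypothetical (its name, the `patience` parameter, and the validation losses are illustrative): it stops once validation loss has failed to improve for `patience` consecutive epochs and reports which epoch had the best loss.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all recorded epochs.
    return len(val_losses) - 1, best_epoch


# Illustrative curve: validation loss falls, then rises as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.60, 0.64, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice you would also save a checkpoint whenever the best loss improves and restore it after stopping, so the final model is the one from `best_epoch`.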
Optimizers
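- SGD (Stochastic Gradient Descent): Updates every weight with the same global learning rate, using gradients computed on mini-batches. Simple and reliable, and often paired with momentum.
- Adam (Adaptive Moment Estimation): A widely used default. It combines the benefits of AdaGrad and RMSProp by tracking running averages of each parameter's gradient and squared gradient, and uses them to adjust the step size for each parameter.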
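The difference between the two optimizers is easiest to see in their update rules. Below is a minimal single-parameter sketch, assuming the commonly cited Adam defaults for `beta1`, `beta2`, and `eps`; the function names, the `state` dictionary layout, and the toy objective f(w) = (w - 3)^2 are illustrative.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain SGD: the step is proportional to the raw gradient."""
    return w - lr * grad


def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; `state` holds t, m, and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # mean of squares
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


# Adam's first step has roughly the same size whatever the gradient scale.
after_big_grad = adam_step(0.0, 1000.0, fresh_state())
after_small_grad = adam_step(0.0, 0.001, fresh_state())
print(after_big_grad, after_small_grad)  # both close to -0.1

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_sgd = w_adam = 0.0
state = fresh_state()
for _ in range(300):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3))
    w_adam = adam_step(w_adam, 2 * (w_adam - 3), state)
print(round(w_sgd, 3), round(w_adam, 3))
```

Dividing by the running root-mean-square of the gradient is what gives each parameter its own effective learning rate. In a real model the same update is applied elementwise to every weight tensor.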
Transfer Learning
Instead of training from scratch, you take a model pre-trained on a massive dataset (like ImageNet) and "fine-tune" the last few layers for your specific task.
- Why?: Requires significantly less data and compute power.
3. Interview Deep-Dive (FAQs)
- What is the Vanishing Gradient problem?
- During backpropagation, the chain rule multiplies the per-layer derivatives together. In very deep networks, if these factors are consistently small (<1), the gradient shrinks exponentially as it flows back and approaches zero. The early layers then receive almost no update, so they barely "learn".
- Explain the purpose of Batch Normalization.
- It normalizes the inputs of each layer over the current mini-batch to have a mean of zero and a variance of one, then applies a learnable scale and shift. This stabilizes the training process, allows for higher learning rates, and acts as a mild form of regularization.
- What is the difference between L1 and L2 Regularization?
- L1 (Lasso): Adds the sum of the absolute values of the weights to the loss function. It can drive weights to exactly zero, performing feature selection.
- L2 (Ridge): Adds the sum of the squared weights. It penalizes large weights and shrinks them toward zero, but rarely makes them exactly zero. This is more common in DL, usually in the form of weight decay.
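For the Vanishing Gradient question, a quick numeric check helps. The sketch assumes a chain of sigmoid layers with weight 1, all evaluated at pre-activation 0, where the sigmoid derivative takes its maximum value of 0.25; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25, reached at z = 0


def backprop_gradient_scale(depth, z=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) over `depth` layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_grad(z)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, backprop_gradient_scale(depth))
```

Even in this best case for the sigmoid, the gradient reaching the first layer of a 20-layer chain is scaled by 0.25 to the power 20, which is below 1e-11. This is one reason ReLU activations and residual connections are common in deep networks.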
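For the Batch Normalization question, the training-time computation for one feature can be sketched in a few lines. The function name and sample activations are illustrative; `gamma` and `beta` stand for the learnable scale and shift, and `eps` is a small constant for numerical stability. The running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print([round(x, 3) for x in normalized])
print(round(norm_mean, 6), round(norm_var, 6))
```

With `gamma=1` and `beta=0` the output has mean 0 and variance very close to 1. During training the network is free to learn other values of `gamma` and `beta` if a different scale works better.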
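For the L1 vs. L2 question, the different effect on small weights can be seen by applying only the penalty's update to one weight. This is a sketch under assumptions: the L1 update uses soft-thresholding (the proximal step), which is one common way to handle the non-differentiable absolute value, and the L2 update is a plain gradient step on lam * w^2. All names and constants are illustrative.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)


def l1_prox_step(w, lr, lam):
    """Soft-thresholding: move w toward zero by lr * lam, stopping at zero."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def l2_decay_step(w, lr, lam):
    """Gradient step on lam * w^2: shrink w by a constant factor."""
    return w * (1.0 - 2.0 * lr * lam)


w_l1 = w_l2 = 0.05
for _ in range(10):
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.1)
    w_l2 = l2_decay_step(w_l2, lr=0.1, lam=0.1)

print(w_l1, w_l2)
```

The L1 update subtracts a fixed amount each step, so a small weight reaches exactly zero. The L2 update multiplies by a factor slightly below 1, so the weight keeps shrinking but stays non-zero.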
Implementation: Simple Neural Network (PyTorch)
import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    """Two-layer fully connected network: 784 inputs, 10 output logits."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # Input to hidden
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)  # Hidden to output

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)  # Raw logits; no softmax is applied here
        return x


model = SimpleNet()
print(model)  # Prints the layer structure