Master convolutional neural networks and modern deep learning architectures
Applies learnable filters (kernels) to extract local features:
• Slide kernel over input
• Element-wise multiply and sum
• Shared weights across spatial locations
• W = input width/height
• K = kernel size
• P = padding
• S = stride
• Far fewer parameters than a fully connected layer
• Example: 3×3×64 filters, 128 output channels = 73,856 params
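To make the sliding-window description concrete, here is a minimal NumPy sketch of a single-channel convolution (stride 1, no padding); the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def conv2d_single_channel(X, K):
    """Naive 2D convolution: slide K over X, element-wise multiply and sum.
    Single channel, no padding, stride 1."""
    H, W = X.shape
    kH, kW = K.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i+kH, j:j+kW] * K)
    return out

X = np.arange(25, dtype=float).reshape(5, 5)
K = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
print(conv2d_single_channel(X, K).shape)     # (3, 3): each side shrinks by K - 1

# Parameter count from the example above: 3x3 kernels over 64 input channels,
# 128 output channels, one bias per filter.
print((3 * 3 * 64 + 1) * 128)                # 73856
```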
Max Pooling
Take maximum value in each window. Captures strongest feature activation.
Average Pooling
Take average value in each window. Smoother down-sampling.
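A small NumPy sketch of both pooling variants on 2×2 windows with stride 2 (names and values are illustrative):

```python
import numpy as np

def pool2d(X, size=2, stride=2, mode="max"):
    """Pooling: take the max (or average) value in each window."""
    H, W = X.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = X[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

X = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [1., 4., 3., 8.]])
print(pool2d(X, mode="max"))   # [[6. 4.] [7. 9.]]
print(pool2d(X, mode="avg"))   # [[3.75 2.25] [3.5  5.  ]]
```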
Final layers flatten feature maps and perform classification. Typically 1-2 FC layers before output.
Simple architecture: stacked 3×3 convolutions
Key Insights:
✓ Simple, uniform architecture
✗ Many parameters (138M for VGG-16)
Residual connections enable very deep networks (152+ layers)
Residual Block: y = F(x) + x
The network learns the residual F(x) instead of the full mapping H(x) = F(x) + x. Identity shortcuts allow gradients to flow directly (a code sketch follows the checklist below).
✓ Solves degradation problem
✓ Easier to optimize
✓ State-of-the-art accuracy
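A minimal PyTorch sketch of the basic, same-shape residual block described above; real ResNets also use a 1×1 projection shortcut when the spatial size or channel count changes, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: y = F(x) + x, with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # identity shortcut added before final ReLU

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)     # torch.Size([1, 64, 32, 32])
```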
Parallel paths with different kernel sizes
Inception Module:
✓ Multi-scale feature extraction
Efficient architecture for mobile devices
Depthwise Separable Convolution:
✓ Fast inference on mobile
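A PyTorch sketch of a depthwise separable convolution as used in MobileNet-style networks, plus a rough parameter comparison against a standard 3×3 convolution (the class name and channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 conv
    followed by a 1x1 (pointwise) conv that mixes channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight comparison for 64 -> 128 channels with 3x3 kernels (biases omitted):
standard = 3 * 3 * 64 * 128            # 73,728 weights
separable = 3 * 3 * 64 + 64 * 128      # 576 + 8,192 = 8,768 weights
print(standard, separable)             # roughly an 8-9x reduction
```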
Transfer learning uses models pre-trained on large datasets as initialization for new tasks:
Step 1
Pre-train on large dataset (e.g., ImageNet 1.4M images)
Step 2
Load pre-trained weights
Step 3
Fine-tune on target task (smaller dataset)
Feature Extraction (Freeze Early Layers)
Freeze early/middle layers, only train final layers. Use when target dataset is small and similar to source.
Fine-Tune All Layers
Train all layers with small learning rate. Use when target dataset is large or different from source.
Gradual Unfreezing
Start with frozen layers, gradually unfreeze from top to bottom. Balances stability and adaptation.
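A hedged PyTorch/torchvision sketch of the feature-extraction strategy; the `weights=` argument assumes a recent torchvision (older versions use `pretrained=True`), and the 10-class output layer is an illustrative choice.

```python
import torch.nn as nn
from torchvision import models

# Steps 1-2: load a model pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Feature extraction: freeze all pre-trained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final classifier for the target task (e.g., 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)   # new layer trains from scratch

# Fine-tuning all layers instead: leave requires_grad=True and use a small
# learning rate, e.g. torch.optim.Adam(model.parameters(), lr=1e-4).
```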
Normalizes layer inputs to stabilize training:
• Computed per mini-batch during training
• Learnable parameters: γ (scale) and β (shift)
• Use running statistics at test time
Randomly drop neurons during training to prevent overfitting:
• Typical: p = 0.5 (50% dropout)
• Training: randomly set each activation to 0 with probability p
• Testing: use all neurons, scale activations by the keep probability 1 − p
• Forces network to learn redundant representations
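A NumPy sketch that mirrors the description above (drop with probability p during training, scale by the keep probability at test time); note that most frameworks instead use "inverted" dropout, scaling by 1/(1 − p) during training so no test-time change is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    """Training: randomly zero each activation with probability p."""
    mask = rng.random(a.shape) >= p   # True = keep (probability 1 - p)
    return a * mask

def dropout_test(a, p=0.5):
    """Testing: keep all activations, scale by the keep probability 1 - p
    so their expected magnitude matches training."""
    return a * (1.0 - p)

a = np.ones(8)
print(dropout_train(a))   # roughly half the entries zeroed
print(dropout_test(a))    # all entries scaled to 0.5
```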
Artificially expand the training set with label-preserving transforms such as random flips, crops, rotations, and color jitter.
Adjust the learning rate during training, typically decaying it on a schedule (a step-decay sketch follows):
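A minimal example, assuming a step-decay schedule (the drop factor and interval are illustrative):

```python
def step_decay(lr0, epoch, drop=0.5, every=10):
    """Step decay: multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

print([round(step_decay(0.1, e), 4) for e in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]
```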
| Layer | Type | Output Shape | Parameters |
|---|---|---|---|
| Input | - | 32×32×3 | 0 |
| Conv1 | 3×3×32, ReLU | 32×32×32 | 896 |
| MaxPool1 | 2×2, stride 2 | 16×16×32 | 0 |
| Conv2 | 3×3×64, ReLU | 16×16×64 | 18,496 |
| MaxPool2 | 2×2, stride 2 | 8×8×64 | 0 |
| Flatten | - | 4096 | 0 |
| FC1 | 128, ReLU | 128 | 524,416 |
| Dropout | p=0.5 | 128 | 0 |
| Output | 10, Softmax | 10 | 1,290 |
| Total Parameters | | | 545,098 |
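The same architecture expressed as a PyTorch `nn.Sequential`, which reproduces the parameter count in the table (softmax is usually folded into the loss, e.g. `nn.CrossEntropyLoss`, so it is omitted here):

```python
import torch.nn as nn

# The architecture from the table above (32x32x3 input, e.g. CIFAR-10).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),     # Conv1: 896 params
    nn.MaxPool2d(2, 2),                            # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),    # Conv2: 18,496 params
    nn.MaxPool2d(2, 2),                            # 16x16 -> 8x8
    nn.Flatten(),                                  # 8*8*64 = 4096
    nn.Linear(4096, 128), nn.ReLU(),               # FC1: 524,416 params
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),                            # Output: 1,290 params
)

print(sum(p.numel() for p in model.parameters()))  # 545098
```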
For input X and kernel/filter K:
Y_ij = Σ_m Σ_n X_(i+m)(j+n) · K_mn
For a multi-channel input (e.g., an RGB image with C channels):
Y^d_ij = Σ_c Σ_m Σ_n X^c_(i+m)(j+n) · K^(c,d)_mn + b_d
where d indexes output channels (filters) and c runs over the C input channels.
Parameters: (K × K × C_in + 1) × C_out per layer: each of the C_out filters has K × K × C_in weights plus one bias (e.g., (3 × 3 × 64 + 1) × 128 = 73,856).
Step-by-Step Derivation:
1. Padding adds P pixels on each side, so the padded input has size W + 2P.
2. The kernel's left edge can sit at positions 0 through W + 2P − K, i.e., W − K + 2P + 1 positions with stride 1.
3. With stride S, only every S-th position is used, giving ⌊(W − K + 2P)/S⌋ + 1 outputs.
General Output Size Formula:
O = ⌊(W − K + 2P) / S⌋ + 1
Common Cases:
• "Same" convolution (S = 1, P = (K − 1)/2): O = W
• No padding, stride 1 (P = 0, S = 1): O = W − K + 1
• Stride 2 with "same"-style padding: O ≈ W / 2
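A tiny helper that applies the formula (the function name is illustrative):

```python
def conv_output_size(W, K, P=0, S=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(32, 3, P=1, S=1))  # 32  ("same" padding preserves size)
print(conv_output_size(32, 3, P=0, S=1))  # 30  (valid convolution shrinks by K - 1)
print(conv_output_size(32, 3, P=1, S=2))  # 16  (stride 2 halves the size)
```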
Receptive field: Region in input that affects a single output neuron
For L stacked convolutional layers:
RF = K_1 + Σ_(i=2..L) (K_i − 1) · Π_(j=1..i−1) S_j
where K_i is kernel size and S_j is stride of layer j.
Example: 3 layers, K=3, S=1:
RF = 3 + (3-1)×1 + (3-1)×1 = 3 + 2 + 2 = 7 pixels
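A sketch of the same recurrence in code, reproducing the 7-pixel example (the function name is illustrative):

```python
def receptive_field(kernels, strides):
    """RF = K_1 + sum over later layers of (K_i - 1) * product of earlier strides."""
    rf, jump = kernels[0], 1
    for K, S_prev in zip(kernels[1:], strides[:-1]):
        jump *= S_prev                 # cumulative stride of the layers below
        rf += (K - 1) * jump
    return rf

print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7, matching the example above
```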
Given the gradient ∂L/∂Y from the next layer, we need to compute three gradients: ∂L/∂K (kernel weights), ∂L/∂X (input, to continue backpropagation), and ∂L/∂b (bias).
Each kernel weight K_mn affects multiple output positions. Using chain rule:
Since Y_ij = Σ X_(i+m)(j+n) K_mn, we have ∂Y_ij/∂K_mn = X_(i+m)(j+n)
Result:
∂L/∂K_mn = Σ_i Σ_j ∂L/∂Y_ij · X_(i+m)(j+n)
Gradient w.r.t. kernel is the convolution of input with gradient!
Each input pixel X_pq affects multiple outputs. Using chain rule:
Y_ij involves X_pq when (i+m, j+n) = (p, q), i.e., when i = p-m, j = q-n
Result:
∂L/∂X_pq = Σ_m Σ_n ∂L/∂Y_(p−m)(q−n) · K_mn
Gradient w.r.t. input is "full" convolution with 180° rotated kernel!
The bias is added to every output position, so:
∂L/∂b = Σ_i Σ_j ∂L/∂Y_ij
Simply sum all gradients in the output feature map!
Key observation: All three gradients can be computed using convolution operations!
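A NumPy sketch (single channel, stride 1, valid convolution) that implements the three gradient results above; correctness can be checked against finite differences or an autograd framework.

```python
import numpy as np

def conv2d(X, K):
    """Forward: Y_ij = sum_mn X_(i+m)(j+n) * K_mn (valid, stride 1)."""
    H, W = X.shape
    kH, kW = K.shape
    Y = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i+kH, j:j+kW] * K)
    return Y

def conv2d_backward(X, K, dY):
    """Backward pass implementing the three results derived above."""
    # dL/dK: convolution of the input with the output gradient.
    dK = conv2d(X, dY)
    # dL/dX: "full" convolution of dY with the 180-degree rotated kernel.
    kH, kW = K.shape
    dY_padded = np.pad(dY, ((kH - 1, kH - 1), (kW - 1, kW - 1)))
    dX = conv2d(dY_padded, K[::-1, ::-1])
    # dL/db: sum of all gradients in the output feature map.
    db = dY.sum()
    return dK, dX, db

X = np.random.randn(5, 5)
K = np.random.randn(3, 3)
dY = np.ones((3, 3))           # pretend dL/dY is all ones
dK, dX, db = conv2d_backward(X, K, dY)
print(dK.shape, dX.shape, db)  # (3, 3) (5, 5) 9.0
```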
For a mini-batch B = {x_1, …, x_m} of activations:
Step 1: Compute batch mean
μ_B = (1/m) Σ_i x_i
Step 2: Compute batch variance
σ²_B = (1/m) Σ_i (x_i − μ_B)²
Step 3: Normalize
x̂_i = (x_i − μ_B) / √(σ²_B + ε)
ε (e.g., 10^-5) prevents division by zero
Step 4: Scale and shift (learnable parameters)
y_i = γ · x̂_i + β
γ and β are learned to restore representational power
Given gradient ∂L/∂y, compute ∂L/∂x, ∂L/∂γ, ∂L/∂β
Gradients for scale/shift parameters:
∂L/∂γ = Σ_i ∂L/∂y_i · x̂_i,   ∂L/∂β = Σ_i ∂L/∂y_i
Gradient w.r.t. normalized input:
∂L/∂x̂_i = ∂L/∂y_i · γ
Gradient w.r.t. variance:
∂L/∂σ²_B = Σ_i ∂L/∂x̂_i · (x_i − μ_B) · (−1/2) · (σ²_B + ε)^(−3/2)
Gradient w.r.t. mean:
∂L/∂μ_B = Σ_i ∂L/∂x̂_i · (−1/√(σ²_B + ε)) + ∂L/∂σ²_B · (1/m) Σ_i (−2)(x_i − μ_B)
Final gradient w.r.t. input:
∂L/∂x_i = ∂L/∂x̂_i / √(σ²_B + ε) + ∂L/∂σ²_B · 2(x_i − μ_B)/m + ∂L/∂μ_B / m
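A NumPy sketch that implements both the forward steps and the backward formulas above for a 2-D (batch × features) input; per-channel BN for conv feature maps reduces over the batch and spatial axes instead.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # Step 1: batch mean
    var = x.var(axis=0)                       # Step 2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # Step 3: normalize
    y = gamma * x_hat + beta                  # Step 4: scale and shift
    return y, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dy, cache):
    """Implements the gradient formulas listed above."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0) \
          + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta

x = np.random.randn(4, 3)                     # mini-batch of 4, 3 features
y, cache = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
dx, dgamma, dbeta = batchnorm_backward(np.ones_like(y), cache)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```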
Reduces Internal Covariate Shift
Normalizes activations, keeping their distribution stable during training. Each layer doesn't have to constantly adapt to changing input distributions.
Smooths Loss Landscape
Makes optimization landscape more Lipschitz smooth, allowing larger learning rates and faster convergence.
x_(l+1) = x_l + F(x_l)
where F is the residual function (e.g., conv-BN-ReLU-conv-BN) and x_l is the input to layer l, so x_(l+1) is its output.
Key insight: the identity mapping x_l is added directly to the learned residual F(x_l).
Consider the gradient flowing back from layer L to layer l (l < L). Unrolling the recursion above:
x_L = x_l + Σ_(i=l..L−1) F(x_i)
Applying the chain rule:
∂Loss/∂x_l = ∂Loss/∂x_L · ∂x_L/∂x_l
Note: the identity (I) appears in ∂x_L/∂x_l! This is crucial.
Full gradient from layer L to layer l:
∂Loss/∂x_l = ∂Loss/∂x_L · (I + ∂/∂x_l Σ_(i=l..L−1) F(x_i))
Key Observation: the additive identity term lets ∂Loss/∂x_L reach layer l unchanged, so the gradient cannot vanish even when the residual terms become very small.
Comparison: in a plain network the gradient is a product of L − l layer Jacobians, which can shrink or grow exponentially with depth; skip connections replace this pure product with a sum that always contains the identity path.
Highway for gradients: Skip connections create "information superhighways"
This enables training networks with 100+ layers (ResNet-152, ResNet-1000)!
Test your understanding with 10 multiple-choice questions