
Deep Learning & CNNs

Master convolutional neural networks and modern deep learning architectures

Convolutional Neural Networks

Core Building Blocks

1. Convolutional Layer

Applies learnable filters (kernels) to extract local features:

y_{i,j,k} = \sum_{c}\sum_{m}\sum_{n} w_{m,n,c,k} \cdot x_{i+m,j+n,c} + b_k

• Slide kernel over input

• Element-wise multiply and sum

• Shared weights across spatial locations
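
As a concrete illustration, here is a minimal PyTorch sketch (the channel and image sizes are arbitrary) showing a convolutional layer sliding its shared filters over a batch of images:

```python
import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels, 16 filters, 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 32, 32)        # batch of 8 RGB images, 32x32
y = conv(x)                          # each filter is shared across all spatial locations

print(y.shape)                       # torch.Size([8, 16, 32, 32]) with padding=1
print(sum(p.numel() for p in conv.parameters()))   # (3*3*3 + 1) * 16 = 448
```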

Output Size Formula

O = \frac{W - K + 2P}{S} + 1

• W = input width/height

• K = kernel size

• P = padding

• S = stride

Parameter Count

\text{params} = (K \times K \times C_{\text{in}} + 1) \times C_{\text{out}}

• Far fewer parameters than a fully connected layer

• Example: 3×3×64 filters, 128 output channels = 73,856 params
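
The example above can be checked with a couple of lines of plain Python (the helper name is just for illustration):

```python
def conv_param_count(k, c_in, c_out):
    """params = (K*K*C_in + 1) * C_out; the +1 is the bias term per filter."""
    return (k * k * c_in + 1) * c_out

print(conv_param_count(3, 64, 128))   # 73856, matching the example above
```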

2. Pooling Layer

Max Pooling

Take maximum value in each window. Captures strongest feature activation.

Most common choice

Average Pooling

Take average value in each window. Smoother down-sampling.

Used in some architectures

3. Fully Connected Layer

Final layers flatten feature maps and perform classification. Typically 1-2 FC layers before output.

Modern Architectures

VGGNet (2014)

Simple architecture: stacked 3×3 convolutions

Key Insights:

  • Use small (3×3) filters consistently
  • Stack multiple conv layers before pooling
  • Two 3×3 convs = same receptive field as 5×5 but fewer params
  • Depth matters: VGG-16, VGG-19

✓ Simple, uniform architecture

✗ Many parameters (138M for VGG-16)

ResNet (2015)

Residual connections enable very deep networks (152+ layers)

Residual Block:

\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}

Network learns residual instead of full mapping. Identity shortcuts allow gradient flow.

✓ Solves degradation problem

✓ Easier to optimize

✓ State-of-the-art accuracy

Inception (GoogLeNet)

Parallel paths with different kernel sizes

Inception Module:

  • 1×1 conv (channel reduction)
  • 3×3 conv
  • 5×5 conv
  • 3×3 max pool
  → Concatenate all outputs

✓ Multi-scale feature extraction

MobileNet

Efficient architecture for mobile devices

Depthwise Separable Convolution:

  1. Depthwise: 3×3 per channel
  2. Pointwise: 1×1 to combine
  → 8-9× fewer parameters (see the sketch below)

✓ Fast inference on mobile
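
A minimal PyTorch sketch of the idea (channel sizes chosen only for illustration), comparing parameter counts of a standard 3×3 convolution against its depthwise separable counterpart:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

c_in, c_out = 64, 128

# Standard 3x3 convolution: one kernel mixes space and channels at once.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Depthwise separable: a 3x3 depthwise conv per channel (groups=c_in),
# followed by a 1x1 pointwise conv that combines channels.
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

print(n_params(standard))                         # (3*3*64 + 1) * 128 = 73,856
print(n_params(depthwise) + n_params(pointwise))  # 640 + 8,320 = 8,960 (~8x fewer)
```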

Transfer Learning

Leveraging Pre-trained Models

Transfer learning uses models pre-trained on large datasets as initialization for new tasks:

Step 1

Pre-train on large dataset (e.g., ImageNet 1.4M images)

Step 2

Load pre-trained weights

Step 3

Fine-tune on target task (smaller dataset)

Fine-Tuning Strategies

Feature Extraction (Freeze Early Layers)

Freeze early/middle layers, only train final layers. Use when target dataset is small and similar to source.

Fine-Tune All Layers

Train all layers with small learning rate. Use when target dataset is large or different from source.

Gradual Unfreezing

Start with frozen layers, gradually unfreeze from top to bottom. Balances stability and adaptation.
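
A sketch of the feature-extraction strategy in PyTorch, assuming a recent torchvision is available and using ResNet-18 as the pre-trained backbone (the 10-class head is illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone (weights download on first use).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze all pre-trained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final classifier for the target task (here, 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)    # new head trains from scratch

# Only the new head is optimized; for full fine-tuning, pass model.parameters()
# with a smaller learning rate instead.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```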

Deep Learning Techniques

Batch Normalization

Normalizes layer inputs to stabilize training:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta

• Computed per mini-batch during training
• Learnable parameters: γ (scale) and β (shift)
• Use running statistics at test time
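
In PyTorch this train/test distinction is handled by the module's mode, as in this small sketch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)   # learnable gamma (weight) and beta (bias) per channel
x = torch.randn(8, 16, 32, 32)

bn.train()       # training: normalize with this mini-batch's mean/variance,
y = bn(x)        # and update running_mean / running_var

bn.eval()        # testing: normalize with the accumulated running statistics
y = bn(x)
```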

Dropout

Randomly drop neurons during training to prevent overfitting:

\mathbf{a}_{\text{dropout}} = \mathbf{a} \odot \mathbf{m}, \quad m_i \sim \text{Bernoulli}(p)

• Typical: p = 0.5 (50% dropout)
• Training: randomly set activations to 0
• Testing: use all neurons, scale activations by p
• Forces network to learn redundant representations
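
A small NumPy sketch of both formulations: classic dropout as described above, and the "inverted" variant that frameworks such as PyTorch actually implement:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training: keep each unit with probability p (mask ~ Bernoulli(p))."""
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    """Testing (classic formulation): keep all units, scale activations by p."""
    return a * p

def inverted_dropout_train(a, p):
    """Inverted dropout: divide by p during training, so no scaling is needed at test time."""
    mask = rng.random(a.shape) < p
    return a * mask / p
```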

Data Augmentation

Artificially expand training data:

  • Rotation, flipping, scaling
  • Random crops and color jittering
  • Cutout, mixup, cutmix
  • Increases model robustness

Learning Rate Scheduling

Adjust learning rate during training:

  • Step decay: reduce by a factor every N epochs
  • Exponential decay: η_t = η_0 · γ^t with decay factor γ < 1
  • Cosine annealing: smooth cosine-shaped decay
  • Improves convergence and final accuracy
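
A brief PyTorch sketch of step decay (the model and numbers are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Alternatives: ExponentialLR(optimizer, gamma=0.95),
#               CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(90):
    # ... one epoch of training: optimizer.step() once per batch ...
    scheduler.step()                               # advance the schedule once per epoch
    if epoch % 30 == 0:
        print(epoch, scheduler.get_last_lr())      # watch the learning rate drop
```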

Example: Image Classification Architecture

Simple CNN for 32×32 RGB Images, 10 Classes
| Layer    | Type          | Output Shape | Parameters |
|----------|---------------|--------------|------------|
| Input    | -             | 32×32×3      | 0          |
| Conv1    | 3×3×32, ReLU  | 32×32×32     | 896        |
| MaxPool1 | 2×2, stride 2 | 16×16×32     | 0          |
| Conv2    | 3×3×64, ReLU  | 16×16×64     | 18,496     |
| MaxPool2 | 2×2, stride 2 | 8×8×64       | 0          |
| Flatten  | -             | 4096         | 0          |
| FC1      | 128, ReLU     | 128          | 524,416    |
| Dropout  | p=0.5         | 128          | 0          |
| Output   | 10, Softmax   | 10           | 1,290      |
| Total    |               |              | 545,098    |

Key Design Choices

  • Small 3×3 filters (efficient)
  • Increasing channels: 3→32→64 (capture more complex features)
  • Pooling after conv (reduce dimensions)
  • Dropout before output (regularization)

Training Setup

  • Optimizer: Adam (lr=0.001)
  • Loss: Categorical cross-entropy
  • Batch size: 64
  • Data augmentation: horizontal flip, random crop (see the sketch below)
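
The table and training setup above translate fairly directly into a PyTorch sketch (class name and minor details are illustrative):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the table above, assuming 32x32 RGB inputs."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # 32x32x32
            nn.MaxPool2d(2, stride=2),                               # 16x16x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 16x16x64
            nn.MaxPool2d(2, stride=2),                               # 8x8x64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                            # 4096
            nn.Linear(8 * 8 * 64, 128), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, num_classes),                             # logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
print(sum(p.numel() for p in model.parameters()))   # 545,098 as in the table

# Training setup from above: Adam + cross-entropy.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```

Note that `nn.CrossEntropyLoss` expects raw logits and applies the softmax internally, which is why the final layer in the sketch is a plain `nn.Linear`.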

Convolution Operation: Mathematical Analysis

2D Convolution and Output Size Calculation

2D Discrete Convolution

For input X and kernel/filter K:

(X * K)_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m, j+n} \cdot K_{m,n}

For multi-channel input (e.g., RGB image with c channels):

Y_{ij}^{(d)} = \sum_{c=1}^{C} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m, j+n}^{(c)} \cdot K_{m,n}^{(c,d)} + b^{(d)}

where d indexes output channels (filters), C is input channels

Output Size Formula Derivation

Parameters:

  • Input size: W × H
  • Kernel size: K × K
  • Stride: S (step size)
  • Padding: P (pixels added to borders)

Step-by-Step Derivation:

  1. With padding P, the effective input size is (W+2P) × (H+2P)
  2. The kernel covers K pixels in each dimension, so with stride 1 there are (W+2P-K+1) valid positions
  3. With stride S, the number of positions becomes floor((W+2P-K)/S) + 1

General Output Size Formula:

W_{\text{out}} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1, \quad H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1

Common Cases:

  • Same padding: P = (K-1)/2, S=1 → W_out = W (preserves size)
  • Valid padding: P = 0 → shrinks by (K-1) pixels
  • Stride 2: Roughly halves dimensions
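
The formula and the three common cases can be verified with a tiny Python helper:

```python
def conv_output_size(w, k, p, s):
    """floor((W + 2P - K) / S) + 1"""
    return (w + 2 * p - k) // s + 1

print(conv_output_size(32, 3, 1, 1))   # 32  ("same" padding preserves size)
print(conv_output_size(32, 3, 0, 1))   # 30  ("valid" padding shrinks by K-1)
print(conv_output_size(32, 3, 1, 2))   # 16  (stride 2 roughly halves dimensions)
```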

Receptive Field Calculation

Receptive field: Region in input that affects a single output neuron

For L stacked convolutional layers:

RF = K_1 + \sum_{i=2}^{L} (K_i - 1) \prod_{j=1}^{i-1} S_j

where K_i is the kernel size of layer i and S_j is the stride of layer j.

Example: 3 layers, K=3, S=1:

RF = 3 + (3-1)×1 + (3-1)×1 = 3 + 2 + 2 = 7 pixels
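
A small Python helper implementing the receptive-field recurrence (the function name is just for illustration):

```python
def receptive_field(kernels, strides):
    """RF = K_1 + sum over i>=2 of (K_i - 1) * prod of S_j for j < i."""
    rf, jump = kernels[0], 1
    for k, s_prev in zip(kernels[1:], strides[:-1]):
        jump *= s_prev                 # cumulative stride of the layers below
        rf += (k - 1) * jump
    return rf

print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7, as in the example above
print(receptive_field([3, 3, 3], [2, 2, 1]))   # 3 + 2*2 + 2*4 = 15
```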

Backpropagation Through Convolution Layers

Gradient Computation for Convolutional Layers

Forward Pass Notation

Y = X * K + b

Given the gradient from the next layer, ∂L/∂Y, we need to compute:

  • ∂L/∂K (gradient w.r.t. kernel weights)
  • ∂L/∂b (gradient w.r.t. bias)
  • ∂L/∂X (gradient to propagate backward)

Gradient w.r.t. Kernel: ∂L/∂K

Each kernel weight K_mn affects multiple output positions. Using chain rule:

\frac{\partial L}{\partial K_{mn}} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial K_{mn}}

Since Y_ij = Σ X_(i+m)(j+n) K_mn, we have ∂Y_ij/∂K_mn = X_(i+m)(j+n)

Result:

\frac{\partial L}{\partial K} = \frac{\partial L}{\partial Y} * X

Gradient w.r.t. kernel is the convolution of input with gradient!

Gradient w.r.t. Input: ∂L/∂X

Each input pixel X_pq affects multiple outputs. Using chain rule:

\frac{\partial L}{\partial X_{pq}} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial X_{pq}}

Y_ij involves X_pq when (i+m, j+n) = (p, q), i.e., when i = p-m, j = q-n

\frac{\partial Y_{ij}}{\partial X_{pq}} = \begin{cases} K_{p-i, q-j} & \text{if } 0 \leq p-i < k,\ 0 \leq q-j < k \\ 0 & \text{otherwise} \end{cases}

Result:

\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} * K_{\text{rot180}}

Gradient w.r.t. input is "full" convolution with 180° rotated kernel!

Gradient w.r.t. Bias

Bias is added to all output positions, so:

\frac{\partial L}{\partial b} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}}

Simply sum all gradients in the output feature map!
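
The kernel and bias identities can be spot-checked against autograd with a tiny PyTorch example (PyTorch's `conv2d` uses the same cross-correlation convention as the formulas above; the shapes are chosen so the check stays simple):

```python
import torch
import torch.nn.functional as F

# Tiny check: 1 sample, 1 channel, 5x5 input, 3x3 kernel, no padding -> 3x3 output.
x = torch.randn(1, 1, 5, 5, requires_grad=True)
k = torch.randn(1, 1, 3, 3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

y = F.conv2d(x, k, bias=b)
upstream = torch.randn_like(y)         # stand-in for dL/dY from the next layer
y.backward(upstream)

# dL/db = sum of all entries of dL/dY.
print(torch.allclose(b.grad, upstream.sum(dim=(0, 2, 3))))       # True

# dL/dK = (cross-)correlation of the input with dL/dY.
print(torch.allclose(k.grad, F.conv2d(x.detach(), upstream)))    # True
```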

Implementation Insight

Key observation: All three gradients can be computed using convolution operations!

  • Modern frameworks (PyTorch, TensorFlow) use highly optimized convolution kernels
  • Backprop through a conv layer has similar complexity to the forward pass
  • Both forward and backward passes can leverage GPU parallelization

Batch Normalization Mathematical Derivation

Normalizing Activations for Stable Training

Forward Pass Algorithm

For mini-batch B of activations:

Step 1: Compute batch mean

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i

Step 2: Compute batch variance

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

Step 3: Normalize

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

ε (e.g., 10^-5) prevents division by zero

Step 4: Scale and shift (learnable parameters)

y_i = \gamma \hat{x}_i + \beta

γ and β are learned to restore representational power

Backward Pass: Gradient Derivation

Given gradient ∂L/∂y, compute ∂L/∂x, ∂L/∂γ, ∂L/∂β

Gradients for scale/shift parameters:

\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}

Gradient w.r.t. normalized input:

\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma

Gradient w.r.t. variance:

\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-1}{2}(\sigma_B^2 + \epsilon)^{-3/2}

Gradient w.r.t. mean:

\frac{\partial L}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{-2\sum_i(x_i-\mu_B)}{m}

Final gradient w.r.t. input:

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i-\mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}
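
A NumPy sketch that mirrors these forward and backward formulas directly, for per-feature batch norm over a mini-batch of shape (m, d):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch. Mirrors Steps 1-4 above; returns a cache for backward."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)                         # biased variance, as in Step 2
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dy, cache):
    """Mirrors the gradient formulas above; returns dL/dx, dL/dgamma, dL/dbeta."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0) \
          + dvar * -2.0 * np.sum(x - mu, axis=0) / m
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```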

Why Batch Norm Works

Reduces Internal Covariate Shift

Normalizes activations, keeping their distribution stable during training. Each layer doesn't have to constantly adapt to changing input distributions.

Smooths Loss Landscape

Makes optimization landscape more Lipschitz smooth, allowing larger learning rates and faster convergence.

Residual Connections: Gradient Flow Analysis

Why Skip Connections Enable Very Deep Networks

Residual Block Formulation

\mathbf{x}_{l+1} = \mathbf{x}_l + F(\mathbf{x}_l, W_l)

where F is the residual function (e.g., conv-BN-ReLU-conv-BN) and x_l is the input to block l (the output of block l-1).

Key insight: Identity mapping (x_l) is added directly to learned residual F(x_l).
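
A minimal PyTorch sketch of a same-shape basic residual block (real ResNet blocks also handle downsampling with a strided projection shortcut, which is omitted here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic block: x_{l+1} = x_l + F(x_l), with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.residual(x))   # identity shortcut + residual F

x = torch.randn(4, 64, 56, 56)
print(ResidualBlock(64)(x).shape)                  # torch.Size([4, 64, 56, 56])
```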

Gradient Flow Through Skip Connections

Consider gradient flowing back from layer L to layer l (l < L):

\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial}{\partial \mathbf{x}_l}\left(\mathbf{x}_l + \sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)\right)

Applying chain rule:

\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \mathbf{I} + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)

Note: Identity (I) appears! This is crucial.

Why This Solves Vanishing Gradients

Full gradient from layer L to layer l:

\frac{\partial L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \left(\mathbf{I} + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)\right)

Key Observation:

  • The identity term (I) ensures the gradient has a direct path backward
  • Even if the residual gradient (∂F/∂x) vanishes, the identity gradient remains
  • The gradient cannot vanish completely, thanks to the identity shortcut

Comparison:

  • Without skip: ∂x_L/∂x_l = Π (many small terms) → vanishes
  • With skip: ∂x_L/∂x_l = I + ... → always at least I

Mathematical Intuition

Highway for gradients: Skip connections create "information superhighways"

  • Gradients can flow directly through identity shortcuts
  • Residual blocks learn refinements (F) on top of the identity
  • If F is detrimental, the network can learn to push it close to zero
  • It is easier to learn identity + small adjustments than a full transformation

This enables training networks with 100+ layers (e.g., ResNet-152, and experimental variants with over 1,000 layers)!

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. In a convolutional layer with input 32×32×3, kernel 5×5, stride 1, padding 0, how many pixels does the output have per channel?
  2. What is the primary purpose of pooling layers in CNNs?
  3. In max pooling with 2×2 window and stride 2, what happens to a 28×28 feature map?
  4. How many parameters does a convolutional layer have with 32 filters, 3×3 kernel, and input channels 64?
  5. What innovation did ResNet introduce to enable very deep networks?
  6. What is transfer learning?
  7. In a CNN for image classification, which layers typically learn the most general features?
  8. What is the main advantage of 1×1 convolutions?
  9. What problem does batch normalization primarily address?
  10. Which architecture introduced the idea of 'Inception modules' with multiple parallel conv paths?