
Deep Learning & CNNs

Master convolutional neural networks and modern deep learning architectures

Convolutional Neural Networks

Core Building Blocks

1. Convolutional Layer

Applies learnable filters (kernels) to extract local features:

y_{i,j,k} = \sum_{c}\sum_{m}\sum_{n} w_{m,n,c,k} \cdot x_{i+m,j+n,c} + b_k

• Slide kernel over input

• Element-wise multiply and sum

• Shared weights across spatial locations
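
As a concrete illustration, here is a minimal PyTorch sketch (the channel and image sizes are arbitrary) showing a convolutional layer sliding its shared filters over a batch of images:

```python
import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels, 16 filters, 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 32, 32)        # batch of 8 RGB images, 32x32
y = conv(x)                          # each filter is shared across all spatial locations

print(y.shape)                       # torch.Size([8, 16, 32, 32]) with padding=1
print(sum(p.numel() for p in conv.parameters()))   # (3*3*3 + 1) * 16 = 448
```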

Output Size Formula

O = \frac{W - K + 2P}{S} + 1

• W = input width/height

• K = kernel size

• P = padding

• S = stride

Parameter Count

\text{params} = (K \times K \times C_{\text{in}} + 1) \times C_{\text{out}}

• Far fewer parameters than a fully connected layer

• Example: 3×3×64 filters, 128 output channels = 73,856 params
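
The example above can be checked with a couple of lines of plain Python (the helper name is just for illustration):

```python
def conv_param_count(k, c_in, c_out):
    """params = (K*K*C_in + 1) * C_out; the +1 is the bias term per filter."""
    return (k * k * c_in + 1) * c_out

print(conv_param_count(3, 64, 128))   # 73856, matching the example above
```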

2. Pooling Layer

Max Pooling

Take maximum value in each window. Captures strongest feature activation.

Most common choice

Average Pooling

Take average value in each window. Smoother down-sampling.

Used in some architectures

3. Fully Connected Layer

Final layers flatten feature maps and perform classification. Typically 1-2 FC layers before output.

Modern Architectures

VGGNet (2014)

Simple architecture: stacked 3×3 convolutions

Key Insights:

  • Use small (3×3) filters consistently
  • Stack multiple conv layers before pooling
  • Two 3×3 convs = same receptive field as 5×5 but fewer params
  • Depth matters: VGG-16, VGG-19

✓ Simple, uniform architecture

✗ Many parameters (138M for VGG-16)

ResNet (2015)

Residual connections enable very deep networks (152+ layers)

Residual Block:

\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}

Network learns residual instead of full mapping. Identity shortcuts allow gradient flow.

✓ Solves degradation problem

✓ Easier to optimize

✓ State-of-the-art accuracy

Inception (GoogLeNet)

Parallel paths with different kernel sizes

Inception Module:

  • 1×1 conv (channel reduction)
  • 3×3 conv
  • 5×5 conv
  • 3×3 max pool
  → Concatenate all outputs

✓ Multi-scale feature extraction

MobileNet

Efficient architecture for mobile devices

Depthwise Separable Convolution:

  1. Depthwise: 3×3 per channel
  2. Pointwise: 1×1 to combine
  → 8-9× fewer parameters (see the sketch below)

✓ Fast inference on mobile
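
A minimal PyTorch sketch of the idea (channel sizes chosen only for illustration), comparing parameter counts of a standard 3×3 convolution against its depthwise separable counterpart:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

c_in, c_out = 64, 128

# Standard 3x3 convolution: one kernel mixes space and channels at once.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Depthwise separable: a 3x3 depthwise conv per channel (groups=c_in),
# followed by a 1x1 pointwise conv that combines channels.
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

print(n_params(standard))                         # (3*3*64 + 1) * 128 = 73,856
print(n_params(depthwise) + n_params(pointwise))  # 640 + 8,320 = 8,960 (~8x fewer)
```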

Transfer Learning

Leveraging Pre-trained Models

Transfer learning uses models pre-trained on large datasets as initialization for new tasks:

Step 1

Pre-train on large dataset (e.g., ImageNet 1.4M images)

Step 2

Load pre-trained weights

Step 3

Fine-tune on target task (smaller dataset)

Fine-Tuning Strategies

Feature Extraction (Freeze Early Layers)

Freeze early/middle layers, only train final layers. Use when target dataset is small and similar to source.

Fine-Tune All Layers

Train all layers with small learning rate. Use when target dataset is large or different from source.

Gradual Unfreezing

Start with frozen layers, gradually unfreeze from top to bottom. Balances stability and adaptation.
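
A sketch of the feature-extraction strategy in PyTorch, assuming a recent torchvision is available and using ResNet-18 as the pre-trained backbone (the 10-class head is illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone (weights download on first use).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze all pre-trained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final classifier for the target task (here, 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)    # new head trains from scratch

# Only the new head is optimized; for full fine-tuning, pass model.parameters()
# with a smaller learning rate instead.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```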

Deep Learning Techniques

Batch Normalization

Normalizes layer inputs to stabilize training:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta

• Computed per mini-batch during training
• Learnable parameters: γ (scale) and β (shift)
• Use running statistics at test time
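
In PyTorch this train/test distinction is handled by the module's mode, as in this small sketch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)   # learnable gamma (weight) and beta (bias) per channel
x = torch.randn(8, 16, 32, 32)

bn.train()       # training: normalize with this mini-batch's mean/variance,
y = bn(x)        # and update running_mean / running_var

bn.eval()        # testing: normalize with the accumulated running statistics
y = bn(x)
```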

Dropout

Randomly drop neurons during training to prevent overfitting:

\mathbf{a}_{\text{dropout}} = \mathbf{a} \odot \mathbf{m}, \quad m_i \sim \text{Bernoulli}(p)

• Typical: p = 0.5 (50% dropout)
• Training: randomly set activations to 0
• Testing: use all neurons, scale activations by p
• Forces network to learn redundant representations
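
A small NumPy sketch of both formulations: classic dropout as described above, and the "inverted" variant that frameworks such as PyTorch actually implement:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Training: keep each unit with probability p (mask ~ Bernoulli(p))."""
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    """Testing (classic formulation): keep all units, scale activations by p."""
    return a * p

def inverted_dropout_train(a, p):
    """Inverted dropout: divide by p during training, so no scaling is needed at test time."""
    mask = rng.random(a.shape) < p
    return a * mask / p
```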

Data Augmentation

Artificially expand training data:

  • Rotation, flipping, scaling
  • Random crops and color jittering
  • Cutout, mixup, cutmix
  • Increases model robustness

Learning Rate Scheduling

Adjust learning rate during training:

  • Step decay: reduce by a factor every N epochs
  • Exponential decay: η_t = η_0 · γ^t with decay factor γ < 1
  • Cosine annealing: smooth cosine-shaped decay
  • Improves convergence and final accuracy
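
A brief PyTorch sketch of step decay (the model and numbers are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Alternatives: ExponentialLR(optimizer, gamma=0.95),
#               CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(90):
    # ... one epoch of training: optimizer.step() once per batch ...
    scheduler.step()                               # advance the schedule once per epoch
    if epoch % 30 == 0:
        print(epoch, scheduler.get_last_lr())      # watch the learning rate drop
```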

Example: Image Classification Architecture

Simple CNN for 32×32 RGB Images, 10 Classes
| Layer    | Type          | Output Shape | Parameters |
|----------|---------------|--------------|------------|
| Input    | -             | 32×32×3      | 0          |
| Conv1    | 3×3×32, ReLU  | 32×32×32     | 896        |
| MaxPool1 | 2×2, stride 2 | 16×16×32     | 0          |
| Conv2    | 3×3×64, ReLU  | 16×16×64     | 18,496     |
| MaxPool2 | 2×2, stride 2 | 8×8×64       | 0          |
| Flatten  | -             | 4096         | 0          |
| FC1      | 128, ReLU     | 128          | 524,416    |
| Dropout  | p=0.5         | 128          | 0          |
| Output   | 10, Softmax   | 10           | 1,290      |
| Total    |               |              | 545,098    |

Key Design Choices

  • Small 3×3 filters (efficient)
  • Increasing channels: 3→32→64 (capture more complex features)
  • Pooling after conv (reduce dimensions)
  • Dropout before output (regularization)

Training Setup

  • Optimizer: Adam (lr=0.001)
  • Loss: Categorical cross-entropy
  • Batch size: 64
  • Data augmentation: horizontal flip, random crop (see the sketch below)
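
The table and training setup above translate fairly directly into a PyTorch sketch (class name and minor details are illustrative):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the table above, assuming 32x32 RGB inputs."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # 32x32x32
            nn.MaxPool2d(2, stride=2),                               # 16x16x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 16x16x64
            nn.MaxPool2d(2, stride=2),                               # 8x8x64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                            # 4096
            nn.Linear(8 * 8 * 64, 128), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, num_classes),                             # logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
print(sum(p.numel() for p in model.parameters()))   # 545,098 as in the table

# Training setup from above: Adam + cross-entropy.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```

Note that `nn.CrossEntropyLoss` expects raw logits and applies the softmax internally, which is why the final layer in the sketch is a plain `nn.Linear`.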

Convolution Operation: Mathematical Analysis

2D Convolution and Output Size Calculation

2D Discrete Convolution

For input X and kernel/filter K:

(X * K)_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m, j+n} \cdot K_{m,n}

For multi-channel input (e.g., RGB image with c channels):

Y_{ij}^{(d)} = \sum_{c=1}^{C} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m, j+n}^{(c)} \cdot K_{m,n}^{(c,d)} + b^{(d)}

where d indexes output channels (filters), C is input channels

Output Size Formula Derivation

Parameters:

  • Input size: W × H
  • Kernel size: K × K
  • Stride: S (step size)
  • Padding: P (pixels added to borders)

Step-by-Step Derivation:

  1. With padding P, the effective input size is (W+2P) × (H+2P)
  2. The kernel covers K pixels in each dimension, so with stride 1 there are (W+2P-K+1) valid positions
  3. With stride S, the number of positions becomes floor((W+2P-K)/S) + 1

General Output Size Formula:

W_{\text{out}} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1, \quad H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1

Common Cases:

  • Same padding: P = (K-1)/2, S=1 → W_out = W (preserves size)
  • Valid padding: P = 0 → shrinks by (K-1) pixels
  • Stride 2: Roughly halves dimensions
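
The formula and the three common cases can be verified with a tiny Python helper:

```python
def conv_output_size(w, k, p, s):
    """floor((W + 2P - K) / S) + 1"""
    return (w + 2 * p - k) // s + 1

print(conv_output_size(32, 3, 1, 1))   # 32  ("same" padding preserves size)
print(conv_output_size(32, 3, 0, 1))   # 30  ("valid" padding shrinks by K-1)
print(conv_output_size(32, 3, 1, 2))   # 16  (stride 2 roughly halves dimensions)
```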

Receptive Field Calculation

Receptive field: Region in input that affects a single output neuron

For L stacked convolutional layers:

RF = K_1 + \sum_{i=2}^{L} (K_i - 1) \prod_{j=1}^{i-1} S_j

where K_i is the kernel size of layer i and S_j is the stride of layer j.

Example: 3 layers, K=3, S=1:

RF = 3 + (3-1)×1 + (3-1)×1 = 3 + 2 + 2 = 7 pixels
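
A small Python helper implementing the receptive-field recurrence (the function name is just for illustration):

```python
def receptive_field(kernels, strides):
    """RF = K_1 + sum over i>=2 of (K_i - 1) * prod of S_j for j < i."""
    rf, jump = kernels[0], 1
    for k, s_prev in zip(kernels[1:], strides[:-1]):
        jump *= s_prev                 # cumulative stride of the layers below
        rf += (k - 1) * jump
    return rf

print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7, as in the example above
print(receptive_field([3, 3, 3], [2, 2, 1]))   # 3 + 2*2 + 2*4 = 15
```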

Backpropagation Through Convolution Layers

Gradient Computation for Convolutional Layers

Forward Pass Notation

Y = X * K + b

Given the gradient from the next layer, ∂L/∂Y, we need to compute:

  • ∂L/∂K (gradient w.r.t. kernel weights)
  • ∂L/∂b (gradient w.r.t. bias)
  • ∂L/∂X (gradient to propagate backward)

Gradient w.r.t. Kernel: ∂L/∂K

Each kernel weight K_mn affects multiple output positions. Using chain rule:

\frac{\partial L}{\partial K_{mn}} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial K_{mn}}

Since Y_ij = Σ X_(i+m)(j+n) K_mn, we have ∂Y_ij/∂K_mn = X_(i+m)(j+n)

Result:

\frac{\partial L}{\partial K} = \frac{\partial L}{\partial Y} * X

Gradient w.r.t. kernel is the convolution of input with gradient!

Gradient w.r.t. Input: ∂L/∂X

Each input pixel X_pq affects multiple outputs. Using chain rule:

\frac{\partial L}{\partial X_{pq}} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}} \cdot \frac{\partial Y_{ij}}{\partial X_{pq}}

Y_ij involves X_pq when (i+m, j+n) = (p, q), i.e., when i = p-m, j = q-n

\frac{\partial Y_{ij}}{\partial X_{pq}} = \begin{cases} K_{p-i, q-j} & \text{if } 0 \leq p-i < k,\ 0 \leq q-j < k \\ 0 & \text{otherwise} \end{cases}

Result:

\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} * K_{\text{rot180}}

Gradient w.r.t. input is "full" convolution with 180° rotated kernel!

Gradient w.r.t. Bias

Bias is added to all output positions, so:

\frac{\partial L}{\partial b} = \sum_{i,j} \frac{\partial L}{\partial Y_{ij}}

Simply sum all gradients in the output feature map!
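
The kernel and bias identities can be spot-checked against autograd with a tiny PyTorch example (PyTorch's `conv2d` uses the same cross-correlation convention as the formulas above; the shapes are chosen so the check stays simple):

```python
import torch
import torch.nn.functional as F

# Tiny check: 1 sample, 1 channel, 5x5 input, 3x3 kernel, no padding -> 3x3 output.
x = torch.randn(1, 1, 5, 5, requires_grad=True)
k = torch.randn(1, 1, 3, 3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

y = F.conv2d(x, k, bias=b)
upstream = torch.randn_like(y)         # stand-in for dL/dY from the next layer
y.backward(upstream)

# dL/db = sum of all entries of dL/dY.
print(torch.allclose(b.grad, upstream.sum(dim=(0, 2, 3))))       # True

# dL/dK = (cross-)correlation of the input with dL/dY.
print(torch.allclose(k.grad, F.conv2d(x.detach(), upstream)))    # True
```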

Implementation Insight

Key observation: All three gradients can be computed using convolution operations!

  • Modern frameworks (PyTorch, TensorFlow) use highly optimized convolution kernels
  • Backprop through a conv layer has similar complexity to the forward pass
  • Both forward and backward passes can leverage GPU parallelization

Batch Normalization Mathematical Derivation

Normalizing Activations for Stable Training

Forward Pass Algorithm

For mini-batch B of activations:

Step 1: Compute batch mean

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i

Step 2: Compute batch variance

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

Step 3: Normalize

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

ε (e.g., 10^-5) prevents division by zero

Step 4: Scale and shift (learnable parameters)

y_i = \gamma \hat{x}_i + \beta

γ and β are learned to restore representational power

Backward Pass: Gradient Derivation

Given gradient ∂L/∂y, compute ∂L/∂x, ∂L/∂γ, ∂L/∂β

Gradients for scale/shift parameters:

\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}

Gradient w.r.t. normalized input:

\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma

Gradient w.r.t. variance:

\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-1}{2}(\sigma_B^2 + \epsilon)^{-3/2}

Gradient w.r.t. mean:

\frac{\partial L}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{-2\sum_i(x_i-\mu_B)}{m}

Final gradient w.r.t. input:

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i-\mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}
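
A NumPy sketch that mirrors these forward and backward formulas directly, for per-feature batch norm over a mini-batch of shape (m, d):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch. Mirrors Steps 1-4 above; returns a cache for backward."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)                         # biased variance, as in Step 2
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dy, cache):
    """Mirrors the gradient formulas above; returns dL/dx, dL/dgamma, dL/dbeta."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0) \
          + dvar * -2.0 * np.sum(x - mu, axis=0) / m
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```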

Why Batch Norm Works

Reduces Internal Covariate Shift

Normalizes activations, keeping their distribution stable during training. Each layer doesn't have to constantly adapt to changing input distributions.

Smooths Loss Landscape

Makes optimization landscape more Lipschitz smooth, allowing larger learning rates and faster convergence.

Residual Connections: Gradient Flow Analysis

Why Skip Connections Enable Very Deep Networks

Residual Block Formulation

\mathbf{x}_{l+1} = \mathbf{x}_l + F(\mathbf{x}_l, W_l)

where F is the residual function (e.g., conv-BN-ReLU-conv-BN) and x_l is the input to block l (the output of block l-1).

Key insight: Identity mapping (x_l) is added directly to learned residual F(x_l).
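
A minimal PyTorch sketch of a same-shape basic residual block (real ResNet blocks also handle downsampling with a strided projection shortcut, which is omitted here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic block: x_{l+1} = x_l + F(x_l), with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.residual(x))   # identity shortcut + residual F

x = torch.randn(4, 64, 56, 56)
print(ResidualBlock(64)(x).shape)                  # torch.Size([4, 64, 56, 56])
```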

Gradient Flow Through Skip Connections

Consider gradient flowing back from layer L to layer l (l < L):

\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial}{\partial \mathbf{x}_l}\left(\mathbf{x}_l + \sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)\right)

Applying chain rule:

\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \mathbf{I} + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)

Note: Identity (I) appears! This is crucial.

Why This Solves Vanishing Gradients

Full gradient from layer L to layer l:

\frac{\partial L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \left(\mathbf{I} + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)\right)

Key Observation:

  • The identity term (I) ensures the gradient has a direct path backward
  • Even if the residual gradient (∂F/∂x) vanishes, the identity gradient remains
  • The gradient cannot vanish completely, thanks to the identity shortcut

Comparison:

  • Without skip: ∂x_L/∂x_l = Π (many small terms) → vanishes
  • With skip: ∂x_L/∂x_l = I + ... → always at least I

Mathematical Intuition

Highway for gradients: Skip connections create "information superhighways"

  • Gradients can flow directly through identity shortcuts
  • Residual blocks learn refinements (F) on top of the identity
  • If F is detrimental, the network can learn to push it close to zero
  • It is easier to learn identity + small adjustments than a full transformation

This enables training networks with 100+ layers (e.g., ResNet-152, and experimental variants with over 1,000 layers)!

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. In a convolutional layer with input 32×32×3, kernel 5×5, stride 1, padding 0, how many pixels does the output have per channel?
  2. What is the primary purpose of pooling layers in CNNs?
  3. In max pooling with 2×2 window and stride 2, what happens to a 28×28 feature map?
  4. How many parameters does a convolutional layer have with 32 filters, 3×3 kernel, and input channels 64?
  5. What innovation did ResNet introduce to enable very deep networks?
  6. What is transfer learning?
  7. In a CNN for image classification, which layers typically learn the most general features?
  8. What is the main advantage of 1×1 convolutions?
  9. What problem does batch normalization primarily address?
  10. Which architecture introduced the idea of 'Inception modules' with multiple parallel conv paths?