Master perceptrons, multi-layer networks, activation functions, and backpropagation
A perceptron is a binary linear classifier that learns decision boundaries:
• x = input vector
• w = weight vector
• b = bias term
• Prediction: ŷ = 1 if w·x + b > 0, else ŷ = 0
Update weights when prediction is wrong:
w ← w + η (y − ŷ) x,   b ← b + η (y − ŷ)
where η is the learning rate
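To make the update rule concrete, here is a minimal NumPy sketch of perceptron training (the learning rate, epoch count, and the AND-function example are illustrative assumptions, not part of the original):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule: update w and b only on misclassified samples."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)   # weight vector
    b = 0.0                    # bias term
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0   # prediction
            if y_hat != y_i:                             # update only when wrong
                w += eta * (y_i - y_hat) * x_i
                b += eta * (y_i - y_hat)
    return w, b

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if np.dot(w, x) + b > 0 else 0 for x in X])   # [0, 0, 0, 1]
```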
Sigmoid: σ(z) = 1 / (1 + e^(−z))
Range: (0, 1)
Derivative: σ'(z) = σ(z)(1 − σ(z))
✓ Smooth, differentiable
✓ Interpretable as probability
✗ Vanishing gradient problem
✗ Not zero-centered
Tanh: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Range: (-1, 1)
Derivative: tanh'(z) = 1 − tanh²(z)
✓ Zero-centered
✓ Stronger gradient than sigmoid
✗ Still suffers from vanishing gradient
ReLU: ReLU(z) = max(0, z)
Range: [0, ∞)
Derivative: ReLU'(z) = 1 if z > 0, else 0
✓ No vanishing gradient for $z > 0$
✓ Fast to compute
✓ Most popular for deep networks
✗ Dying ReLU problem
Leaky ReLU: f(z) = z if z > 0, else αz (small α, e.g. 0.01)
Range: (-∞, ∞)
Derivative: f'(z) = 1 if z > 0, else α
✓ Fixes dying ReLU problem
✓ Small negative slope allows learning
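The four activations above can be written in a few lines of NumPy; this sketch pairs each function with its derivative (the α = 0.01 default for Leaky ReLU is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # exactly 1 for z > 0, 0 otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), d_sigmoid(z))
print(np.tanh(z), d_tanh(z))
print(relu(z), d_relu(z))
print(leaky_relu(z), d_leaky_relu(z))
```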
Multi-layer networks overcome perceptron limitations by stacking layers with non-linear activations:
Input Layer: a^(0) = x
Hidden Layer l: z^(l) = W^(l) a^(l−1) + b^(l),   a^(l) = σ(z^(l))
Output Layer: ŷ = a^(L) = σ(z^(L))
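A minimal forward-pass sketch under these equations, using sigmoid at every layer (the layer sizes and random initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a^(l) = sigma(W^(l) a^(l-1) + b^(l)) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                     # input -> hidden -> output (illustrative)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.standard_normal(3), weights, biases))   # 2 output activations
```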
A feedforward network with:
• at least one hidden layer
• a non-linear activation function
• sufficiently many hidden neurons
can approximate any continuous function on a compact set to arbitrary precision.
Input Layer: number of neurons = number of features
Hidden Layers: number and size are hyperparameters to tune
Output Layer: number of neurons = number of classes/outputs
Backpropagation efficiently computes gradients using the chain rule, propagating errors backward:
1. Forward Pass:
Compute activations for all layers: z^(l) = W^(l) a^(l−1) + b^(l),   a^(l) = σ(z^(l))
2. Compute Loss: L(a^(L), y)
3. Output Layer Gradient: δ^(L) = ∇_a L ⊙ σ'(z^(L))
4. Backward Pass (layer l = L−1, …, 1): δ^(l) = [(W^(l+1))^T δ^(l+1)] ⊙ σ'(z^(l))
5. Compute Weight Gradients: ∂L/∂W^(l) = δ^(l) (a^(l−1))^T,   ∂L/∂b^(l) = δ^(l)
6. Update Weights: W^(l) ← W^(l) − η ∂L/∂W^(l),   b^(l) ← b^(l) − η ∂L/∂b^(l)
For nested functions y = f(g(x)):  dy/dx = f'(g(x)) · g'(x)
Backpropagation applies this recursively through network layers
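A quick numeric check of the chain rule, comparing the analytic derivative with a finite-difference estimate (the choice f(u) = u² and g(x) = sin x is arbitrary, for illustration only):

```python
import math

# y = f(g(x)) with f(u) = u^2 and g(x) = sin(x)
# chain rule: dy/dx = f'(g(x)) * g'(x) = 2 * sin(x) * cos(x)
x = 0.7
analytic = 2 * math.sin(x) * math.cos(x)

# central finite-difference approximation of dy/dx
h = 1e-6
y = lambda t: math.sin(t) ** 2
numeric = (y(x + h) - y(x - h)) / (2 * h)

print(analytic, numeric)   # both ~0.985, agreeing to several decimal places
```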
Batch Gradient Descent
Use all samples per update
✓ Stable ✗ Slow for large datasets
Stochastic Gradient Descent (SGD)
Use one sample per update
✓ Fast updates ✗ Noisy convergence
Mini-Batch GD
Use batch of size 32, 64, 128...
✓ Best balance (most common)
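A sketch of one epoch of mini-batch gradient descent (the `grad_fn` callback interface, batch size of 32, and learning rate are assumptions made for this sketch):

```python
import numpy as np

def minibatch_gd_epoch(params, X, y, grad_fn, lr=0.01, batch_size=32):
    """One epoch: shuffle the data, then update on each mini-batch.
    grad_fn(params, X_batch, y_batch) is assumed to return gradients
    with the same list structure as params."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grads = grad_fn(params, X[batch], y[batch])
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

Setting batch_size=1 recovers SGD, and batch_size=len(X) recovers batch gradient descent.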
Vanishing Gradients
Solutions: ReLU, residual connections, batch norm
Exploding Gradients
Solutions: Gradient clipping, weight initialization
Overfitting
Solutions: Dropout, L2 regularization, early stopping
Slow Convergence
Solutions: Adam optimizer, learning rate scheduling
XOR cannot be solved by a single perceptron (not linearly separable), but a 2-layer network solves it easily:
| x₁ | x₂ | XOR |
|----|----|-----|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Input: 2 neurons
Hidden: 2 neurons (sigmoid activation)
Output: 1 neuron (sigmoid activation)
Result: after training, the network classifies all 4 XOR patterns correctly.
Hidden layer learns two decision boundaries: the first neuron learns OR, the second learns NAND. The output layer combines them: XOR(x₁, x₂) = OR(x₁, x₂) AND NAND(x₁, x₂). This demonstrates that hidden layers create feature representations that make problems linearly separable.
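This solution can be written down directly. The sketch below hard-codes one 2-2-1 network of the kind described: the first hidden unit computes OR, the second NAND, and the output unit ANDs them (the weight magnitudes of ±20 and the biases are illustrative values chosen to saturate the sigmoids):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: column 0 ~ OR(x1, x2), column 1 ~ NAND(x1, x2)
W1 = np.array([[20.0, -20.0],
               [20.0, -20.0]])
b1 = np.array([-10.0, 30.0])

# Output layer: ~ AND of the two hidden units, which equals XOR
W2 = np.array([[20.0], [20.0]])
b2 = np.array([-30.0])

h = sigmoid(X @ W1 + b1)          # hidden activations, shape (4, 2)
y_hat = sigmoid(h @ W2 + b2)      # network output, shape (4, 1)
print(np.round(y_hat.ravel()))    # [0. 1. 1. 0.] -- the XOR truth table
```

A trained network finds an equivalent solution by gradient descent rather than by hand.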
Consider a 3-layer network: input → hidden → output
Layer 1 (Hidden): z^(1) = W^(1) x + b^(1),   a^(1) = σ(z^(1))
Layer 2 (Output): z^(2) = W^(2) a^(1) + b^(2),   a^(2) = σ(z^(2)) = ŷ
Loss function: L = ½ (a^(2) − y)² (MSE for simplicity)
Goal: Compute ∂L/∂W^(2) and ∂L/∂b^(2)
Chain rule: ∂L/∂W^(2) = (∂L/∂a^(2)) · (∂a^(2)/∂z^(2)) · (∂z^(2)/∂W^(2))
Term 1: Loss gradient ∂L/∂a^(2) = a^(2) − y
Term 2: Activation gradient ∂a^(2)/∂z^(2) = σ'(z^(2)), with σ'(z) = σ(z)(1-σ(z))
Term 3: Pre-activation gradient ∂z^(2)/∂W^(2) = a^(1)
Combining: δ^(2) = (a^(2) − y) ⊙ σ'(z^(2)),   ∂L/∂W^(2) = δ^(2) (a^(1))^T,   ∂L/∂b^(2) = δ^(2)
Goal: Compute ∂L/∂W^(1) and ∂L/∂b^(1)
Chain rule through layers: ∂L/∂W^(1) = (∂L/∂z^(2)) · (∂z^(2)/∂a^(1)) · (∂a^(1)/∂z^(1)) · (∂z^(1)/∂W^(1))
We already have: δ^(2) = ∂L/∂z^(2)
Backpropagate to a^(1): ∂L/∂a^(1) = (W^(2))^T δ^(2)
This is the key step: errors propagate backward through weights!
Through activation: δ^(1) = ∂L/∂z^(1) = [(W^(2))^T δ^(2)] ⊙ σ'(z^(1))
Local gradient: ∂z^(1)/∂W^(1) = x
Final Hidden Layer Gradient: ∂L/∂W^(1) = δ^(1) x^T,   ∂L/∂b^(1) = δ^(1)
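The derivation can be sanity-checked numerically by comparing the analytic ∂L/∂W^(1) with a finite-difference estimate (the layer sizes, random values, and the single weight entry checked here are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.standard_normal(3)                       # input
y = np.array([0.3, 0.9])                         # target
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def loss(W1_):
    a1 = sigmoid(W1_ @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)           # MSE

# analytic gradients, following the derivation above
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)
d2 = (a2 - y) * a2 * (1 - a2)                    # delta^(2)
d1 = (W2.T @ d2) * a1 * (1 - a1)                 # delta^(1)
dW1 = np.outer(d1, x)                            # dL/dW^(1) = delta^(1) x^T

# finite-difference check of one entry of W^(1)
eps = 1e-6
E = np.zeros_like(W1)
E[0, 0] = eps
numeric = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)
print(dW1[0, 0], numeric)                        # the two values should closely agree
```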
Forward Pass
for layer l = 1 to L:
z^(l) = W^(l) a^(l-1) + b^(l)
a^(l) = σ(z^(l))
compute loss L(a^(L), y)
Backward Pass
δ^(L) = ∇_a L ⊙ σ'(z^(L))
for layer l = L-1 down to 1:
δ^(l) = [(W^(l+1))^T δ^(l+1)] ⊙ σ'(z^(l))
Gradients
∂L/∂W^(l) = δ^(l) (a^(l-1))^T
∂L/∂b^(l) = δ^(l)
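A direct NumPy transcription of this pseudocode, using sigmoid activations at every layer and the MSE loss L = 0.5 · ||a^(L) − y||² (the 3-4-2 layer sizes, random initialization, and learning rate in the usage example are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Return (dWs, dbs), the gradients of 0.5 * ||a^(L) - y||^2."""
    # forward pass: store z^(l) and a^(l) for every layer
    a, As, Zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        Zs.append(z)
        As.append(a)
    # backward pass
    delta = (As[-1] - y) * d_sigmoid(Zs[-1])      # delta^(L)
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, As[l])           # dL/dW^(l) = delta^(l) (a^(l-1))^T
        dbs[l] = delta                            # dL/db^(l) = delta^(l)
        if l > 0:                                 # delta^(l-1) via the transposed weights
            delta = (Ws[l].T @ delta) * d_sigmoid(Zs[l - 1])
    return dWs, dbs

# usage: one gradient-descent step on a small 3-4-2 network
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
dWs, dbs = backprop(rng.standard_normal(3), np.array([0.0, 1.0]), Ws, bs)
Ws = [W - 0.1 * dW for W, dW in zip(Ws, dWs)]
bs = [b - 0.1 * db for b, db in zip(bs, dbs)]
```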
Sigmoid
Function: σ(z) = 1 / (1 + e^(−z))
Range: (0, 1)
Derivative: σ'(z) = σ(z)(1 − σ(z))
Max: 0.25 at z=0
Problem: Vanishing gradients (σ' → 0 for large |z|)
Tanh
Function: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Range: (-1, 1)
Derivative: tanh'(z) = 1 − tanh²(z)
Max: 1 at z=0
Advantage: Zero-centered (mean output ≈ 0), stronger gradient than sigmoid
Relation to Sigmoid: tanh(z) = 2σ(2z) − 1
ReLU
Function: ReLU(z) = max(0, z)
Range: [0, ∞)
Derivative: ReLU'(z) = 1 if z > 0, else 0
✓ Pro: No vanishing gradient (z > 0), computationally cheap, sparse activation
✗ Con: Dead neurons (if z < 0 always, gradient = 0 forever)
Leaky ReLU: f(z) = z if z > 0, else αz (small fixed α, e.g. 0.01)
Fixes dead neuron problem by allowing small gradient for negative values.
Parametric ReLU (PReLU): f(z) = z if z > 0, else αz
α is learned during training instead of fixed.
For an L-layer network, the gradient at layer 1 is a product with one factor per subsequent layer:
δ^(1) = σ'(z^(1)) ⊙ (W^(2))^T [σ'(z^(2)) ⊙ (W^(3))^T [⋯ δ^(L)]]
Each term in the product scales the gradient by roughly ||W^(l)|| · σ'(z^(l))
Vanishing Gradients
When: If ||W^(l)|| < 1 and/or σ'(z) < 1, each factor shrinks the gradient, so the product decays exponentially with depth
Example with Sigmoid: σ'(z) ≤ 0.25, so 10 layers scale the gradient by at most 0.25^10 ≈ 10^(−6)
Result: Early layers learn extremely slowly or not at all!
Exploding Gradients
When: If ||W^(l)|| > 1 significantly, each factor grows the gradient, so the product blows up exponentially with depth
Example: if each factor is roughly 2, then 10 layers scale the gradient by 2^10 = 1024
Result: Weights oscillate wildly, training diverges, NaN values!
For Vanishing: ReLU activations, residual connections, batch normalization
For Exploding: gradient clipping, careful weight initialization
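As a concrete example of one of these fixes, here is a sketch of gradient clipping by global norm (the threshold of 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# example: gradients that have blown up are rescaled to norm 5
grads = [np.full((4, 3), 100.0), np.full(4, 100.0)]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # ~5.0
```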
A feedforward network with:
• at least one hidden layer
• a non-linear activation function
• sufficiently many hidden units
can approximate any continuous function f: R^n → R^m on a compact domain to arbitrary precision.
Formally: for every ε > 0 there is an N such that sup_{x ∈ K} ||f(x) − f̂(x)|| < ε,
where K is compact and f̂ is the neural network approximation with N hidden units.
Neural networks approximate functions by combining simple pieces:
Example: Approximating Step Function
A steep sigmoid σ(k(x − c)) with large k approximates a step at c. A weighted sum of shifted steps approximates any piecewise constant function, and piecewise constant functions can approximate any continuous function arbitrarily well.
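A tiny sketch of that idea: a steep sigmoid approximates a step, and a weighted sum of shifted steps approximates a piecewise constant target (the steepness k = 100, the grid, and the particular staircase target are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 100.0                                  # large slope -> sigmoid is nearly a step
x = np.linspace(0.05, 0.95, 10)            # grid avoiding the exact jump points

# target: piecewise constant, 0 on [0, 0.3), 1 on [0.3, 0.7), 0.5 on [0.7, 1]
target = np.where(x < 0.3, 0.0, np.where(x < 0.7, 1.0, 0.5))

# sum of shifted "steps": +1 * step at 0.3, -0.5 * step at 0.7
approx = 1.0 * sigmoid(k * (x - 0.3)) - 0.5 * sigmoid(k * (x - 0.7))

print(np.max(np.abs(approx - target)))     # small (~0.007) away from the jumps
```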
Deep networks can be exponentially more efficient:
Some functions that require O(2^n) neurons in one hidden layer can be represented with O(n) neurons per layer across O(log n) layers.
This partially explains why deep networks work better in practice!