
Neural Networks Basics

Master perceptrons, multi-layer networks, activation functions, and backpropagation

The Perceptron

The First Neural Network Model (1958)

A perceptron is a binary linear classifier that learns decision boundaries:

y = \text{sign}(\sum_{i=1}^{n} w_i x_i + b) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)

• $\mathbf{x} = (x_1, \dots, x_n)$ = input vector

• $\mathbf{w} = (w_1, \dots, w_n)$ = weight vector

• $b$ = bias term

• $y = +1$ if $\mathbf{w}^T\mathbf{x} + b > 0$, else $y = -1$

Learning Rule

Update weights when prediction is wrong:

w_i \leftarrow w_i + \eta(y - \hat{y})x_i

b \leftarrow b + \eta(y - \hat{y})

where $\eta$ is the learning rate
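
A minimal NumPy sketch of this learning rule; the toy dataset and function name are illustrative:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule: update w, b only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1.0 if (w @ xi + b) > 0 else -1.0  # sign activation
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi   # w_i <- w_i + eta*(y - y_hat)*x_i
                b += eta * (yi - y_hat)        # b   <- b   + eta*(y - y_hat)
                errors += 1
        if errors == 0:      # converged: every sample classified correctly
            break
    return w, b

# Toy linearly separable data: label is the sign of the first feature
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)
print(w, b)  # a separating hyperplane for this data
```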

Limitations

  • Only learns linearly separable patterns
  • Cannot solve the XOR problem
  • A single layer limits expressiveness
  • Led to the "AI Winter" of the 1970s

Activation Functions

Sigmoid
\sigma(z) = \frac{1}{1 + e^{-z}}

Range: (0, 1)

Derivative:

\sigma'(z) = \sigma(z)(1-\sigma(z))

✓ Smooth, differentiable

✓ Interpretable as probability

✗ Vanishing gradient problem

✗ Not zero-centered

Tanh
\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range: (-1, 1)

Derivative:

\tanh'(z) = 1 - \tanh^2(z)

✓ Zero-centered

✓ Stronger gradient than sigmoid

✗ Still suffers from vanishing gradient

ReLU (Rectified Linear Unit)
\text{ReLU}(z) = \max(0, z)

Range: [0, ∞)

Derivative:

\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}

✓ No vanishing gradient for $z > 0$

✓ Fast to compute

✓ Most popular for deep networks

✗ Dying ReLU problem

Leaky ReLU
\text{LeakyReLU}(z) = \max(0.01z, z)

Range: (-∞, ∞)

Derivative:

f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0.01 & \text{if } z \leq 0 \end{cases}

✓ Fixes dying ReLU problem

✓ Small negative slope allows learning
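
For concreteness, here is a small NumPy sketch of all four activations and their derivatives as defined above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh^2(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # 1 if z > 0, else 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.linspace(-3, 3, 7)
print(d_sigmoid(0.0))   # 0.25, the maximum of sigmoid's derivative
print(d_relu(z))        # zero gradient for all z <= 0 ("dying ReLU" risk)
```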

Multi-Layer Perceptron (MLP)

Feedforward Neural Networks

Multi-layer networks overcome perceptron limitations by stacking layers with non-linear activations:

Input Layer:

\mathbf{a}^{(0)} = \mathbf{x}

Hidden Layer $l$:

\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}

\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})

Output Layer:

\hat{\mathbf{y}} = \mathbf{a}^{(L)}
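
The forward pass maps directly to a few lines of NumPy. A minimal sketch, applying the same activation $f$ at every layer for simplicity (real networks often use a different output activation):

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """a^(0) = x; then z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = f(z^(l))."""
    a = x
    for W, b in zip(weights, biases):
        a = f(W @ a + b)
    return a  # = a^(L) = y_hat

# Example: a 3 -> 4 -> 2 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, -0.5, 2.0]), weights, biases))
```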

Universal Approximation Theorem

A feedforward network with:

  • One hidden layer
  • A finite number of neurons
  • A non-linear activation

can approximate any continuous function on a compact set to arbitrary precision.

Network Architecture

Input Layer: $n$ neurons

Number of input features

Hidden Layers:

Hyperparameters to tune

Output Layer: $m$ neurons

Number of classes/outputs
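
Parameter counts follow directly from the layer sizes: adjacent layers of sizes $m$ and $n$ contribute $m \times n$ weights plus $n$ biases. A small sketch:

```python
def count_params(sizes):
    """Weight and bias counts for a fully connected network."""
    n_weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
    n_biases = sum(sizes[1:])
    return n_weights, n_biases

print(count_params([3, 4, 2]))  # (20, 6): 3*4 + 4*2 weights, 4 + 2 biases
```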

Backpropagation

Efficient Gradient Computation

Backpropagation efficiently computes gradients using the chain rule, propagating errors backward:

Algorithm Steps:

1. Forward Pass:

Compute activations $\mathbf{a}^{(l)}$ for all layers $l = 1, \dots, L$

2. Compute Loss:

L = \frac{1}{n}\sum_{i} L(\mathbf{y}_i, \hat{\mathbf{y}}_i)

3. Output Layer Gradient:

\delta^{(L)} = \frac{\partial L}{\partial \mathbf{z}^{(L)}} = (\hat{\mathbf{y}} - \mathbf{y}) \odot f'(\mathbf{z}^{(L)})

4. Backward Pass (for layers $l = L-1, \dots, 1$):

\delta^{(l)} = (\mathbf{W}^{(l+1)T}\delta^{(l+1)}) \odot f'(\mathbf{z}^{(l)})

5. Compute Weight Gradients:

\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)}\mathbf{a}^{(l-1)T}

\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}

6. Update Weights:

\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(l)}}

Key Insight: Chain Rule

For nested functions $L = f_3(f_2(f_1(x)))$:

\frac{dL}{dx} = \frac{dL}{df_3} \cdot \frac{df_3}{df_2} \cdot \frac{df_2}{df_1} \cdot \frac{df_1}{dx}

Backpropagation applies this recursively through network layers

Computational Efficiency

  • Computes all gradients in $O(W)$ time ($W$ = number of weights)
  • Reuses forward-pass activations
  • Much faster than numerical differentiation (see the sketch below)
  • Enabled the deep learning revolution
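
The contrast with numerical differentiation is easy to see in code: central differences need two extra loss evaluations per weight, so the full gradient costs $O(W)$ forward passes rather than one forward and one backward pass. It remains useful as a sanity check on backprop gradients. A sketch:

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-6):
    """Central-difference gradient: 2 loss evaluations per weight."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        g.flat[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return g

# Sanity check against a known analytic gradient: grad of ||w||^2 is 2w
w = np.array([1.0, -2.0, 3.0])
print(numerical_grad(lambda v: np.sum(v ** 2), w))  # ~ [2, -4, 6]
```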

Training Neural Networks

Gradient Descent Variants

Batch Gradient Descent

Use all samples per update

✓ Stable ✗ Slow for large datasets

Stochastic Gradient Descent (SGD)

Use one sample per update

✓ Fast updates ✗ Noisy convergence

Mini-Batch GD

Use a batch of 32, 64, or 128 samples per update

✓ Best balance (most common)
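
A sketch of the mini-batch update loop; `grad_fn` is a placeholder for whatever computes the loss gradient on a batch (e.g. backpropagation). Setting `batch_size = 1` recovers SGD, and `batch_size = len(X)` recovers batch gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, w, eta=0.01, batch_size=32, epochs=10):
    """Shuffle each epoch, then take one gradient step per mini-batch."""
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - eta * grad_fn(w, X[batch], y[batch])
    return w
```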

Common Challenges

Vanishing Gradients

Solutions: ReLU, residual connections, batch norm

Exploding Gradients

Solutions: Gradient clipping, weight initialization

Overfitting

Solutions: Dropout, L2 regularization, early stopping

Slow Convergence

Solutions: Adam optimizer, learning rate scheduling

Example: XOR Problem

Why Multi-Layer Networks Matter

XOR cannot be solved by a single perceptron (not linearly separable), but a 2-layer network solves it easily:

XOR Truth Table

x₁  x₂  XOR
0   0   0
0   1   1
1   0   1
1   1   0

Network Architecture

Input: 2 neurons

Hidden: 2 neurons (sigmoid activation)

Output: 1 neuron (sigmoid activation)

Result:

After training, the network learns to classify all 4 XOR patterns correctly

Why It Works

Hidden layer learns two decision boundaries: the first neuron learns OR, the second learns NAND. The output layer combines them: XOR(x₁, x₂) = OR(x₁, x₂) AND NAND(x₁, x₂). This demonstrates that hidden layers create feature representations that make problems linearly separable.
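
A self-contained NumPy sketch of this 2-2-1 sigmoid network trained with backpropagation on XOR; the learning rate and iteration count are illustrative, and with an unlucky initialization training can occasionally stall in a local minimum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros(2)   # hidden layer (2 neurons)
W2, b2 = rng.normal(0, 1, (1, 2)), np.zeros(1)   # output layer (1 neuron)

eta = 1.0
for _ in range(20000):
    # forward pass over all 4 patterns at once
    a1 = sigmoid(X @ W1.T + b1)        # (4, 2) hidden activations
    y_hat = sigmoid(a1 @ W2.T + b2)    # (4, 1) outputs
    # backward pass (MSE loss; sigma' = s(1 - s))
    d2 = (y_hat - y) * y_hat * (1 - y_hat)   # delta^(2)
    d1 = (d2 @ W2) * a1 * (1 - a1)           # delta^(1)
    # gradient steps
    W2 -= eta * d2.T @ a1
    b2 -= eta * d2.sum(axis=0)
    W1 -= eta * d1.T @ X
    b1 -= eta * d1.sum(axis=0)

print(np.round(y_hat.ravel()))  # expected: [0. 1. 1. 0.]
```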

Backpropagation: Complete Mathematical Derivation

Chain Rule Application for Gradient Computation

Network Setup

Consider a 3-layer network: input → hidden → output

Layer 1 (Hidden):

z^{(1)} = W^{(1)}x + b^{(1)}

a^{(1)} = \sigma(z^{(1)})

Layer 2 (Output):

z^{(2)} = W^{(2)}a^{(1)} + b^{(2)}

\hat{y} = \sigma(z^{(2)})

Loss function: $L = \frac{1}{2}(\hat{y} - y)^2$ (MSE for simplicity)

Step 1: Output Layer Gradient

Goal: Compute ∂L/∂W^(2) and ∂L/∂b^(2)

Chain rule:

\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial W^{(2)}}

Term 1: Loss gradient

\frac{\partial L}{\partial \hat{y}} = \hat{y} - y

Term 2: Activation gradient (σ'(z) = σ(z)(1-σ(z)))

\frac{\partial \hat{y}}{\partial z^{(2)}} = \hat{y}(1 - \hat{y})

Term 3: Pre-activation gradient

\frac{\partial z^{(2)}}{\partial W^{(2)}} = (a^{(1)})^T

Combining:

\delta^{(2)} = (\hat{y} - y) \odot \hat{y}(1-\hat{y})

\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^T

\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}

Step 2: Hidden Layer Gradient (Backprop)

Goal: Compute ∂L/∂W^(1) and ∂L/∂b^(1)

Chain rule through layers:

\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}

We already have: δ^(2) = ∂L/∂z^(2)

Backpropagate to a^(1):

\frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \delta^{(2)}

This is the key step: errors propagate backward through weights!

Through activation:

\frac{\partial a^{(1)}}{\partial z^{(1)}} = a^{(1)} \odot (1 - a^{(1)})

Local gradient:

\frac{\partial z^{(1)}}{\partial W^{(1)}} = x^T

Final Hidden Layer Gradient:

\delta^{(1)} = [(W^{(2)})^T \delta^{(2)}] \odot a^{(1)} \odot (1 - a^{(1)})

\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T

\frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}

General Backpropagation Algorithm

Forward Pass

for layer l = 1 to L:

z^(l) = W^(l) a^(l-1) + b^(l)

a^(l) = σ(z^(l))

compute loss L(a^(L), y)

Backward Pass

δ^(L) = ∇_a L ⊙ σ'(z^(L))

for layer l = L-1 down to 1:

δ^(l) = [(W^(l+1))^T δ^(l+1)] ⊙ σ'(z^(l))

Gradients

∂L/∂W^(l) = δ^(l) (a^(l-1))^T

∂L/∂b^(l) = δ^(l)
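
A direct NumPy translation of this algorithm, assuming sigmoid activations at every layer and MSE loss so that ∇_a L = a^(L) − y; a sketch rather than a full training loop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """Gradients for one sample. Ws[l], bs[l] hold W^(l+1), b^(l+1)."""
    # forward pass: cache every z^(l) and a^(l)
    a, activations, zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # output layer: delta^(L) = (a^(L) - y) * sigma'(z^(L))
    sp = sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    delta = (activations[-1] - y) * sp
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    gWs[-1], gbs[-1] = np.outer(delta, activations[-2]), delta
    # backward pass: delta^(l) = (W^(l+1)^T delta^(l+1)) * sigma'(z^(l))
    for l in range(len(Ws) - 2, -1, -1):
        sp = sigmoid(zs[l]) * (1 - sigmoid(zs[l]))
        delta = (Ws[l + 1].T @ delta) * sp
        gWs[l], gbs[l] = np.outer(delta, activations[l]), delta
    return gWs, gbs

# Example: 2 -> 3 -> 1 network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
gWs, gbs = backprop(np.array([0.5, -1.0]), np.array([1.0]), Ws, bs)
print([g.shape for g in gWs])  # [(3, 2), (1, 3)] -- same shapes as Ws
```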

Activation Functions: Derivatives & Properties

Mathematical Analysis of Common Activations

Sigmoid

Function:

\sigma(z) = \frac{1}{1+e^{-z}}

Range: (0, 1)

Derivative:

\sigma'(z) = \sigma(z)(1-\sigma(z))

Max: 0.25 at z=0

Problem: Vanishing gradients (σ' → 0 for large |z|)

Tanh (Hyperbolic Tangent)

Function:

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range: (-1, 1)

Derivative:

\tanh'(z) = 1 - \tanh^2(z)

Max: 1 at z=0

Advantage: Zero-centered (mean output ≈ 0), stronger gradient than sigmoid

Relation to Sigmoid:

\tanh(z) = 2\sigma(2z) - 1

ReLU (Rectified Linear Unit)

Function:

\text{ReLU}(z) = \max(0, z)

Range: [0, ∞)

Derivative:

\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}

✓ Pro: No vanishing gradient (z > 0), computationally cheap, sparse activation

✗ Con: Dead neurons (if z < 0 always, gradient = 0 forever)

Leaky ReLU & Variants

Leaky ReLU:

\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases} \quad (\alpha = 0.01)

\text{LeakyReLU}'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z \leq 0 \end{cases}

Fixes dead neuron problem by allowing small gradient for negative values.

Parametric ReLU (PReLU):

α is learned during training instead of fixed.

Vanishing & Exploding Gradients

Mathematical Analysis of Gradient Flow

Gradient Flow Through Layers

For L-layer network, gradient at layer 1:

\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(L)}} \cdot \prod_{l=2}^{L} \frac{\partial z^{(l)}}{\partial z^{(l-1)}}

Each term in product:

\frac{\partial z^{(l)}}{\partial z^{(l-1)}} = W^{(l)} \cdot \text{diag}(\sigma'(z^{(l-1)}))

Vanishing Gradient Problem

When: If ||W^(l)|| < 1 and/or σ'(z) < 1

\left\|\prod_{l=2}^{L} W^{(l)} \text{diag}(\sigma'(z^{(l-1)}))\right\| \leq \prod_{l=2}^{L} \|W^{(l)}\| \cdot \max_i|\sigma'(z_i^{(l-1)})|

Example with Sigmoid:

  • σ'(z) ≤ 0.25 always
  • If ||W|| ≈ 1, gradient scales by ~0.25^(L-1)
  • For L=10: gradient ≈ 0.25^9 ≈ 3.8 × 10^-6

Result: Early layers learn extremely slowly or not at all!

Exploding Gradient Problem

When: If ||W^(l)|| > 1 significantly

\|\nabla_{W^{(1)}} L\| \geq c^L \quad \text{for some } c > 1

Example:

  • If ||W|| = 1.5, gradient scales by ~1.5^(L-1)
  • For L=10: gradient ≈ 1.5^9 ≈ 38.4
  • Can grow to 10^10 or more!

Result: Weights oscillate wildly, training diverges, NaN values!

Solutions

For Vanishing:

  • Use ReLU activations (σ' = 1 for z > 0)
  • Residual connections (skip connections)
  • Batch normalization
  • Careful weight initialization (He, Xavier)

For Exploding:

  • Gradient clipping: if ||g|| > threshold, g ← g × threshold/||g|| (sketched below)
  • Weight regularization (L2 penalty)
  • Lower learning rates
  • Batch normalization
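
Two of these remedies fit in a few lines each. A sketch of gradient clipping by norm plus He and Xavier initialization; the clipping threshold is illustrative:

```python
import numpy as np

def clip_by_norm(g, threshold=5.0):
    """If ||g|| > threshold, rescale: g <- g * threshold / ||g||."""
    norm = np.linalg.norm(g)
    return g * (threshold / norm) if norm > threshold else g

def he_init(fan_in, fan_out, rng=np.random.default_rng()):
    """He initialization (suited to ReLU): W ~ N(0, 2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_init(fan_in, fan_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform initialization (suited to tanh/sigmoid)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

g = np.array([30.0, 40.0])        # ||g|| = 50
print(clip_by_norm(g))            # rescaled to norm 5
print(he_init(256, 128).std())    # ~ sqrt(2/256) ≈ 0.088
```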

Universal Approximation Theorem

Theoretical Guarantee of Neural Network Expressiveness

Theorem Statement

A feedforward network with:

  • A single hidden layer
  • A finite number of neurons
  • A non-polynomial activation function (e.g., sigmoid, tanh)

can approximate any continuous function $f: \mathbb{R}^n \to \mathbb{R}^m$ on a compact domain to arbitrary precision.

Formally:

\forall \epsilon > 0, \exists N, \mathbf{w}, \mathbf{b}: \sup_{\mathbf{x} \in K} |f(\mathbf{x}) - \hat{f}(\mathbf{x})| < \epsilon

where $K$ is compact and $\hat{f}$ is the neural network approximation with $N$ hidden units.

Intuition: Building Blocks

Neural networks approximate functions by combining simple pieces:

  1. Each hidden neuron creates a "bump" or "step" function
  2. Linear combination of these bumps can approximate any shape
  3. More neurons = finer approximation

Example: Approximating Step Function

\text{step}(x) \approx \sigma(k(x-a)) \quad \text{as } k \to \infty

Steep sigmoid ≈ step. Sum of shifted steps ≈ any piecewise constant function.
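
A quick numeric illustration of this construction; the grid and slopes k are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 5)
a = 0.0                         # step location
for k in [1, 10, 100]:          # steeper and steeper sigmoids
    print(k, np.round(sigmoid(k * (x - a)), 3))
# As k grows, sigmoid(k(x - a)) approaches a unit step at x = a;
# weighted sums of shifted steps approximate piecewise-constant functions.
```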

Important Caveats

  • ✗ Doesn't guarantee efficient approximation
    May need exponentially many neurons for good approximation
  • ✗ Doesn't tell us how to find weights
    Existence ≠ tractability. Gradient descent may not find optimal weights
  • ✗ Only for functions, not learning
    Says nothing about generalization from finite training data
  • ✓ But it's a powerful theoretical foundation
    Shows NNs are at least as expressive as needed

Modern Extensions

Deep networks can be exponentially more efficient:

Some functions that require O(2^n) neurons in one hidden layer can be represented with O(n) neurons per layer across O(log n) layers.

This partially explains why deep networks work better in practice!

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the output of a perceptron with weights , bias , and input using sign activation?
2. What is the range of the sigmoid activation function?
3. What is the derivative of the sigmoid function σ(z)?
4. Which activation function addresses the 'dying ReLU' problem?
5. In a neural network with layers of sizes [3, 4, 2], how many weight parameters are there (excluding biases)?
6. What does the Universal Approximation Theorem state?
7. In backpropagation, gradients are computed using:
8. For a network with cross-entropy loss and softmax output, what is the gradient for the output layer?
9. What is the vanishing gradient problem?
10. What is batch gradient descent?