Explore the fundamental building blocks of neural networks, from biological inspiration to mathematical models
Artificial neural networks draw inspiration from biological neurons in the brain. A biological neuron receives signals through dendrites, processes them in the cell body, and transmits output through the axon to other neurons via synapses.
The first mathematical model of an artificial neuron, the McCulloch-Pitts (M-P) model proposed in 1943, laid the foundation for all modern neural networks. It demonstrated that simple binary threshold units can perform complex logical computations.
An artificial neuron computes a weighted sum of its inputs and applies an activation function:
Net Input: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Or in vector form: z = wᵀx + b
Output: y = f(z)
Inputs (x)
Feature values from data (e.g., income, age, credit score)
Weights (w)
Learned parameters determining input importance
Bias (b)
Threshold adjustment term (like an intercept)
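As a concrete illustration, here is a minimal NumPy sketch of a single neuron's forward pass, y = f(wᵀx + b). The feature values, weights, and bias below are made up purely for the example.

```python
import numpy as np

def neuron_forward(x, w, b, activation):
    """Compute a single neuron's output: y = f(w·x + b)."""
    z = np.dot(w, x) + b      # net input: weighted sum plus bias
    return activation(z)      # apply the activation function

# Hypothetical inputs: income ($k), age, credit score
x = np.array([75.0, 34.0, 720.0])
w = np.array([0.02, 0.01, 0.001])   # hypothetical learned weights
b = -1.5

step = lambda z: 1 if z >= 0 else 0  # binary threshold activation
print(neuron_forward(x, w, b, step))  # 1 (fires) or 0 (does not fire)
```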
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activation functions, multiple layers would collapse into a single linear transformation.
The original activation function used in perceptrons. Outputs binary values based on a threshold.
f(z) = 1 if z ≥ 0
f(z) = 0 if z < 0
Use case: Theoretical understanding, binary classification in single-layer networks
Smooth S-shaped function that squashes values between 0 and 1. Widely used in early neural networks and logistic regression.
σ(z) = 1 / (1 + e⁻ᶻ)
Range: (0, 1)
Derivative: σ(z) · (1 - σ(z))
Use case: Binary classification output layer, probability estimation
Similar to sigmoid but zero-centered, squashing values between -1 and 1. Often performs better than sigmoid in hidden layers.
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)
Range: (-1, 1)
Derivative: 1 - tanh²(z)
Use case: Hidden layers in shallow networks, RNN/LSTM gates
The default choice for modern deep learning. Extremely simple yet effective: it outputs the input if positive, and zero otherwise.
ReLU(z) = max(0, z)
Range: [0, ∞)
Derivative: 1 if z > 0, else 0
Use case: Default choice for hidden layers in deep networks, CNNs
Variants: Leaky ReLU (allows a small negative gradient), Parametric ReLU (learnable slope), and ELU (exponential linear unit) are all designed to address the dying ReLU problem (neurons whose gradient is always zero and therefore stop learning) while maintaining computational efficiency.
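As a sketch, the functions above (plus the Leaky ReLU variant) can be written in a few lines of NumPy; the test values in z are arbitrary.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)              # NumPy provides tanh directly

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):     # alpha: small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(f"{name:>10}: {np.round(f(z), 3)}")
```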
| Function | Range | Zero-Centered | Vanishing Gradient | Best Use |
|---|---|---|---|---|
| Step | {0, 1} | No | Yes (zero everywhere) | Theoretical only |
| Sigmoid | (0, 1) | No | Yes (at extremes) | Output layer (binary) |
| Tanh | (-1, 1) | Yes | Yes (at extremes) | Hidden layers (shallow) |
| ReLU | [0, ∞) | No | No (for z > 0) | Hidden layers (deep) ⭐ |
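The vanishing-gradient column can be checked numerically. The short sketch below evaluates each derivative from the formulas above at a single large input; the choice of z = 5 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 5.0
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))   # ≈ 0.0066  — nearly vanished
d_tanh    = 1 - np.tanh(z) ** 2             # ≈ 0.00018 — nearly vanished
d_relu    = 1.0 if z > 0 else 0.0           # exactly 1 — gradient preserved
print(d_sigmoid, d_tanh, d_relu)
```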
The perceptron learning rule is one of the earliest and simplest supervised learning algorithms. It iteratively adjusts weights based on errors, converging to a solution for linearly separable problems.
For each training sample, if the prediction is incorrect, update the weights:
wᵢ ← wᵢ + Δwᵢ
Δwᵢ = η · (y - ŷ) · xᵢ
η (eta)
Learning rate, typically 0.01-0.1
y
True label (0 or 1)
ŷ
Predicted label
xᵢ
Input feature value
Initialize weights and bias
Set all weights to small random values (e.g., from -0.1 to 0.1) or zeros
For each training sample
Iterate through all samples in the training set
Compute output
Calculate ŷ = step(wᵀx + b)
Update weights if error exists
If ŷ ≠ y, apply the perceptron learning rule to adjust weights
Repeat until convergence
Continue iterating until no errors occur or maximum epochs reached
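Putting the update rule and these steps together, here is a minimal sketch of the perceptron training loop in NumPy. The toy AND dataset, learning rate, and epoch limit are chosen only for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Train a single-layer perceptron with the perceptron learning rule."""
    w = np.zeros(X.shape[1])   # step 1: initialize weights ...
    b = 0.0                    # ... and bias
    for epoch in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):                        # step 2: each sample
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0   # step 3: compute output
            if y_hat != y_i:                              # step 4: update on error
                w += eta * (y_i - y_hat) * x_i
                b += eta * (y_i - y_hat)
                errors += 1
        if errors == 0:                                   # step 5: converged
            break
    return w, b

# Toy example: the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)   # weights and bias of a separating hyperplane for AND
```

Because AND is linearly separable, the loop reaches zero errors after a handful of epochs, consistent with the convergence theorem described next.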
The perceptron convergence theorem states that if the training data is linearly separable, the perceptron learning algorithm is guaranteed to find a separating hyperplane in a finite number of steps. However, for non-linearly separable data (like XOR), the algorithm will never converge.
Let's apply the perceptron to a real-world problem: predicting whether a loan application will be approved based on applicant features.
A dataset of 200 credit applications with features for approval prediction. Each application is labeled as approved (1) or rejected (0).
| ID | Income ($k) | Debt ($k) | Age | Credit Score | Approved |
|---|---|---|---|---|---|
| 1 | 75 | 15 | 34 | 720 | Yes (1) |
| 2 | 42 | 28 | 45 | 580 | No (0) |
| 3 | 95 | 8 | 28 | 780 | Yes (1) |
| 4 | 38 | 32 | 52 | 520 | No (0) |
| 5 | 68 | 18 | 39 | 690 | Yes (1) |
| 6 | 52 | 22 | 41 | 640 | Yes (1) |
| ... | 194 more samples in full dataset | | | | |
Consider training a perceptron on the first sample with initial weights w = [0.1, -0.2, 0.05, 0.3] and bias b = -0.5, learning rate η = 0.01:
Step 1: Forward Pass
z = (0.1 × 75) + (-0.2 × 15) + (0.05 × 34) + (0.3 × 720) - 0.5
z = 7.5 - 3.0 + 1.7 + 216.0 - 0.5 = 221.7
ŷ = step(221.7) = 1 (correct! matches true label y = 1)
Step 2: Weight Update
Since ŷ = y, error = 0, so no weight update needed. The perceptron correctly classified this sample!
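The same forward pass can be verified in a few lines of NumPy, using the weights, bias, and applicant features from the example above:

```python
import numpy as np

w = np.array([0.1, -0.2, 0.05, 0.3])   # income, debt, age, credit score
b = -0.5
x = np.array([75, 15, 34, 720])        # applicant 1 from the table
y = 1                                  # true label: approved

z = np.dot(w, x) + b                   # ≈ 221.7
y_hat = 1 if z >= 0 else 0             # step activation -> 1
print(z, y_hat, y_hat == y)            # no update needed: prediction is correct
```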
The XOR (exclusive OR) problem exposed a critical limitation of single-layer perceptrons and contributed to the first AI winter. This seemingly simple problem cannot be solved by any linear classifier.
| x₁ | x₂ | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR returns 1 when inputs differ, 0 when they're the same
The XOR problem is not linearly separable. You cannot draw a single straight line (or hyperplane) to separate the positive and negative examples.
Geometric Insight: Points (0,1) and (1,0) should output 1, while (0,0) and (1,1) should output 0. These points form a diagonal pattern that cannot be separated by any single line.
This limitation applies to all single-layer linear classifiers, not just perceptrons.
AND Function ✓
Linearly separable - perceptron converges easily
OR Function ✓
Linearly separable - perceptron converges easily
XOR Function ✗
NOT linearly separable - perceptron fails
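You can observe this directly with the training loop sketched earlier (repeated here in condensed form): the perceptron reaches zero errors on AND and OR, but keeps misclassifying at least one XOR sample no matter how long it trains. The learning rate and epoch count are arbitrary.

```python
import numpy as np

def train_and_count_errors(X, y, eta=0.1, epochs=50):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            w += eta * (y_i - y_hat) * x_i
            b += eta * (y_i - y_hat)
    # count remaining misclassifications after training
    return sum(1 for x_i, y_i in zip(X, y)
               if (1 if np.dot(w, x_i) + b >= 0 else 0) != y_i)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for name, y in [("AND", [0, 0, 0, 1]), ("OR", [0, 1, 1, 1]), ("XOR", [0, 1, 1, 0])]:
    print(name, "errors after training:", train_and_count_errors(X, np.array(y)))
# AND and OR reach 0 errors; XOR never does
```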
The XOR problem can be solved by adding a hidden layer, creating a multi-layer perceptron. The hidden layer learns non-linear feature combinations that make the problem linearly separable in a higher-dimensional space. This realization, combined with the backpropagation algorithm in 1986, revived neural network research and enabled modern deep learning.
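To make that concrete, here is a minimal sketch of a two-layer network that solves XOR. The hidden weights are set by hand rather than learned: one hidden unit computes OR, the other AND, and the output unit fires only when OR is on and AND is off.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 computes OR(x1, x2), h2 computes AND(x1, x2)
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])    # OR threshold, AND threshold
    h = step(W_hidden @ x + b_hidden)
    # Output layer: fires when OR is on and AND is off, i.e. inputs differ
    w_out = np.array([1.0, -2.0])
    b_out = -0.5
    return 1 if w_out @ h + b_out >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_mlp(x1, x2))   # 0, 1, 1, 0
```

The hidden layer maps the four input points into a space where a single line separates the classes, which is exactly what the single-layer perceptron could not do.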
Artificial neurons mimic biological neurons with weighted inputs, summation, and activation functions
Activation functions introduce non-linearity; ReLU is the modern default for deep networks
Perceptron learning adjusts weights based on prediction errors, converging for linearly separable problems
Single-layer limitations: Cannot solve non-linearly separable problems like XOR
Multi-layer networks overcome these limitations by learning hierarchical representations
Real-world applications like credit approval demonstrate practical utility