Explore the fundamental building blocks of neural networks, from biological inspiration to mathematical models
Artificial neural networks draw inspiration from biological neurons in the brain. A biological neuron receives signals through dendrites, processes them in the cell body, and transmits output through the axon to other neurons via synapses.
The first mathematical model of an artificial neuron, the McCulloch-Pitts (M-P) model proposed in 1943, laid the foundation for all modern neural networks. It demonstrated that simple binary threshold units can perform complex logical computations.
An artificial neuron computes a weighted sum of its inputs and applies an activation function:
Net Input: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Or in vector form: z = wᵀx + b
Output: y = f(z)
Inputs (x)
Feature values from data (e.g., income, age, credit score)
Weights (w)
Learned parameters determining input importance
Bias (b)
Threshold adjustment term (like an intercept)
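As a concrete illustration, here is a minimal NumPy sketch of a single neuron's forward pass, y = f(wᵀx + b). The feature values, weights, and bias below are made up purely for the example.

```python
import numpy as np

def neuron_forward(x, w, b, activation):
    """Compute a single neuron's output: y = f(w·x + b)."""
    z = np.dot(w, x) + b      # net input: weighted sum plus bias
    return activation(z)      # apply the activation function

# Hypothetical inputs: income ($k), age, credit score
x = np.array([75.0, 34.0, 720.0])
w = np.array([0.02, 0.01, 0.001])   # hypothetical learned weights
b = -1.5

step = lambda z: 1 if z >= 0 else 0  # binary threshold activation
print(neuron_forward(x, w, b, step))  # 1 (fires) or 0 (does not fire)
```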
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activation functions, multiple layers would collapse into a single linear transformation.
The original activation function used in perceptrons. Outputs binary values based on a threshold.
f(z) = 1 if z ≥ 0
f(z) = 0 if z < 0
Use case: Theoretical understanding, binary classification in single-layer networks
Smooth S-shaped function that squashes values between 0 and 1. Widely used in early neural networks and logistic regression.
σ(z) = 1 / (1 + e⁻ᶻ)
Range: (0, 1)
Derivative: σ(z) · (1 - σ(z))
Use case: Binary classification output layer, probability estimation
Similar to sigmoid but zero-centered, squashing values between -1 and 1. Often performs better than sigmoid in hidden layers.
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)
Range: (-1, 1)
Derivative: 1 - tanh²(z)
Use case: Hidden layers in shallow networks, RNN/LSTM gates
The default choice for modern deep learning. Extremely simple yet effective: it outputs the input if positive, and zero otherwise.
ReLU(z) = max(0, z)
Range: [0, ∞)
Derivative: 1 if z > 0, else 0
Use case: Default choice for hidden layers in deep networks, CNNs
Variants: Leaky ReLU (allows a small negative gradient), Parametric ReLU (learnable slope), and ELU (exponential linear unit) are all designed to address the dying ReLU problem (neurons whose gradient is always zero and therefore stop learning) while maintaining computational efficiency.
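As a sketch, the functions above (plus the Leaky ReLU variant) can be written in a few lines of NumPy; the test values in z are arbitrary.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)              # NumPy provides tanh directly

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):     # alpha: small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(f"{name:>10}: {np.round(f(z), 3)}")
```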
| Function | Range | Zero-Centered | Vanishing Gradient | Best Use |
|---|---|---|---|---|
| Step | {0, 1} | No | Yes (zero everywhere) | Theoretical only |
| Sigmoid | (0, 1) | No | Yes (at extremes) | Output layer (binary) |
| Tanh | (-1, 1) | Yes | Yes (at extremes) | Hidden layers (shallow) |
| ReLU | [0, ∞) | No | No (for z > 0) | Hidden layers (deep) ⭐ |
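The vanishing-gradient column can be checked numerically. The short sketch below evaluates each derivative from the formulas above at a single large input; the choice of z = 5 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 5.0
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))   # ≈ 0.0066  — nearly vanished
d_tanh    = 1 - np.tanh(z) ** 2             # ≈ 0.00018 — nearly vanished
d_relu    = 1.0 if z > 0 else 0.0           # exactly 1 — gradient preserved
print(d_sigmoid, d_tanh, d_relu)
```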
The perceptron learning rule is one of the earliest and simplest supervised learning algorithms. It iteratively adjusts weights based on errors, converging to a solution for linearly separable problems.
For each training sample, if the prediction is incorrect, update the weights:
wᵢ ← wᵢ + Δwᵢ
Δwᵢ = η · (y - ŷ) · xᵢ
η (eta)
Learning rate, typically 0.01-0.1
y
True label (0 or 1)
ŷ
Predicted label
xᵢ
Input feature value
Initialize weights and bias
Set all weights to small random values (e.g., from -0.1 to 0.1) or zeros
For each training sample
Iterate through all samples in the training set
Compute output
Calculate ŷ = step(wᵀx + b)
Update weights if error exists
If ŷ ≠ y, apply the perceptron learning rule to adjust weights
Repeat until convergence
Continue iterating until no errors occur or maximum epochs reached
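Putting the update rule and these steps together, here is a minimal sketch of the perceptron training loop in NumPy. The toy AND dataset, learning rate, and epoch limit are chosen only for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Train a single-layer perceptron with the perceptron learning rule."""
    w = np.zeros(X.shape[1])   # step 1: initialize weights ...
    b = 0.0                    # ... and bias
    for epoch in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):                        # step 2: each sample
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0   # step 3: compute output
            if y_hat != y_i:                              # step 4: update on error
                w += eta * (y_i - y_hat) * x_i
                b += eta * (y_i - y_hat)
                errors += 1
        if errors == 0:                                   # step 5: converged
            break
    return w, b

# Toy example: the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)   # weights and bias of a separating hyperplane for AND
```

Because AND is linearly separable, the loop reaches zero errors after a handful of epochs, consistent with the convergence theorem described next.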
The perceptron convergence theorem states that if the training data is linearly separable, the perceptron learning algorithm is guaranteed to find a separating hyperplane in a finite number of steps. However, for non-linearly separable data (like XOR), the algorithm will never converge.
Let's apply the perceptron to a real-world problem: predicting whether a loan application will be approved based on applicant features.
A dataset of 200 credit applications with features for approval prediction. Each application is labeled as approved (1) or rejected (0).
| ID | Income ($k) | Debt ($k) | Age | Credit Score | Approved |
|---|---|---|---|---|---|
| 1 | 75 | 15 | 34 | 720 | Yes (1) |
| 2 | 42 | 28 | 45 | 580 | No (0) |
| 3 | 95 | 8 | 28 | 780 | Yes (1) |
| 4 | 38 | 32 | 52 | 520 | No (0) |
| 5 | 68 | 18 | 39 | 690 | Yes (1) |
| 6 | 52 | 22 | 41 | 640 | Yes (1) |
| ... | 194 more samples in full dataset | | | | |
Consider training a perceptron on the first sample with initial weights w = [0.1, -0.2, 0.05, 0.3] and bias b = -0.5, learning rate η = 0.01:
Step 1: Forward Pass
z = (0.1 × 75) + (-0.2 × 15) + (0.05 × 34) + (0.3 × 720) - 0.5
z = 7.5 - 3.0 + 1.7 + 216.0 - 0.5 = 221.7
ŷ = step(221.7) = 1 (correct! matches true label y = 1)
Step 2: Weight Update
Since ŷ = y, error = 0, so no weight update needed. The perceptron correctly classified this sample!
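The same forward pass can be verified in a few lines of NumPy, using the weights, bias, and applicant features from the example above:

```python
import numpy as np

w = np.array([0.1, -0.2, 0.05, 0.3])   # income, debt, age, credit score
b = -0.5
x = np.array([75, 15, 34, 720])        # applicant 1 from the table
y = 1                                  # true label: approved

z = np.dot(w, x) + b                   # ≈ 221.7
y_hat = 1 if z >= 0 else 0             # step activation -> 1
print(z, y_hat, y_hat == y)            # no update needed: prediction is correct
```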
The XOR (exclusive OR) problem exposed a critical limitation of single-layer perceptrons and contributed to the first AI winter. This seemingly simple problem cannot be solved by any linear classifier.
| x₁ | x₂ | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
XOR returns 1 when inputs differ, 0 when they're the same
The XOR problem is not linearly separable. You cannot draw a single straight line (or hyperplane) to separate the positive and negative examples.
Geometric Insight: Points (0,1) and (1,0) should output 1, while (0,0) and (1,1) should output 0. These points form a diagonal pattern that cannot be separated by any single line.
This limitation applies to all single-layer linear classifiers, not just perceptrons.
AND Function ✓
Linearly separable - perceptron converges easily
OR Function ✓
Linearly separable - perceptron converges easily
XOR Function ✗
NOT linearly separable - perceptron fails
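You can observe this directly with the training loop sketched earlier (repeated here in condensed form): the perceptron reaches zero errors on AND and OR, but keeps misclassifying at least one XOR sample no matter how long it trains. The learning rate and epoch count are arbitrary.

```python
import numpy as np

def train_and_count_errors(X, y, eta=0.1, epochs=50):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            w += eta * (y_i - y_hat) * x_i
            b += eta * (y_i - y_hat)
    # count remaining misclassifications after training
    return sum(1 for x_i, y_i in zip(X, y)
               if (1 if np.dot(w, x_i) + b >= 0 else 0) != y_i)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for name, y in [("AND", [0, 0, 0, 1]), ("OR", [0, 1, 1, 1]), ("XOR", [0, 1, 1, 0])]:
    print(name, "errors after training:", train_and_count_errors(X, np.array(y)))
# AND and OR reach 0 errors; XOR never does
```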
The XOR problem can be solved by adding a hidden layer, creating a multi-layer perceptron. The hidden layer learns non-linear feature combinations that make the problem linearly separable in a higher-dimensional space. This realization, combined with the backpropagation algorithm in 1986, revived neural network research and enabled modern deep learning.
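To make that concrete, here is a minimal sketch of a two-layer network that solves XOR. The hidden weights are set by hand rather than learned: one hidden unit computes OR, the other AND, and the output unit fires only when OR is on and AND is off.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 computes OR(x1, x2), h2 computes AND(x1, x2)
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])    # OR threshold, AND threshold
    h = step(W_hidden @ x + b_hidden)
    # Output layer: fires when OR is on and AND is off, i.e. inputs differ
    w_out = np.array([1.0, -2.0])
    b_out = -0.5
    return 1 if w_out @ h + b_out >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_mlp(x1, x2))   # 0, 1, 1, 0
```

The hidden layer maps the four input points into a space where a single line separates the classes, which is exactly what the single-layer perceptron could not do.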
Artificial neurons mimic biological neurons with weighted inputs, summation, and activation functions
Activation functions introduce non-linearity; ReLU is the modern default for deep networks
Perceptron learning adjusts weights based on prediction errors, converging for linearly separable problems
Single-layer limitations: Cannot solve non-linearly separable problems like XOR
Multi-layer networks overcome these limitations by learning hierarchical representations
Real-world applications like credit approval demonstrate practical utility