MathIsimple
Deep Learning
10 min read

Why Deep Networks Need Activation Functions

Stack enough affine layers without a nonlinearity and the whole network collapses into one layer.

Activation Functions · ReLU · Expressivity · Neural Networks · Deep Learning Basics

If one layer is too weak, why not stack fifty linear layers and call it deep?

That idea sounds perfectly reasonable the first time you hear it. It is also mathematically wrong. If you stack affine layers and never insert a nonlinearity, the whole network collapses into a single affine transformation. Depth becomes an illusion.

Activation functions matter because they are what stops that collapse. They are not a decorative extra. They are the reason a deep network is actually deep.

One hundred affine layers still equal one affine map

Start with two layers:

$$\mathbf{h} = \mathbf{W}_1\mathbf{x} + \mathbf{b}_1$$
$$\mathbf{o} = \mathbf{W}_2\mathbf{h} + \mathbf{b}_2$$

Substitute the first equation into the second:

$$\mathbf{o} = \mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)$$

The result still has the form

$$\mathbf{o} = \mathbf{W}_{\text{new}}\mathbf{x} + \mathbf{b}_{\text{new}}$$

So two affine layers without a nonlinearity are equivalent to one affine layer. The same argument works for ten layers, one hundred layers, or one thousand. Composition does not create new expressive power if every step stays affine.
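The collapse is easy to verify numerically. The sketch below builds two affine layers with arbitrary weights (shapes and values are illustrative, not from the article) and checks that their composition matches a single precomputed affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two affine layers with arbitrary random parameters.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Apply the layers one after the other.
stacked = W2 @ (W1 @ x + b1) + b2

# Collapse them into a single affine map, exactly as in the derivation.
W_new = W2 @ W1
b_new = W2 @ b1 + b2
collapsed = W_new @ x + b_new

print(np.allclose(stacked, collapsed))  # True
```

The same check passes for any depth: folding each layer into `W_new` and `b_new` one at a time never leaves the affine family.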

Without nonlinearities, depth does not buy you a richer function class. It only buys you a more complicated way to write the same affine map.

Bias terms do not rescue the stack

Beginners often hear "linear layers collapse" and reply, "But neural-network layers have biases." True. That changes the word you should use, not the conclusion. A stack of affine maps is still a single affine map.

That is why this issue is structural, not cosmetic. The problem is not missing bias terms. The problem is missing nonlinearity.

Why that kills expressive power

Real-world problems are rarely affine. Image classification is not an affine function of raw pixels. Speech recognition is not an affine function of wave amplitudes. Language understanding is certainly not an affine function of token embeddings.

An affine model can tilt, shift, and stretch the input space, but it still draws a very restricted family of decision boundaries. If the task requires curved structure, nested interactions, or hierarchical features, an affine stack cannot learn it no matter how many layers you pile on.

This is the quiet reason early neural networks stalled. More layers alone were not enough. The network needed a mechanism that could break the affine collapse.

Depth starts to matter the moment you add a nonlinearity

Insert an activation function ϕ\phi between layers:

$$\mathbf{h} = \phi(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)$$
$$\mathbf{o} = \mathbf{W}_2\mathbf{h} + \mathbf{b}_2$$

Now the middle step is not affine anymore, so the composition cannot generally be collapsed into one matrix and one bias vector. The network can bend space, carve it into regions, and build richer features layer by layer.

That is the whole game. One layer might detect edges. A later layer might combine edges into corners. Another might combine corners into shapes. Another might combine shapes into objects. None of that hierarchy makes sense if every stage is forced to remain affine.

Technically, ReLU networks are still piecewise linear, but that is enough. Each layer creates new region boundaries, and composing many such layers yields a function that can approximate extremely complicated patterns.
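A tiny concrete example of that extra power: the absolute-value function is not affine, yet a single hidden ReLU layer with two units represents it exactly via $|x| = \operatorname{ReLU}(x) + \operatorname{ReLU}(-x)$. The weights below are hand-picked for this identity, not learned:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hidden layer computes the pre-activations x and -x;
# the output layer sums the two rectified halves.
W1 = np.array([[1.0], [-1.0]])
w2 = np.array([1.0, 1.0])

def abs_net(x):
    return w2 @ relu(W1 @ np.array([x]))

for x in (-3.0, -0.5, 0.0, 2.0):
    assert abs_net(x) == abs(x)
```

No single affine map $wx + b$ can do this, since $|x|$ bends at the origin; the ReLU is precisely what supplies the bend.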

Why ReLU became the default hidden-layer choice

The most common modern hidden-layer activation is the rectified linear unit:

$$\operatorname{ReLU}(x) = \max(0, x)$$

Its rule is almost embarrassingly simple:

  • If the input is positive, leave it alone.
  • If the input is negative, clamp it to zero.

ReLU became popular for two reasons. First, it is cheap to compute. Second, on the positive side its derivative is exactly 1, which makes gradient flow much healthier than in older saturating activations.
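Both the function and its derivative fit in a couple of NumPy lines, which is part of the appeal:

```python
import numpy as np

def relu(z):
    # Positive inputs pass through; negative inputs are clamped to zero.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 where the unit is active, 0 where it is clamped.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(relu_grad(z))
```

Note that the gradient on the active side is a constant 1, with no flattening for large inputs.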

Why sigmoid struggled in deep hidden layers

Before ReLU, hidden layers often used sigmoid or tanh. Those functions are smooth and intuitive, but they saturate. For large positive or negative inputs, the curve flattens out and the derivative becomes tiny.

That matters during backpropagation. Gradients are multiplied repeatedly as they move backward through the network. If each layer contributes a factor smaller than one, the signal can shrink exponentially. Early layers then learn painfully slowly. That is the vanishing-gradient problem.
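The shrinkage is easy to quantify. The sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at 0.25 (at $z = 0$), so even in the best case a chain of sigmoid layers multiplies the backward signal by at most 0.25 per layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, reached at z = 0

# Best-case per-layer factor, compounded over a 20-layer chain.
depth = 20
best_case = sigmoid_grad(0.0) ** depth
print(best_case)  # 0.25**20, on the order of 1e-12
```

Away from $z = 0$ the factor is even smaller, so real networks shrink the gradient faster than this best-case bound suggests.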

ReLU is not a universal cure, but it avoids that specific saturation behavior on its active side. That one design choice made very deep feedforward networks much easier to train.

ReLU is useful, not magical

ReLU has its own failure mode: if a neuron's pre-activation stays negative, the unit outputs zero and its gradient can stay zero too. That is the "dead ReLU" problem. Variants such as Leaky ReLU, GELU, and SiLU try to soften that edge in different ways.
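As one illustration of how the variants soften that edge, here is a minimal Leaky ReLU sketch (the slope `alpha=0.01` is a common choice, not a universal constant):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Negative inputs keep a small slope instead of being clamped to 0,
    # so the unit's gradient never vanishes entirely.
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, -1.0, 0.5])
print(leaky_relu(z))  # [-0.1, -0.01, 0.5]
```

GELU and SiLU pursue the same goal with smooth curves instead of a fixed negative slope.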

It is also worth keeping perspective. Sigmoid did not disappear. It is still the right tool for binary probabilities and for gating mechanisms in recurrent architectures. It simply stopped being the default hidden-layer activation in deep feedforward networks.

The takeaway

If you remove the activation functions, a deep network loses the property that makes depth worthwhile. The model collapses into one affine map, no matter how many layers you stack.

Nonlinear activations are the point where representation learning begins. They stop the collapse, let the network build hierarchy, and give deep models the expressive power that shallow affine models do not have.

In other words: the architecture can be deep only if the function class is nonlinear. The depth is the structure. The activation is the soul.
