In popular retellings of AI history, one straight line nearly derailed the whole field. That line is the decision boundary of a single-layer perceptron.
The historical story is more nuanced than the myth. Minsky and Papert did not single-handedly cause the AI winter, but their critique of perceptrons crystallized a real limitation: a purely linear classifier cannot solve even simple problems that are not linearly separable, such as XOR.
Why XOR mattered so much
The XOR truth table is tiny:

| $x_1$ | $x_2$ | XOR |
|-------|-------|-----|
| 0     | 0     | 0   |
| 0     | 1     | 1   |
| 1     | 0     | 1   |
| 1     | 1     | 0   |
A single perceptron computes

$$\hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + b > 0, \\ 0 & \text{otherwise.} \end{cases}$$
The boundary is a line in two dimensions. XOR places the positive class on opposite corners and the negative class on the remaining two corners. No single line can separate them.
This was devastating because the perceptron had been sold as an early model of machine intelligence. If it failed on a toy logical pattern, what else could it not do?
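The claim that no line separates XOR is easy to check empirically. The sketch below (illustrative rather than a proof, and the helper name `linearly_separable` and the weight grid are my own choices) brute-forces a grid of candidate lines and finds a separator for OR but none for XOR:

```python
import itertools

import numpy as np

def linearly_separable(labels):
    """Coarse grid search: does any line w1*x1 + w2*x2 + b = 0
    classify the four corner points with the given labels?"""
    points = list(itertools.product([0, 1], repeat=2))  # (0,0),(0,1),(1,0),(1,1)
    grid = np.linspace(-2, 2, 41)  # weights and bias in steps of 0.1
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                preds = [int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in points]
                if preds == labels:
                    return True
    return False

print(linearly_separable([0, 1, 1, 1]))  # OR  -> True
print(linearly_separable([0, 1, 1, 0]))  # XOR -> False
```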
Why stacking linear layers does not solve the problem
The next beginner instinct is straightforward: if one linear layer is too weak, stack many of them. But that does not change the function class.
Consider two affine layers:

$$h = W_1 x + b_1, \qquad y = W_2 h + b_2.$$

Substitute the first into the second:

$$y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2).$$
That is still just one affine map. Add ten more affine layers and the conclusion stays the same. Composition of affine transformations is affine.
Depth without nonlinearity is algebraic theater. It looks deeper, but it collapses to a single affine transformation.
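The collapse is easy to verify numerically. A minimal sketch with NumPy (the weights are random and the shapes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random affine layers: R^3 -> R^4 -> R^2.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer application...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...collapses to one affine map with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```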
Activation functions are what break linear collapse
Insert a nonlinear activation $\sigma$ between layers:

$$y = W_2\,\sigma(W_1 x + b_1) + b_2.$$
Now the model can no longer be collapsed into a single affine expression. The activation bends the representation space. Hidden units can carve the input into regions, and later layers can recombine those regions into more complex decision boundaries.
This is the real reason multilayer networks became interesting again. The hidden layers were not enough by themselves. The nonlinear activations were the missing ingredient.
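A concrete illustration: with a single nonlinear hidden layer, even hand-picked weights suffice to compute XOR exactly. The sketch below uses ReLU and one of many possible weight settings (these particular weights are my own example, not a unique solution):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hidden layer: h1 = relu(x1 + x2) fires when at least one input is on;
# h2 = relu(x1 + x2 - 1) fires only when both are on.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# Output layer: h1 - 2*h2 reproduces XOR exactly on {0,1} inputs.
W2 = np.array([1.0, -2.0])

def xor_net(x1, x2):
    h = relu(W1 @ np.array([x1, x2]) + b1)
    return W2 @ h

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, int(xor_net(x1, x2)))  # prints the XOR truth table
```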
Sigmoid: elegant, useful, and problematic
The classic sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

Sigmoid maps real numbers into $(0, 1)$, which is perfect for probabilities. But its derivative $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ has a hard upper bound. At $x = 0$, the derivative is

$$\sigma'(0) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4},$$

and that is its maximum over all inputs. In a deep network, backpropagation multiplies derivatives through many layers. If each layer contributes a factor smaller than one, the gradient can shrink exponentially. Ten layers at the best-case factor of $\tfrac{1}{4}$ already give

$$\left(\tfrac{1}{4}\right)^{10} \approx 9.5 \times 10^{-7}.$$
That is the vanishing-gradient problem in one line of arithmetic.
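The arithmetic takes three lines to reproduce:

```python
# Best-case sigmoid derivative per layer is sigma'(0) = 0.25; the
# backpropagated gradient picks up one such factor per layer.
factor = 0.25
for depth in (1, 5, 10, 20):
    print(depth, factor ** depth)
```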
Tanh helped, but did not fix the core issue
The hyperbolic tangent is

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

Compared with sigmoid, $\tanh$ maps into $(-1, 1)$ and is zero-centered, which often makes optimization cleaner. But for large positive or negative inputs it still saturates, and its derivative $1 - \tanh^2(x)$ still approaches zero. It improved training dynamics, but it did not fundamentally remove vanishing gradients in deep stacks.
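A quick numerical check of tanh saturation, using only the standard math module:

```python
import math

def dtanh(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2

print(dtanh(0.0))  # 1.0 at the origin, steeper than sigmoid's 0.25
print(dtanh(5.0))  # near zero once the unit saturates
```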
ReLU changed the engineering reality
The rectified linear unit is brutally simple:

$$\mathrm{ReLU}(x) = \max(0, x).$$
Two properties made ReLU decisive:
- On the positive side, the derivative is exactly $1$, so gradients can pass through without shrinking.
- The computation is cheap: no exponentials, just a thresholding operation.
This did not magically solve every optimization problem, but it made deep feedforward networks dramatically easier to train and helped open the door to the modern deep learning era.
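The non-shrinking positive-side gradient can be seen directly:

```python
def relu_grad(x):
    # ReLU subgradient: exactly 1.0 on the positive side, 0.0 elsewhere.
    return 1.0 if x > 0 else 0.0

# On an always-positive path, gradients through ten ReLU layers do not shrink:
g = 1.0
for _ in range(10):
    g *= relu_grad(3.0)  # 3.0 stands in for any positive pre-activation
print(g)  # 1.0
```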
ReLU has its own failure mode
ReLU is not perfect. If a neuron's pre-activation stays negative for every batch, the unit outputs zero and its gradient also stays zero. That neuron becomes inactive and may never recover. This is the dead-ReLU problem.
A standard patch is to let the negative side keep a small slope:

$$f(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha x & \text{otherwise.} \end{cases}$$

In Leaky ReLU, $\alpha$ is fixed, often $\alpha = 0.01$. In PReLU, the negative-side slope $\alpha$ becomes a learned parameter. The principle is the same: do not let the negative region go completely flat.
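A minimal sketch of Leaky ReLU; PReLU would replace the fixed `alpha` with a parameter updated during training:

```python
def leaky_relu(x, alpha=0.01):
    # Negative side keeps a small slope, so the gradient never goes fully flat.
    return x if x > 0 else alpha * x

print(leaky_relu(3.0))   # 3.0: positive side passes through unchanged
print(leaky_relu(-3.0))  # small negative value instead of a hard zero
```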
The deeper lesson
Activation functions matter twice over.
- They provide the nonlinearity that gives multilayer networks real expressive power.
- Their derivatives control whether gradients survive backpropagation through depth.
That is why the history of deep learning is partly a story about activation design. A perceptron without nonlinearity was too rigid. A deep stack of saturating units was too hard to optimize. ReLU and its relatives hit a much better compromise between expressivity and trainability.
So yes, one straight line really did matter. Not because lines are inherently bad, but because a network built only from linear pieces can never learn the kinds of curved structure real tasks require.