MathIsimple
Deep Learning

Softmax and Cross-Entropy, Without the Hand-Waving

Why logits become probabilities, why the exponential shows up, and why the gradient becomes prediction minus target.

Tags: Softmax, Cross-Entropy, Classification, Output Layer, Logistic Regression

A classifier spits out three numbers: 4.2, 1.1, and -0.7. Those numbers are not probabilities. They can be negative. They do not sum to one. And yet they are exactly what you want right before the last step.

The output layer of a neural network is a short pipeline, not a single operation. First the model produces raw scores. Then those scores become a probability distribution. Then the loss function measures how much probability the model gave to the correct answer. Softmax and cross-entropy live in that pipeline together.

Once you see the pieces separately, the whole thing becomes much less mystical.

Every multiclass classifier ends with the same three-step stack

For a mutually exclusive classification problem, the output layer usually looks like this:

\text{logits:}\quad \mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}

\text{probabilities:}\quad \hat{\mathbf{y}} = \operatorname{softmax}(\mathbf{o})

\text{loss:}\quad \ell = -\sum_{j=1}^{q} y_j \log \hat{y}_j

That is the whole recipe:

  • Linear scores tell you how much evidence the model has for each class.
  • Softmax turns those scores into a probability distribution.
  • Cross-entropy punishes the model when the correct class gets low probability.

These raw scores are often called logits. They are not supposed to be human-readable yet. Their job is to carry relative preference information before normalization.
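The whole three-step stack fits in a few lines of NumPy. This is a minimal sketch, not how real frameworks do it (they fuse Softmax into the loss for numerical stability); the logit values are the ones from the opening example:

```python
import numpy as np

def softmax(o):
    """Turn raw logits into a probability distribution."""
    e = np.exp(o - o.max())  # subtracting the max is a standard stability trick
    return e / e.sum()

# Step 1: raw scores. Negative, don't sum to one -- not probabilities yet.
logits = np.array([4.2, 1.1, -0.7])

# Step 2: normalize into a distribution.
probs = softmax(logits)

# Step 3: cross-entropy against a one-hot target (correct class = index 0).
y = np.array([1.0, 0.0, 0.0])
loss = -np.sum(y * np.log(probs))

print(probs)       # sums to 1, largest logit gets the largest probability
print(loss)        # equals -log(probs[0])
```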

Why labels become one-hot vectors

If the classes are cat, dog, and bird, using the integers 0, 1, and 2 is misleading. That encoding accidentally suggests an ordering and a distance structure that the task does not have. Dog is not "between" cat and bird.

The standard fix is one-hot encoding:

| Class | Target Vector |
| ----- | ------------- |
| Cat   | (1, 0, 0)     |
| Dog   | (0, 1, 0)     |
| Bird  | (0, 0, 1)     |

Exactly one entry is on, and the rest are off. That lets the loss focus all of its attention on the probability assigned to the correct class.
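One-hot encoding is a one-liner to implement. A minimal sketch (the helper name `one_hot` is our own, not a standard API):

```python
import numpy as np

classes = ["cat", "dog", "bird"]

def one_hot(label, classes):
    """Encode a class label as a vector with a single 1 at the class's index."""
    v = np.zeros(len(classes))
    v[classes.index(label)] = 1.0
    return v

print(one_hot("dog", classes))  # [0. 1. 0.]
```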

Softmax turns logits into probabilities

Given logits $o_1, \dots, o_q$, Softmax is defined by

\hat{y}_j = \frac{e^{o_j}}{\sum_{k=1}^{q} e^{o_k}}

This does two things at once:

  • $e^{o_j}$ makes every weight positive.
  • Dividing by the total sum makes the outputs add up to $1$.

The result is a proper probability distribution. The largest logit still gets the largest probability, but now the scores are normalized and comparable across classes.

For prediction, you usually take

\operatorname*{argmax}_j \hat{y}_j

But during training, the probability values matter much more than the final argmax.
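Both properties are easy to verify numerically. A quick check, assuming a straightforward NumPy softmax (shifting by the max is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())  # softmax is unchanged by a constant shift
    return e / e.sum()

logits = np.array([4.2, 1.1, -0.7])
probs = softmax(logits)

print(probs.sum())                             # 1 (up to float error)
print(np.argmax(probs) == np.argmax(logits))   # order is preserved
```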

Cross-entropy is maximum likelihood in disguise

Suppose the correct class is $c$. Under one-hot encoding, the cross-entropy loss for one example is

\ell(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j = -\log \hat{y}_c

That last equality is the whole convenience of one-hot labels. Every incorrect class has $y_j = 0$, so its term vanishes. The loss only cares about how much probability the model gave to the right answer.

Probabilistically, this is just maximum likelihood. You are pushing the model to assign high probability to the class that actually occurred.

Cross-entropy is not an arbitrary punishment function. It is the negative log-likelihood of the observed class under the model's predicted distribution.
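The "one term survives" simplification can be checked directly. A small sketch with made-up probability values:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])  # model's predicted distribution
y = np.array([0.0, 1.0, 0.0])      # the true class is index 1

# Full cross-entropy sum vs. the one-hot shortcut:
full = -np.sum(y * np.log(probs))  # every y_j = 0 term vanishes
shortcut = -np.log(probs[1])       # -log of the true class's probability

print(full, shortcut)  # identical
```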

The gradient becomes prediction minus target

This is the part people remember because it is unusually clean. If you combine Softmax with cross-entropy and differentiate with respect to the logits, the gradient simplifies to

\frac{\partial \ell}{\partial o_j} = \hat{y}_j - y_j

You can derive it from first principles. Start with

\ell = -\sum_{j=1}^{q} y_j \log \hat{y}_j, \qquad \hat{y}_j = \frac{e^{o_j}}{\sum_k e^{o_k}}

Substituting the Softmax expression and simplifying, using the fact that one-hot labels satisfy $\sum_j y_j = 1$, gives

\ell = \log\left(\sum_{k=1}^{q} e^{o_k}\right) - \sum_{j=1}^{q} y_j o_j

Differentiate with respect to $o_j$, and the first term yields $\hat{y}_j$ while the second yields $y_j$. The result is the difference between what the model predicted and what the data said.

That form is not just pretty. It is what makes the output-layer gradient so easy to interpret. If the model assigns too much probability to a class, the gradient pushes that logit down. If it assigns too little, the gradient pushes that logit up.
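If the algebra feels too slick, a finite-difference check confirms it. This sketch compares the claimed gradient $\hat{y} - y$ against a numerical derivative of the composed loss:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    """Cross-entropy of the Softmax distribution against one-hot y."""
    return -np.sum(y * np.log(softmax(o)))

o = np.array([4.2, 1.1, -0.7])
y = np.array([1.0, 0.0, 0.0])

analytic = softmax(o) - y  # the claimed gradient: prediction minus target

# Central finite differences on each logit
eps = 1e-6
numeric = np.zeros_like(o)
for j in range(len(o)):
    d = np.zeros_like(o)
    d[j] = eps
    numeric[j] = (loss(o + d, y) - loss(o - d, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two gradients agree
```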

Why argmax cannot train the model

You might wonder: if prediction only needs the largest score, why not skip Softmax and train directly with $\operatorname*{argmax}$?

Because $\operatorname*{argmax}$ is not useful for gradient-based learning. It is flat almost everywhere. Small changes to the logits usually do nothing to the output until one class overtakes another, and then the result changes abruptly. There is no smooth slope to follow.

Softmax fixes that by turning discrete winner-take-all behavior into a differentiable probability distribution. The model can now learn from near misses, not just from complete flips in the predicted label.
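The contrast is easy to see numerically. In this sketch the runner-up logit improves, argmax reports no change at all, but Softmax credits the near miss:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o_before = np.array([2.0, 1.0])
o_after = np.array([2.0, 1.5])  # class 1's logit improved, but still loses

# argmax is blind to the improvement:
print(np.argmax(o_before), np.argmax(o_after))  # 0 0

# Softmax sees it: class 1's probability rises smoothly.
print(softmax(o_before)[1], softmax(o_after)[1])
```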

Why the exponential shows up

The exponential is not the only positive function in the world, but it is the one that makes the algebra work beautifully.

It gives you three advantages:

  • It maps arbitrary real-valued logits to positive weights.
  • It amplifies score differences multiplicatively, which sharpens class preference.
  • It pairs perfectly with the logarithm in cross-entropy, which is why the gradient collapses to $\hat{y}_j - y_j$.

There is also a deeper statistical reason. The Softmax distribution belongs to the exponential family, which is exactly the family that keeps appearing when you ask for maximum-entropy distributions under linear constraints.

Sigmoid is the two-class special case

For two classes with logits $o_1$ and $o_2$, the probability of class 1 under Softmax is

\hat{y}_1 = \frac{e^{o_1}}{e^{o_1} + e^{o_2}} = \frac{1}{1 + e^{-(o_1 - o_2)}}

That is just a sigmoid applied to the logit difference:

\sigma(z) = \frac{1}{1 + e^{-z}}

So logistic regression is the two-class case of Softmax regression. Same story, smaller output space.
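The identity checks out numerically. A sketch with arbitrary logit values:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o1, o2 = 1.3, -0.4

# Two-class Softmax vs. sigmoid of the logit difference:
two_class = softmax(np.array([o1, o2]))[0]
via_sigmoid = sigmoid(o1 - o2)

print(two_class, via_sigmoid)  # the same number
```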

One important caveat: if classes are not mutually exclusive, you should not use Softmax. In multilabel problems, each class gets its own sigmoid because several labels can be true at the same time.

The takeaway

The output layer is not mysterious once you separate its jobs:

  • The model first computes logits, which are just unconstrained scores.
  • Softmax turns those scores into a probability distribution over classes.
  • Cross-entropy measures how much probability the model assigned to the truth.
  • The combined gradient simplifies to prediction minus target.

That is why this combination appears everywhere. It gives you a principled probabilistic objective, a smooth training signal, and a gradient clean enough to make backpropagation efficient.
