MathIsimple
Deep Learning

PyTorch Softmax Regression: Why the Code Never Calls Softmax

Why logits go straight into CrossEntropyLoss, and why that is the numerically stable thing to do.

PyTorch · Softmax Regression · CrossEntropyLoss · Logits · Numerical Stability

The first time you implement Softmax regression in PyTorch, something feels off. The model is clearly doing multiclass classification, but the code never calls softmax.

That is not an omission. It is a deliberate numerical-stability choice. During training, PyTorch wants raw logits, not hand-normalized probabilities, because the stable form of the objective does not actually need an explicit Softmax step.

Figure: logits flow into a stable cross-entropy objective via LogSumExp instead of an explicit Softmax call. Training uses logits directly because the stable loss is written in terms of LogSumExp, not explicit probabilities.

Why naive Softmax is numerically dangerous

For logits $\mathbf{z} = (z_1, \dots, z_q)$, Softmax is

$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{q} e^{z_k}}$$

On paper, this is harmless. On real hardware, it can fall apart quickly.

If one logit is very large, say $z_j = 100$, then $e^{z_j}$ may overflow to inf. That is the first failure mode. The standard workaround is to subtract the maximum logit before exponentiating:

$$p_j = \frac{e^{z_j - z_{\max}}}{\sum_{k=1}^{q} e^{z_k - z_{\max}}}$$

This fixes overflow because the largest exponent becomes $e^0 = 1$. But once you later feed those probabilities into cross-entropy, a second problem appears. Very negative logits can underflow so aggressively that their exponential becomes numerically zero. If the correct class receives that zero and you take a logarithm, you get $\log 0 = -\infty$.

That is why the practical issue is not Softmax alone. It is the Softmax-plus-log pipeline that appears in cross-entropy training.
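Both failure modes are easy to reproduce. The sketch below uses made-up logits (the values are illustrative, not from the article) to trigger overflow in a naive Softmax and underflow in the subsequent log:

```python
import torch

z = torch.tensor([100.0, 0.0, -100.0])

# Failure 1: exp(100) overflows float32 to inf, and inf/inf yields nan
naive = torch.exp(z) / torch.exp(z).sum()

# Max-subtraction (done internally by torch.softmax) fixes the overflow
stable = torch.softmax(z, dim=0)

# Failure 2: the smallest probability underflows to exactly 0,
# so taking a log afterwards produces -inf
log_stable = torch.log(stable)
```

The `naive` vector contains nan, and even the stable probabilities break once you log them, which is exactly the Softmax-plus-log pipeline problem.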

The stable loss never needs explicit probabilities

Suppose the correct class is $y$. The negative log-likelihood of the Softmax model is

$$\mathcal{L}(\mathbf{z}, y) = -\log \frac{e^{z_y}}{\sum_{k=1}^{q} e^{z_k}}$$

Expand the logarithm and the expensive-looking fraction simplifies immediately:

$$\mathcal{L}(\mathbf{z}, y) = -z_y + \log\left(\sum_{k=1}^{q} e^{z_k}\right)$$

This is the key identity. Once you write the loss this way, you no longer need to materialize the probability vector first. You only need the correct-class logit and a stable computation of

$$\operatorname{LogSumExp}(\mathbf{z}) = \log\left(\sum_{k=1}^{q} e^{z_k}\right)$$

PyTorch computes this with the usual max-subtraction trick internally:

$$\operatorname{LogSumExp}(\mathbf{z}) = z_{\max} + \log\left(\sum_{k=1}^{q} e^{z_k - z_{\max}}\right)$$

Subtracting $z_{\max}$ ensures every exponent is non-positive, which eliminates overflow. The key gain is avoiding the unstable sequence of explicit Softmax followed by a logarithm: it does not eliminate all floating-point error, but it removes the specific overflow-then-underflow failure mode that makes naive cross-entropy numerically unreliable.
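The max-subtraction form can be written out by hand and checked against torch.logsumexp on logits large enough to break the naive sum (the function name logsumexp_stable and the values are illustrative):

```python
import torch

def logsumexp_stable(z: torch.Tensor) -> torch.Tensor:
    # Subtract the max so every exponent is <= 0, then add it back outside the log
    z_max = z.max()
    return z_max + torch.log(torch.exp(z - z_max).sum())

z = torch.tensor([1000.0, 999.0, 998.0])   # exp(1000) overflows float32
naive = torch.log(torch.exp(z).sum())      # inf: the naive sum overflows
stable = logsumexp_stable(z)               # finite, matches torch.logsumexp
```

The naive version returns inf, while the max-subtracted version agrees with PyTorch's built-in to floating-point precision.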

So when you pass logits into nn.CrossEntropyLoss, PyTorch is effectively computing a stable log_softmax followed by negative log-likelihood, all in one go.

Softmax regression is still the model. The framework just evaluates the training objective in a more stable algebraic form.
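That equivalence is easy to verify numerically. This sketch compares PyTorch's cross-entropy against the $-z_y + \operatorname{LogSumExp}(\mathbf{z})$ form on made-up logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)          # batch of 5 examples, 10 classes
y = torch.randint(0, 10, (5,))       # correct class per example

# The identity: L(z, y) = -z_y + LogSumExp(z), computed per example
manual = -logits[torch.arange(5), y] + torch.logsumexp(logits, dim=1)

# PyTorch's built-in, kept unreduced for an element-wise comparison
builtin = F.cross_entropy(logits, y, reduction='none')
```

No Softmax call appears anywhere, yet both sides are the Softmax negative log-likelihood.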

That is why the model outputs logits

The sample below follows the D2L textbook style and uses the d2l helper library from Dive into Deep Learning. The math is completely standard, but some of the boilerplate is hidden inside those helper functions. If you want the fully manual version, the companion article on pure PyTorch MLP training walks through the explicit loop.

In that D2L implementation, the model is just a flattened linear classifier:

import torch
from torch import nn
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

# Flatten 28x28 images to 784-vectors, then map to 10 raw class scores (logits)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

# Logits go in directly; the loss applies a stable log-softmax internally
loss = nn.CrossEntropyLoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

The final layer is nn.Linear(784, 10), so the model returns ten raw scores. Those scores are logits. They are not probabilities, and they are not supposed to be.

The loss function is the component that interprets those logits probabilistically. That separation is mathematically cleaner and numerically safer.

What each part of the code is really doing

1. Data loading

load_data_fashion_mnist(batch_size) returns iterators that stream mini-batches from Fashion-MNIST. The batch size controls the trade-off between noisy updates and memory cost. A batch of one sample is too noisy; the full dataset is too expensive. Mini-batches live in the middle.
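Under the hood the helper wraps a standard DataLoader. A minimal stand-in with synthetic tensors (so nothing is downloaded; the shapes mimic Fashion-MNIST but the data is random) shows the mini-batch mechanics:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for Fashion-MNIST: fake 28x28 grayscale images, 10 classes
X = torch.randn(1000, 1, 28, 28)
y = torch.randint(0, 10, (1000,))
train_iter = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

# Each iteration yields one mini-batch of images and labels
Xb, yb = next(iter(train_iter))
```

Each batch carries 256 images and 256 labels; the loop sees the full dataset once per epoch, one batch at a time.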

2. Model definition

The classifier itself is

$$\mathbf{z} = W\mathbf{x} + \mathbf{b}$$

After flattening a $28 \times 28$ image into a vector in $\mathbb{R}^{784}$, the linear layer maps it to $\mathbb{R}^{10}$. Those ten coordinates are the logits for the ten classes.
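A quick shape check makes this concrete (the input tensor here is illustrative, not part of the original listing):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

X = torch.randn(4, 1, 28, 28)   # a mini-batch of 4 image-shaped tensors
logits = net(X)                 # Flatten: (4, 1, 28, 28) -> (4, 784), Linear: -> (4, 10)
```

Each row of `logits` holds ten raw scores; nothing constrains them to be positive or to sum to one.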

3. Weight initialization

The call

nn.init.normal_(m.weight, std=0.01)

initializes the linear weights with a small Gaussian distribution. The point is not to encode any structure. It is to keep the starting logits small enough that optimization begins in a stable regime.
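The effect is visible in the scale of the starting logits. With std=0.01 weights on roughly unit-scale inputs, each logit has variance near $784 \times 0.01^2 \approx 0.08$, so the initial scores sit close to zero (a small illustrative check; the zeroed bias is an assumption for clarity):

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
nn.init.normal_(net[1].weight, std=0.01)   # small Gaussian weights
nn.init.zeros_(net[1].bias)                # zero bias, for a clean illustration

X = torch.randn(64, 1, 28, 28)
logits = net(X)
# logits have std around sqrt(0.08) ~ 0.28: a mild, stable starting regime
```

Small starting logits mean the initial Softmax is close to uniform and the first gradients are well-behaved.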

4. Loss and optimizer

nn.CrossEntropyLoss(reduction='none') returns one loss per example rather than immediately averaging. In this D2L example, that is intentional: train_ch3 performs its own batch aggregation and metric bookkeeping. In a standalone PyTorch training loop, you would more often keep the default reduction='mean' so that loss.backward() operates on one scalar without extra manual reduction. The optimizer here is plain SGD:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}$$

where $\eta$ is the learning rate. The key implementation point is that the D2L helper loop aggregates the per-example losses before backpropagation. If you copy this snippet into your own script without that surrounding logic, call .mean() on the loss tensor before .backward(), or switch to the default reduction='mean'.
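The two reductions are easy to compare side by side (the model and data here are synthetic placeholders):

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
X = torch.randn(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))

per_example = nn.CrossEntropyLoss(reduction='none')(net(X), y)  # shape (8,)
scalar = nn.CrossEntropyLoss()(net(X), y)                       # default 'mean'

# With reduction='none' you must aggregate yourself before backprop
per_example.mean().backward()
```

Averaging the per-example vector reproduces the default scalar exactly, which is why either form trains the same model.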

5. Training loop

D2L hides the explicit loop, but the logic is the usual one:

  1. Compute logits with the current parameters.
  2. Evaluate the stable cross-entropy objective from those logits.
  3. Backpropagate to obtain parameter gradients.
  4. Update the parameters with SGD.
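The steps above can be sketched as an explicit loop. This version runs on synthetic stand-in data (random tensors, not Fashion-MNIST) so it is self-contained; because the objective is convex, the loss drops even in a few full-batch iterations:

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
loss_fn = nn.CrossEntropyLoss()                      # default reduction='mean'
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

X = torch.randn(256, 1, 28, 28)                      # fake mini-batch
y = torch.randint(0, 10, (256,))

initial = loss_fn(net(X), y).item()
for _ in range(5):
    logits = net(X)                  # 1. compute logits with current parameters
    loss = loss_fn(logits, y)        # 2. evaluate stable cross-entropy on logits
    trainer.zero_grad()
    loss.backward()                  # 3. backpropagate to get gradients
    trainer.step()                   # 4. SGD parameter update
final = loss_fn(net(X), y).item()
```

Note that no Softmax appears in the loop: the loss consumes logits directly.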

When should you actually call Softmax?

During training, usually never. At inference time, maybe.

If you only need the predicted label, $\operatorname*{argmax}_j z_j$ is enough, because Softmax preserves the ordering of logits. The class with the largest logit is also the class with the largest probability.
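This order preservation is a one-line check (random logits, purely illustrative):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(6, 10)
probs = torch.softmax(logits, dim=1)

# Softmax is strictly increasing in each logit, so the per-row ranking is unchanged
same = torch.equal(logits.argmax(dim=1), probs.argmax(dim=1))
```

So for plain classification, normalizing first buys you nothing.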

If you want normalized class scores for display, logging, or downstream decision logic, then applying Softmax after the model output is fine:

logits = net(X)
probs = torch.softmax(logits, dim=1)

The key is just not to do this before feeding the output into CrossEntropyLoss.

The mathematical takeaway

PyTorch omits explicit Softmax in training code for a principled reason. The multiclass likelihood can be rewritten as

$$\mathcal{L}(\mathbf{z}, y) = -z_y + \operatorname{LogSumExp}(\mathbf{z})$$

That form avoids the unstable sequence “exponentiate first, then take logarithms.” So the code looks like logits-only training, but mathematically it is still Softmax regression. The Softmax is simply hidden inside the stable loss computation, where it belongs.
