A lot of deep learning math looks harder than it really is because each topic arrives with its own notation. Strip the notation away and the same patterns keep showing up.
Gaussian noise gives you squared error. Laplace noise gives you absolute error. Softmax curvature turns into a covariance matrix. LogSumExp keeps your code from overflowing. And a small handful of calculus identities keep reappearing in optimization proofs.
This article is intentionally structured as a compact reference page rather than a single narrative essay. Think of it as the page you return to when you remember the idea but want the derivation back in front of you.
Gaussian noise leads straight to L2 loss
Suppose a linear model generates observations according to

$$
y = w^\top x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).
$$

Then the conditional density of one observation is Gaussian:

$$
p(y \mid x; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - w^\top x)^2}{2\sigma^2}\right).
$$
For a whole dataset, maximizing the likelihood is equivalent to minimizing the negative log-likelihood. After dropping constants, the objective becomes

$$
L(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = \lVert Xw - y \rVert_2^2.
$$
Differentiate with respect to $w$:

$$
\nabla_w L(w) = 2 X^\top (Xw - y).
$$
Setting that gradient to zero yields the normal equation $X^\top X w = X^\top y$ and, when $X^\top X$ is invertible, the least-squares solution:

$$
w^* = (X^\top X)^{-1} X^\top y.
$$
This is the cleanest example of a statistical assumption turning directly into a loss function.
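The derivation above can be checked numerically in a few lines. This is a minimal sketch on made-up toy data (the weights and noise scale are hypothetical); it uses `np.linalg.lstsq`, which solves the same least-squares problem more stably than explicitly inverting $X^\top X$:

```python
import numpy as np

# Hypothetical toy data: y = 2*x0 - 1*x1 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Least-squares fit; equivalent to solving the normal equation,
# but numerically safer than forming (X^T X)^{-1} directly.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the noise is Gaussian, `w_hat` should land close to the true weights, which is exactly the maximum-likelihood argument made above.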
Laplace noise gives you L1 loss instead
Change only the noise model:

$$
\varepsilon \sim \mathrm{Laplace}(0, b).
$$
Now the conditional density becomes

$$
p(y \mid x; w) = \frac{1}{2b} \exp\!\left(-\frac{\lvert y - w^\top x \rvert}{b}\right).
$$
Taking the negative log over the dataset drops you into absolute error:

$$
L(w) = \sum_{i=1}^{n} \lvert y_i - w^\top x_i \rvert.
$$
L1 loss is more robust to outliers, but the price is a kink at zero. The derivative is replaced by a subgradient involving $\operatorname{sign}(y_i - w^\top x_i)$, which means the update magnitude does not naturally shrink as a residual approaches zero.
That is why vanilla SGD with L1 loss can jitter around the optimum. The residual gets tiny, but the subgradient does not fade smoothly the way it does under L2.
In practice, people often soften the kink with Huber loss or rely on learning-rate decay so the oscillation dies out over time.
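The Huber softening mentioned above is easy to sketch: quadratic near zero (so the gradient fades like L2), linear in the tails (so outliers are penalized like L1). This is a minimal version with a hypothetical threshold `delta`:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: softens L1's kink at zero."""
    r = np.abs(residual)
    return np.where(r <= delta,
                    0.5 * r**2,                  # smooth L2 region near zero
                    delta * (r - 0.5 * delta))   # L1 region, matched at r = delta
```

Near zero the gradient is just the residual itself, so SGD updates fade smoothly instead of jittering at a fixed magnitude.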
The Softmax Hessian is a covariance matrix
Let $p = \operatorname{softmax}(z)$. For one example with cross-entropy loss $L = -\sum_k y_k \log p_k$ (with one-hot label $y$), the first derivative with respect to the logits is

$$
\frac{\partial L}{\partial z} = p - y.
$$
Differentiate once more and you get the Hessian:

$$
\nabla_z^2 L = \operatorname{diag}(p) - p p^\top.
$$
Entrywise, that says

$$
H_{ij} = p_i \left(\delta_{ij} - p_j\right).
$$
This matrix is the covariance matrix of a one-hot categorical random variable with class probabilities $p$. That matters because covariance matrices are positive semidefinite, so the local curvature is never negative.
It also explains two familiar properties: the rows sum to zero, and the logits are not all independently identifiable because adding the same constant to every logit does not change the Softmax output.
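Both properties can be verified directly. A minimal sketch, assuming an arbitrary logit vector `z` chosen for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical safety
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
p = softmax(z)
H = np.diag(p) - np.outer(p, p)  # Hessian of cross-entropy w.r.t. the logits
```

The rows of `H` sum to zero (shifting all logits by a constant changes nothing), and its eigenvalues are nonnegative, as a covariance matrix requires.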
Smooth max is always above the hard max
The log-sum-exp function acts like a differentiable approximation to the maximum. For two numbers $a$ and $b$, define

$$
f(a, b) = \log\!\left(e^a + e^b\right).
$$
Assume $a \ge b$. Then

$$
\log\!\left(e^a + e^b\right) = a + \log\!\left(1 + e^{b - a}\right).
$$
The correction term $\log\!\left(1 + e^{b-a}\right)$ is always positive, so the smooth max is strictly larger than the hard max. As $a - b \to \infty$, that correction term shrinks to zero and the expression converges to $\max(a, b)$.
This is one of those results that looks abstract until you realize Softmax, cross-entropy, and temperature scaling are all living in the same neighborhood.
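A tiny numerical sketch of the identity above, using the rearranged form so the exponent is never positive (the function name `smooth_max` is illustrative, not a library API):

```python
import numpy as np

def smooth_max(a, b):
    """log(e^a + e^b), written as max + log(1 + e^{-|a-b|}) for stability."""
    m = max(a, b)
    return m + np.log1p(np.exp(-abs(a - b)))
```

For nearby inputs the result sits visibly above the hard max (at $a = b$ it equals $a + \log 2$); once the gap is large, the correction is negligible.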
LogSumExp is the stability trick you use in real code
Directly computing

$$
\log \sum_{i} e^{x_i}
$$

is numerically dangerous because large positive values of $x_i$ can make $e^{x_i}$ overflow. The stable identity is

$$
\log \sum_{i} e^{x_i} = m + \log \sum_{i} e^{x_i - m}, \qquad m = \max_i x_i.
$$
Since every exponent now has a nonpositive argument, the computation stays in a safe range. Frameworks expose this directly as `torch.logsumexp` and similar APIs for exactly that reason.
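The stable identity translates into a few lines of numpy. A minimal sketch (in real code you would reach for the framework's built-in version instead):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): subtract the max so no exponent is positive."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))
```

The naive `np.log(np.sum(np.exp(x)))` overflows to `inf` for inputs around 1000; the shifted version returns the exact answer.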
Three calculus facts worth keeping in your pocket
A few small derivatives show up often enough that they are worth memorizing.
1. The derivative of $x^x$
Use logarithmic differentiation. If $y = x^x$, then $\ln y = x \ln x$, so

$$
\frac{y'}{y} = \ln x + 1
\quad\Longrightarrow\quad
\frac{d}{dx}\, x^x = x^x \left(\ln x + 1\right).
$$
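A quick finite-difference check confirms the formula; the evaluation point $x = 2$ is an arbitrary choice for illustration:

```python
import numpy as np

def d_x_to_x(x):
    """Analytic derivative of x^x from logarithmic differentiation."""
    return x**x * (np.log(x) + 1.0)

# Central finite difference at x = 2 for comparison
h = 1e-6
numeric = ((2 + h)**(2 + h) - (2 - h)**(2 - h)) / (2 * h)
```

The analytic and numeric values agree to several decimal places.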
2. The gradient of the L2 norm
For $f(x) = \lVert x \rVert_2$ and $x \neq 0$,

$$
\nabla f(x) = \frac{x}{\lVert x \rVert_2}.
$$
At the origin, the expression is undefined because the norm has a cusp there. That is why numerically stable code often adds a small eps before taking a square root or dividing by a norm.
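The eps trick looks like this in practice. A minimal sketch of safe unit-normalization (the helper name and eps value are illustrative):

```python
import numpy as np

def safe_normalize(x, eps=1e-12):
    """Unit-normalize x; eps guards against dividing by zero at the cusp."""
    return x / (np.linalg.norm(x) + eps)
```

For nonzero input the result has unit norm to within eps; for the all-zeros vector it returns zeros instead of NaNs.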
3. The derivative of an inverse function
If $y = f(x)$ and $x = f^{-1}(y)$, differentiate $f(f^{-1}(y)) = y$ to get

$$
\left(f^{-1}\right)'(y) = \frac{1}{f'\!\left(f^{-1}(y)\right)}.
$$
This shows up whenever you change coordinates or work with invertible reparameterizations.
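A concrete check of the identity, using $f(x) = e^x$ so $f^{-1}(y) = \log y$ (a hypothetical choice for illustration):

```python
import numpy as np

# (f^{-1})'(y) = 1 / f'(f^{-1}(y)); with f = exp this should equal 1/y.
y = 5.0
x = np.log(y)                # f^{-1}(y)
inv_deriv = 1.0 / np.exp(x)  # 1 / f'(x)

# Finite-difference derivative of log at y, for comparison
h = 1e-6
fd = (np.log(y + h) - np.log(y - h)) / (2 * h)
```

Both routes give $1/y = 0.2$, which is exactly the familiar derivative of $\log y$.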
The takeaway
These derivations are not side trivia. They tell you why a loss is smooth or nonsmooth, why a Hessian has the shape it has, why some training loops oscillate, and why stable code looks slightly different from naive algebra.
Once those patterns become familiar, a lot of "advanced" deep learning math stops feeling like a wall of symbols. It starts feeling like the same small toolkit, reused carefully.