
A Field Guide to Deep Learning Math

A compact reference for losses, Softmax curvature, LogSumExp, and the calculus facts that keep resurfacing.


A lot of deep learning math looks harder than it really is because each topic arrives with its own notation. Strip the notation away and the same patterns keep showing up.

Gaussian noise gives you squared error. Laplace noise gives you absolute error. Softmax curvature turns into a covariance matrix. LogSumExp keeps your code from overflowing. And a small handful of calculus identities keep reappearing in optimization proofs.

This article is intentionally structured as a compact reference page rather than a single narrative essay. Think of it as the page you return to when you remember the idea but want the derivation back in front of you.

Gaussian noise leads straight to L2 loss

Suppose a linear model generates observations according to

y = \mathbf{w}^\top\mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

Then the conditional density of one observation is Gaussian:

p(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mathbf{w}^\top\mathbf{x})^2}{2\sigma^2}\right)

For a whole dataset, maximizing likelihood is equivalent to minimizing the negative log-likelihood. After dropping constants, the objective becomes

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2

Differentiate with respect to \mathbf{w}:

\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) = 2(\mathbf{X}^\top\mathbf{X}\mathbf{w} - \mathbf{X}^\top\mathbf{y})

Setting that gradient to zero yields the normal equation and, when invertibility holds, the least-squares solution:

\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

This is the cleanest example of a statistical assumption turning directly into a loss function.
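The derivation above is easy to sanity-check numerically. A minimal NumPy sketch with synthetic Gaussian-noise data (the shapes and constants here are illustrative, not from the article):

```python
import numpy as np

# Synthetic data matching the model y = w^T x + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal equation: X^T X w = X^T y.
# np.linalg.solve avoids explicitly forming the inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to w_true
```

For ill-conditioned `X`, `np.linalg.lstsq(X, y)` is the safer call, since it avoids squaring the condition number the way forming `X.T @ X` does.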

Laplace noise gives you L1 loss instead

Change only the noise model:

p(\epsilon) = \frac{1}{2}e^{-\lvert \epsilon \rvert}

Now the conditional density becomes

p(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{2}\exp\left(-\lvert y - \mathbf{w}^\top\mathbf{x} \rvert\right)

Taking the negative log over the dataset drops you into absolute error:

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_1

L1 loss is more robust to outliers, but the price is a kink at zero. The derivative is replaced by a subgradient involving \operatorname{sign}(r_i), where r_i = y_i - \mathbf{w}^\top\mathbf{x}_i is the residual, which means the update magnitude does not naturally shrink as a residual approaches zero.

That is why vanilla SGD with L1 loss can jitter around the optimum. The residual gets tiny, but the subgradient does not fade smoothly the way it does under L2.

In practice, people often soften the kink with Huber loss or rely on learning-rate decay so the oscillation dies out over time.
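The jitter is visible even in one dimension. A minimal sketch (the learning rate and starting point are arbitrary choices for illustration):

```python
import numpy as np

# Gradient descent on a 1-D problem under L2 vs L1 loss, target at 0.
# Under L2 the step shrinks with the residual; under L1 the subgradient
# keeps magnitude 1, so the step stays at roughly lr forever.
target = 0.0
lr = 0.1

x_l2 = x_l1 = 1.05
for _ in range(50):
    x_l2 -= lr * 2 * (x_l2 - target)      # gradient of (x - target)^2
    x_l1 -= lr * np.sign(x_l1 - target)   # subgradient of |x - target|

print(abs(x_l2 - target))  # decays smoothly toward 0
print(abs(x_l1 - target))  # stuck oscillating at roughly lr/2 scale
```

Shrinking `lr` over time is exactly the learning-rate-decay fix mentioned above: the oscillation amplitude tracks the step size.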

The Softmax Hessian is a covariance matrix

Let \mathbf{p} = \operatorname{softmax}(\mathbf{o}) and let \mathbf{y} be the one-hot label. For one example with cross-entropy loss, the first derivative with respect to the logits is

\nabla_{\mathbf{o}} \ell = \mathbf{p} - \mathbf{y}

Differentiate once more and you get the Hessian:

\nabla_{\mathbf{o}}^2 \ell = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top

Entrywise, that says

\frac{\partial^2 \ell}{\partial o_i \partial o_j} = p_i(\delta_{ij} - p_j)

This matrix is the covariance matrix of a one-hot categorical random variable with class probabilities \mathbf{p}. That matters because covariance matrices are positive semidefinite, so the local curvature is never negative.

It also explains two familiar properties: the rows sum to zero, and the logits are not all independently identifiable because adding the same constant to every logit does not change the Softmax output.
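Both properties can be verified directly. A minimal NumPy sketch with an arbitrary three-class logit vector:

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())  # shift by the max for numerical stability
    return z / z.sum()

o = np.array([2.0, 0.5, -1.0])
p = softmax(o)

# Hessian of cross-entropy w.r.t. the logits: diag(p) - p p^T.
H = np.diag(p) - np.outer(p, p)

print(H.sum(axis=1))                # each row sums to zero
print(np.linalg.eigvalsh(H).min())  # smallest eigenvalue >= 0 (PSD)
```

The zero row sums mirror the shift invariance: a direction of all-ones logit perturbation lies in the Hessian's null space.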

Smooth max is always above the hard max

The log-sum-exp function acts like a differentiable approximation to the maximum. For two numbers a and b, define

\operatorname{SmoothMax}_{\lambda}(a, b) = \frac{1}{\lambda}\log\left(e^{\lambda a} + e^{\lambda b}\right)

Assume a \ge b. Then

\operatorname{SmoothMax}_{\lambda}(a, b) = a + \frac{1}{\lambda}\log\left(1 + e^{\lambda(b-a)}\right)

The correction term is always positive, so smooth max is strictly larger than the hard max. As \lambda \to \infty, that correction term shrinks to zero and the expression converges to \max(a, b).
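The convergence is fast in practice. A minimal sketch using the shifted form above (the inputs 3.0 and 2.5 are arbitrary):

```python
import math

def smooth_max(a, b, lam):
    # SmoothMax in the shifted form: hi + log(1 + e^{lam*(lo - hi)}) / lam.
    # math.log1p is accurate when its argument is tiny.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lam * (lo - hi))) / lam

for lam in (1, 10, 100):
    print(lam, smooth_max(3.0, 2.5, lam))  # always > 3.0, tightening toward it
```

At \lambda = 1 the gap is visible; by \lambda = 100 it is far below float precision, because the correction decays like e^{-\lambda(a-b)}/\lambda.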

This is one of those results that looks abstract until you realize Softmax, cross-entropy, and temperature scaling are all living in the same neighborhood.

LogSumExp is the stability trick you use in real code

Directly computing

\log\left(\sum_{i=1}^{n} e^{x_i}\right)

is numerically dangerous because large positive values of x_i can make e^{x_i} overflow. The stable identity is

\operatorname{LogSumExp}(\mathbf{x}) = x_{\max} + \log\left(\sum_{i=1}^{n} e^{x_i - x_{\max}}\right)

Since every exponent now has a nonpositive argument, the computation stays in a safe range. Frameworks expose this directly as torch.logsumexp and similar APIs for exactly that reason.
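The failure mode and the fix are both a few lines of NumPy (the input values are chosen to force float64 overflow):

```python
import numpy as np

def logsumexp(x):
    # Stable identity: subtract the max so every exponent is <= 0.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # exp(1000) overflows float64 -> inf

print(naive)        # inf
print(logsumexp(x))  # finite, about 1002.41
```

`torch.logsumexp` and `scipy.special.logsumexp` implement the same shift internally, so in real code you would call those rather than hand-roll it.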

Three calculus facts worth keeping in your pocket

A few small derivatives show up often enough that they are worth memorizing.

1. The derivative of x^x

Use logarithmic differentiation. If y = x^x, then \log y = x\log x; differentiating both sides gives y'/y = \log x + 1, so

\frac{dy}{dx} = x^x(\log x + 1)
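A quick finite-difference check, evaluated at an arbitrary point x = 1.5:

```python
import math

def f(x):
    return x ** x

x, h = 1.5, 1e-6
analytic = f(x) * (math.log(x) + 1)          # the formula above
numeric = (f(x + h) - f(x - h)) / (2 * h)    # central difference
print(analytic, numeric)  # the two values agree closely
```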

2. The gradient of the L2 norm

For f(\mathbf{x}) = \|\mathbf{x}\|_2 and \mathbf{x} \ne \mathbf{0},

\nabla f(\mathbf{x}) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}

At the origin, the expression is undefined because the norm has a cusp there. That is why numerically stable code often adds a small eps before taking a square root or dividing by a norm.
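The eps guard looks like this in practice (the eps value is a typical but arbitrary choice):

```python
import numpy as np

def l2_grad(x, eps=1e-12):
    # Gradient of ||x||_2, guarded so the origin returns 0 instead of NaN.
    return x / (np.linalg.norm(x) + eps)

print(l2_grad(np.array([3.0, 4.0])))  # [0.6, 0.8], i.e. x / ||x||
print(l2_grad(np.zeros(2)))           # [0., 0.] instead of NaN
```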

3. The derivative of an inverse function

If y = f(x) and x = f^{-1}(y), differentiate f(f^{-1}(y)) = y to get

(f^{-1})'(y) = \frac{1}{f'(f^{-1}(y))}

This shows up whenever you change coordinates or work with invertible reparameterizations.
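A concrete instance, with f = exp so that f^{-1} = log and the rule predicts (\log)'(y) = 1/e^{\log y} = 1/y (the evaluation point y = 5 is arbitrary):

```python
import math

y, h = 5.0, 1e-6
predicted = 1 / math.exp(math.log(y))                  # rule applied to f = exp
numeric = (math.log(y + h) - math.log(y - h)) / (2 * h)  # central difference
print(predicted, numeric)  # both close to 1/y = 0.2
```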

The takeaway

These derivations are not side trivia. They tell you why a loss is smooth or nonsmooth, why a Hessian has the shape it has, why some training loops oscillate, and why stable code looks slightly different from naive algebra.

Once those patterns become familiar, a lot of "advanced" deep learning math stops feeling like a wall of symbols. It starts feeling like the same small toolkit, reused carefully.
