
A Field Guide to Deep Learning Math

A compact reference for losses, Softmax curvature, LogSumExp, and the calculus facts that keep resurfacing.


A lot of deep learning math looks harder than it really is because each topic arrives with its own notation. Strip the notation away and the same patterns keep showing up.

Gaussian noise gives you squared error. Laplace noise gives you absolute error. Softmax curvature turns into a covariance matrix. LogSumExp keeps your code from overflowing. And a small handful of calculus identities keep reappearing in optimization proofs.

This article is intentionally structured as a compact reference page rather than a single narrative essay. Think of it as the page you return to when you remember the idea but want the derivation back in front of you.

Gaussian noise leads straight to L2 loss

Suppose a linear model generates observations according to

y = \mathbf{w}^\top\mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

Then the conditional density of one observation is Gaussian:

p(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mathbf{w}^\top\mathbf{x})^2}{2\sigma^2}\right)

For a whole dataset, maximizing likelihood is equivalent to minimizing the negative log-likelihood. After dropping constants, the objective becomes

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2

Differentiate with respect to \mathbf{w}:

\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) = 2(\mathbf{X}^\top\mathbf{X}\mathbf{w} - \mathbf{X}^\top\mathbf{y})

Setting that gradient to zero yields the normal equation and, when invertibility holds, the least-squares solution:

\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

This is the cleanest example of a statistical assumption turning directly into a loss function.
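The derivation above is easy to sanity-check numerically. A minimal NumPy sketch with synthetic Gaussian-noise data (the shapes and constants here are illustrative, not from the article):

```python
import numpy as np

# Synthetic data matching the model y = w^T x + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal equation: X^T X w = X^T y.
# np.linalg.solve avoids explicitly forming the inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to w_true
```

For ill-conditioned `X`, `np.linalg.lstsq(X, y)` is the safer call, since it avoids squaring the condition number the way forming `X.T @ X` does.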

Laplace noise gives you L1 loss instead

Change only the noise model:

p(\epsilon) = \frac{1}{2}e^{-\lvert \epsilon \rvert}

Now the conditional density becomes

p(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{2}\exp\left(-\lvert y - \mathbf{w}^\top\mathbf{x} \rvert\right)

Taking the negative log over the dataset drops you into absolute error:

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_1

L1 loss is more robust to outliers, but the price is a kink at zero. The derivative is replaced by a subgradient involving \operatorname{sign}(r_i), where r_i = y_i - \mathbf{w}^\top\mathbf{x}_i is the residual, which means the update magnitude does not naturally shrink as a residual approaches zero.

That is why vanilla SGD with L1 loss can jitter around the optimum. The residual gets tiny, but the subgradient does not fade smoothly the way it does under L2.

In practice, people often soften the kink with Huber loss or rely on learning-rate decay so the oscillation dies out over time.
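The jitter is visible even in one dimension. A minimal sketch (the learning rate and starting point are arbitrary choices for illustration):

```python
import numpy as np

# Gradient descent on a 1-D problem under L2 vs L1 loss, target at 0.
# Under L2 the step shrinks with the residual; under L1 the subgradient
# keeps magnitude 1, so the step stays at roughly lr forever.
target = 0.0
lr = 0.1

x_l2 = x_l1 = 1.05
for _ in range(50):
    x_l2 -= lr * 2 * (x_l2 - target)      # gradient of (x - target)^2
    x_l1 -= lr * np.sign(x_l1 - target)   # subgradient of |x - target|

print(abs(x_l2 - target))  # decays smoothly toward 0
print(abs(x_l1 - target))  # stuck oscillating at roughly lr/2 scale
```

Shrinking `lr` over time is exactly the learning-rate-decay fix mentioned above: the oscillation amplitude tracks the step size.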

The Softmax Hessian is a covariance matrix

Let \mathbf{p} = \operatorname{softmax}(\mathbf{o}) and let \mathbf{y} be the one-hot label. For one example with cross-entropy loss, the first derivative with respect to the logits is

\nabla_{\mathbf{o}} \ell = \mathbf{p} - \mathbf{y}

Differentiate once more and you get the Hessian:

\nabla_{\mathbf{o}}^2 \ell = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top

Entrywise, that says

\frac{\partial^2 \ell}{\partial o_i \partial o_j} = p_i(\delta_{ij} - p_j)

This matrix is the covariance matrix of a one-hot categorical random variable with class probabilities \mathbf{p}. That matters because covariance matrices are positive semidefinite, so the local curvature is never negative.

It also explains two familiar properties: the rows sum to zero, and the logits are not all independently identifiable because adding the same constant to every logit does not change the Softmax output.
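Both properties can be verified directly. A minimal NumPy sketch with an arbitrary three-class logit vector:

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())  # shift by the max for numerical stability
    return z / z.sum()

o = np.array([2.0, 0.5, -1.0])
p = softmax(o)

# Hessian of cross-entropy w.r.t. the logits: diag(p) - p p^T.
H = np.diag(p) - np.outer(p, p)

print(H.sum(axis=1))                # each row sums to zero
print(np.linalg.eigvalsh(H).min())  # smallest eigenvalue >= 0 (PSD)
```

The zero row sums mirror the shift invariance: a direction of all-ones logit perturbation lies in the Hessian's null space.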

Smooth max is always above the hard max

The log-sum-exp function acts like a differentiable approximation to the maximum. For two numbers a and b, define

\operatorname{SmoothMax}_{\lambda}(a, b) = \frac{1}{\lambda}\log\left(e^{\lambda a} + e^{\lambda b}\right)

Assume a \ge b. Then

\operatorname{SmoothMax}_{\lambda}(a, b) = a + \frac{1}{\lambda}\log\left(1 + e^{\lambda(b-a)}\right)

The correction term is always positive, so smooth max is strictly larger than the hard max. As \lambda \to \infty, that correction term shrinks to zero and the expression converges to \max(a, b).
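The convergence is fast in practice. A minimal sketch using the shifted form above (the inputs 3.0 and 2.5 are arbitrary):

```python
import math

def smooth_max(a, b, lam):
    # SmoothMax in the shifted form: hi + log(1 + e^{lam*(lo - hi)}) / lam.
    # math.log1p is accurate when its argument is tiny.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lam * (lo - hi))) / lam

for lam in (1, 10, 100):
    print(lam, smooth_max(3.0, 2.5, lam))  # always > 3.0, tightening toward it
```

At \lambda = 1 the gap is visible; by \lambda = 100 it is far below float precision, because the correction decays like e^{-\lambda(a-b)}/\lambda.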

This is one of those results that looks abstract until you realize Softmax, cross-entropy, and temperature scaling are all living in the same neighborhood.

LogSumExp is the stability trick you use in real code

Directly computing

\log\left(\sum_{i=1}^{n} e^{x_i}\right)

is numerically dangerous because large positive values of x_i can make e^{x_i} overflow. The stable identity is

\operatorname{LogSumExp}(\mathbf{x}) = x_{\max} + \log\left(\sum_{i=1}^{n} e^{x_i - x_{\max}}\right)

Since every exponent now has a nonpositive argument, the computation stays in a safe range. Frameworks expose this directly as torch.logsumexp and similar APIs for exactly that reason.
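The failure mode and the fix are both a few lines of NumPy (the input values are chosen to force float64 overflow):

```python
import numpy as np

def logsumexp(x):
    # Stable identity: subtract the max so every exponent is <= 0.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # exp(1000) overflows float64 -> inf

print(naive)        # inf
print(logsumexp(x))  # finite, about 1002.41
```

`torch.logsumexp` and `scipy.special.logsumexp` implement the same shift internally, so in real code you would call those rather than hand-roll it.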

Three calculus facts worth keeping in your pocket

A few small derivatives show up often enough that they are worth memorizing.

1. The derivative of x^x

Use logarithmic differentiation. If y = x^x, then \log y = x\log x; differentiating both sides gives y'/y = \log x + 1, so

\frac{dy}{dx} = x^x(\log x + 1)
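A quick finite-difference check, evaluated at an arbitrary point x = 1.5:

```python
import math

def f(x):
    return x ** x

x, h = 1.5, 1e-6
analytic = f(x) * (math.log(x) + 1)          # the formula above
numeric = (f(x + h) - f(x - h)) / (2 * h)    # central difference
print(analytic, numeric)  # the two values agree closely
```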

2. The gradient of the L2 norm

For f(\mathbf{x}) = \|\mathbf{x}\|_2 and \mathbf{x} \ne \mathbf{0},

\nabla f(\mathbf{x}) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}

At the origin, the expression is undefined because the norm has a cusp there. That is why numerically stable code often adds a small eps before taking a square root or dividing by a norm.
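The eps guard looks like this in practice (the eps value is a typical but arbitrary choice):

```python
import numpy as np

def l2_grad(x, eps=1e-12):
    # Gradient of ||x||_2, guarded so the origin returns 0 instead of NaN.
    return x / (np.linalg.norm(x) + eps)

print(l2_grad(np.array([3.0, 4.0])))  # [0.6, 0.8], i.e. x / ||x||
print(l2_grad(np.zeros(2)))           # [0., 0.] instead of NaN
```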

3. The derivative of an inverse function

If y = f(x) and x = f^{-1}(y), differentiate f(f^{-1}(y)) = y to get

(f^{-1})'(y) = \frac{1}{f'(f^{-1}(y))}

This shows up whenever you change coordinates or work with invertible reparameterizations.
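A concrete instance, with f = exp so that f^{-1} = log and the rule predicts (\log)'(y) = 1/e^{\log y} = 1/y (the evaluation point y = 5 is arbitrary):

```python
import math

y, h = 5.0, 1e-6
predicted = 1 / math.exp(math.log(y))                  # rule applied to f = exp
numeric = (math.log(y + h) - math.log(y - h)) / (2 * h)  # central difference
print(predicted, numeric)  # both close to 1/y = 0.2
```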

The takeaway

These derivations are not side trivia. They tell you why a loss is smooth or nonsmooth, why a Hessian has the shape it has, why some training loops oscillate, and why stable code looks slightly different from naive algebra.

Once those patterns become familiar, a lot of "advanced" deep learning math stops feeling like a wall of symbols. It starts feeling like the same small toolkit, reused carefully.
