MathIsimple
Deep Learning

The Chain Rule in Deep Learning: What Backpropagation Is Really Computing

How local derivatives multiply along a path and add across branches in a neural network.

Chain Rule · Backpropagation · Jacobian · Calculus · Neural Networks

A ten-layer network makes a mistake, and the parameter you need to update lives all the way back in layer one. How does the loss even "know" what that early weight did?

That is the chain rule problem in deep learning. A first-layer weight does not touch the loss directly. It changes one hidden activation, which changes the next layer, which changes the final score, which changes the loss. Backpropagation is the system that keeps track of that entire causal chain without recomputing every dependency from scratch.

Strip away the neural-network jargon and the idea is simple: when one variable affects another through a sequence of intermediate steps, the total sensitivity is built from the local sensitivities along that path.

One chain, one product

Start with a physics example. Suppose velocity depends on time, and kinetic energy depends on velocity:

$$v = 2t, \qquad E = \frac{1}{2}mv^2$$

You want $\frac{dE}{dt}$, but the formula for $E$ does not mention $t$ explicitly. That does not mean $t$ is irrelevant. It means the dependence is indirect.

Time changes velocity. Velocity changes energy. So the total effect is the product of those two local rates:

$$\frac{dE}{dt} = \frac{dE}{dv} \cdot \frac{dv}{dt}$$

Differentiate each piece:

$$\frac{dE}{dv} = mv, \qquad \frac{dv}{dt} = 2, \qquad \frac{dE}{dt} = mv \cdot 2 = 4mt$$

That is the whole idea of the chain rule on a single path: multiply the effect of each local link. In deep learning, the same logic applies. The only difference is that the chains are longer and branch everywhere.
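A quick numerical check makes the product of local rates concrete. The values of `m` and `t` below are arbitrary illustration choices; a central finite difference on the composed function should agree with the closed form $4mt$:

```python
# Numerical check of dE/dt = 4mt for v = 2t, E = (1/2) m v^2.
# m and t are arbitrary illustration values.
m, t, h = 1.5, 3.0, 1e-6

def energy(t):
    v = 2 * t                  # local link: dv/dt = 2
    return 0.5 * m * v ** 2    # local link: dE/dv = m * v

numeric = (energy(t + h) - energy(t - h)) / (2 * h)  # central difference
analytic = 4 * m * t                                 # chain rule: mv * 2 = 4mt

print(numeric, analytic)  # both ≈ 18.0
```

The two numbers match to within finite-difference error, which is the chain rule doing its job on a single path.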

When one input fans out, the contributions add

Neural networks are not simple single-file chains. One activation can influence several downstream neurons at once. That creates multiple paths from the same upstream variable to the final output.

Suppose $x_1$ feeds into two hidden activations, $u_1$ and $u_2$, and both affect the output $y$:

x_1 -> u_1 -> y
x_1 -> u_2 -> y

Now there is no single chain to multiply. There are two. The total effect of $x_1$ on $y$ is the sum of the effects along every valid route:

$$\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1}\frac{\partial u_1}{\partial x_1} + \frac{\partial y}{\partial u_2}\frac{\partial u_2}{\partial x_1}$$

That "multiply along each path, then add across paths" rule is exactly what backpropagation is doing all the time. Branching in the forward pass becomes addition in the backward pass.

With $m$ hidden units, the same rule becomes:

$$\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j}\frac{\partial u_j}{\partial x_i}$$
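The "multiply along each path, add across paths" rule can be verified numerically. The two branch functions below (a sine and a square) are hypothetical choices made for illustration, not part of any particular network:

```python
import math

# Two paths from x1 to y: x1 -> u1 -> y and x1 -> u2 -> y.
def forward(x1):
    u1 = math.sin(x1)          # path 1
    u2 = x1 ** 2               # path 2
    return 3 * u1 + 2 * u2     # y depends on both branches

x1 = 0.7
# Local derivatives along each path:
dy_du1, du1_dx1 = 3.0, math.cos(x1)
dy_du2, du2_dx1 = 2.0, 2 * x1
# Multiply along each path, then add across paths:
chain = dy_du1 * du1_dx1 + dy_du2 * du2_dx1

h = 1e-6
numeric = (forward(x1 + h) - forward(x1 - h)) / (2 * h)
print(chain, numeric)  # both ≈ 3*cos(0.7) + 2.8
```

The hand-assembled sum over paths matches the finite-difference derivative of the composed function.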

A Jacobian is just organized bookkeeping

Once the intermediate layer has many coordinates, writing every partial derivative by hand stops being practical. The clean way to package them is with a Jacobian-style matrix of local sensitivities.

If $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{u} \in \mathbb{R}^m$, define a matrix $A$ by

$$A_{ij} = \frac{\partial u_j}{\partial x_i}, \qquad A \in \mathbb{R}^{n \times m}$$

This convention makes each row correspond to an input coordinate and each column correspond to a hidden coordinate. Then the chain rule becomes a single matrix multiplication:

$$\nabla_{\mathbf{x}} y = A \nabla_{\mathbf{u}} y$$

A concrete shape check helps. If $\mathbf{x} \in \mathbb{R}^3$ and $\mathbf{u} \in \mathbb{R}^2$, then $A$ has shape $3 \times 2$:

$$A = \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} \\ \frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} \\ \frac{\partial u_1}{\partial x_3} & \frac{\partial u_2}{\partial x_3} \end{bmatrix}, \qquad \nabla_{\mathbf{u}} y = \begin{bmatrix} \frac{\partial y}{\partial u_1} \\ \frac{\partial y}{\partial u_2} \end{bmatrix}, \qquad \nabla_{\mathbf{x}} y = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \frac{\partial y}{\partial x_3} \end{bmatrix}$$

The multiplication works out as $(3 \times 2)(2 \times 1) = 3 \times 1$, which is exactly the shape a gradient with respect to $\mathbf{x}$ should have.

Read that equation from right to left. First you receive the upstream gradient $\nabla_{\mathbf{u}} y$ from the layer above. Then you translate it through the local derivative matrix $A$. The result is the gradient with respect to the current layer's input.

Backpropagation is repeated matrix multiplication with local derivative information, moving the gradient backward one layer at a time.

One subtle point: many textbooks define the Jacobian with the opposite orientation. Then you will see a transpose appear somewhere else in the formula. That is not a contradiction. It is just a bookkeeping choice about whether gradients are written as row vectors or column vectors. The computation is the same as long as the shapes are consistent.
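The shape check above can be sketched in NumPy, using the row convention $A_{ij} = \partial u_j / \partial x_i$ from this article. The linear map $\mathbf{u} = W\mathbf{x}$ is a made-up example chosen because its local derivatives are just the entries of $W$, so under this convention $A = W^\top$:

```python
import numpy as np

# Hypothetical linear layer u = W x with x in R^3, u in R^2.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3)
A = W.T                               # (3, 2): rows = inputs, columns = hidden units
grad_u = np.array([0.5, -1.0])        # upstream gradient, shape (2,)

grad_x = A @ grad_u                   # (3, 2) @ (2,) -> (3,)
print(grad_x)  # [-3.5 -4.  -4.5]
```

The result has three components, one per input coordinate, exactly as the $(3 \times 2)(2 \times 1) = 3 \times 1$ count predicts. With the other textbook orientation you would write `W.T @ grad_u` directly as the Jacobian-transpose product; the arithmetic is identical.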

The gradient does not travel by magic

Suppose a layer takes input $\mathbf{x}$, produces output $\mathbf{u}$, and the overall training objective is a loss $L$. The upstream layer tells you how sensitive the loss is to $\mathbf{u}$. Your job is to convert that into sensitivity with respect to $\mathbf{x}$.

$$\text{upstream gradient} = \nabla_{\mathbf{u}} L$$
$$\text{local derivative matrix} = A$$
$$\text{downstream gradient} = \nabla_{\mathbf{x}} L = A \nabla_{\mathbf{u}} L$$

That is the same operation at every layer. Once you know the upstream gradient and the current layer's local derivatives, you can keep pushing the signal backward. Eventually it reaches the parameters, and that tells the optimizer which direction lowers the loss.
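Repeating that operation layer by layer can be sketched as a loop over a stack of hypothetical linear layers (random weights, illustration only):

```python
import numpy as np

# Made-up stack: R^5 -> R^4 -> R^3 -> scalar loss-like output.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 5)),   # layer 1
           rng.standard_normal((3, 4)),   # layer 2
           rng.standard_normal((1, 3))]   # layer 3

grad = np.ones(1)                         # dL/d(output), seeded at the top
for W in reversed(weights):               # walk the stack in reverse
    A = W.T                               # local derivative matrix for u = W x
    grad = A @ grad                       # same operation at every layer
print(grad.shape)  # (5,): gradient w.r.t. the first layer's input
```

Each pass through the loop is one application of the downstream-gradient formula; after the last pass the signal has reached the network's input.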

If gradient descent tells you how to move once you know the gradient, the chain rule tells you how to compute that gradient for hidden layers that never touch the loss directly.

What autodiff libraries are really doing

PyTorch, JAX, and TensorFlow do not use a different mathematics. They build the computational graph during the forward pass and then apply the same chain-rule logic during the backward pass. When you call backward(), the framework walks the graph in reverse, multiplies by each operation's local derivative, and accumulates contributions where paths merge.

That is why the first layer can still get a useful update after the network makes a mistake at the output. Every parameter sits on at least one path from the input to the loss. Backpropagation measures how much the loss would change if that parameter changed slightly, and it does so by stitching together local derivatives all the way backward.
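A miniature reverse-mode node class makes the graph walk explicit. This is a toy sketch, not how PyTorch or JAX are actually implemented, and the simple stack traversal is only safe for shallow graphs like this one (a general engine needs a topological ordering). The point is the accumulation line: each parent's gradient receives one contribution per path.

```python
# Toy reverse-mode autodiff node: value, parents with local derivatives, grad.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self):
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local in node.parents:
                # Multiply along the path, add across paths:
                parent.grad += node.grad * local
                stack.append(parent)

x = Node(2.0)
u1 = x * x          # path 1: x -> u1
u2 = x + x          # path 2: x -> u2
y = u1 + u2         # both paths merge at y; y = x^2 + 2x
y.backward()
print(x.grad)  # dy/dx = 2x + 2 = 6.0
```

The `+=` is where "branching in the forward pass becomes addition in the backward pass" happens in real frameworks too.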

The practical takeaway

The chain rule answers one question: if a variable influences the loss indirectly, how do we measure its total effect? The answer depends on the graph structure.

  • If the effect flows through a single chain, multiply the local derivatives.
  • If the effect splits across several paths, add the path contributions.
  • If the layer is vector-valued, package the local derivatives into a matrix and use matrix multiplication.

That is what backpropagation is computing. Not magic. Not a special trick reserved for neural networks. Just the chain rule, applied carefully at scale.
