Many tutorials introduce backpropagation by diving straight into matrix calculus. But before memorizing the formulas for deep linear layers and cross-entropy, it helps to walk through a small, concrete computation graph by hand. Tracing a single forward and backward pass reveals the mechanics of branching gradients and regularization without losing them in the notation.
In this example, we will track four parameters a, b, c, d through a small artificial network across two parameter-update steps.
The computation graph and the constants
To keep the arithmetic clean, we use a simple scalar input and a set of elementary functions. The given constants for our training steps are:
- Input: x=2
- Target label: y=1
- Regularization coefficient: λ=0.1
- Learning rate: η=0.1
The forward pass is defined by the following sequence of operations:
- u = ax + b (linear transform)
- h = u² (nonlinear activation)
- v = ch + u + d (second linear branch)
- p = 1/(1 + v) (output function)
- L = (p − y)² (data loss)
- R = (λ/2)(a² + c²) (L2 regularization)
- J = L + R (total objective)

Notice the equation v = ch + u + d. The variable u flows into v through two paths: directly, and indirectly via h. This is a classic "branch and merge" in a computation graph. During backpropagation, the gradients along these two paths must be added together.
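Under these definitions, the forward pass can be sketched in a few lines of scalar Python (a minimal sketch; variable names mirror the article's symbols):

```python
def forward(a, b, c, d, x, y, lam):
    """One forward pass through the small computation graph."""
    u = a * x + b                       # linear transform
    h = u ** 2                          # nonlinear activation
    v = c * h + u + d                   # u enters twice: via h and directly
    p = 1.0 / (1.0 + v)                 # output function
    L = (p - y) ** 2                    # data loss
    R = (lam / 2) * (a ** 2 + c ** 2)   # L2 regularization
    return u, h, v, p, L, L + R         # J = L + R

# Constants from the article: x = 2, y = 1, λ = 0.1, initial a = c = 1, b = d = 0.
u, h, v, p, L, J = forward(1.0, 0.0, 1.0, 0.0, x=2.0, y=1.0, lam=0.1)
# u = 2, h = 4, v = 6, p = 1/7 ≈ 0.142857, J ≈ 0.8347
```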
Backpropagation: Mapping the paths
Before calculating derivatives, it is useful to map out the routes from the final objective J back to the parameters.
For parameter a, there are two sources of gradient:
- Data path: J→L→p→v→u→a
- Regularization path: J→R→a
For parameter c, we also have two sources:
- Data path: J→L→p→v→h→c
- Regularization path: J→R→c
The most critical detail is calculating how v depends on u. Because of the branching paths, the chain rule dictates that we sum the direct and indirect sensitivities:
∂v/∂u = ∂(ch + u + d)/∂u = c·∂h/∂u + 1 = 2cu + 1

Deriving the gradient formulas
To avoid writing the same sequence repeatedly, we can define an intermediate variable representing the sensitivity of the loss to the pre-output v:
δ = ∂L/∂v = 2(p − y) · (−1/(1 + v)²)

Using δ, the gradients for each parameter are:
- ∂J/∂d = δ
- ∂J/∂c = δ·h + λc
- ∂J/∂b = δ·(2cu + 1)
- ∂J/∂a = δ·(2cu + 1)·x + λa
Notice how the gradient for a combines the data gradient (which explicitly depends on the input x) and the regularization gradient.
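These four formulas can be collected into one backward-pass function (a sketch; it recomputes the forward quantities it needs rather than caching them):

```python
def gradients(a, b, c, d, x, y, lam):
    """Hand-derived gradients of J = L + R with respect to a, b, c, d."""
    # Forward quantities needed by the backward pass.
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    # δ = ∂L/∂v, the sensitivity of the data loss to v.
    delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
    dJ_dd = delta
    dJ_dc = delta * h + lam * c
    dJ_db = delta * (2 * c * u + 1)                # both u-branches summed: 2cu + 1
    dJ_da = delta * (2 * c * u + 1) * x + lam * a  # data path times x, plus weight decay
    return dJ_da, dJ_db, dJ_dc, dJ_dd

# Evaluated at the initial parameters with x = 2, y = 1, λ = 0.1:
g = gradients(1.0, 0.0, 1.0, 0.0, x=2.0, y=1.0, lam=0.1)
# ≈ (0.449854, 0.174927, 0.239942, 0.034985)
```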
Iteration 1: Setting the foundation
We start with initial parameters: a=1,b=0,c=1,d=0, and input x=2.
1. Forward Pass
- u = 1(2) + 0 = 2
- h = 2² = 4
- v = 1(4) + 2 + 0 = 6
- p = 1/(1 + 6) = 1/7 ≈ 0.142857
- L = (0.142857 − 1)² ≈ 0.734694, R = 0.05(1² + 1²) = 0.1
- J ≈ 0.8347
2. Backward Pass
Computing δ = 2(1/7 − 1)·(−1/49) = 12/343 gives δ ≈ 0.034985. We then calculate the parameter gradients:
- ∂J/∂d ≈ 0.034985
- ∂J/∂c ≈ 0.034985 × 4 + 0.1(1) = 0.239942
- ∂J/∂b ≈ 0.034985 × (2(1)(2) + 1) = 0.174927
- ∂J/∂a ≈ 0.174927 × 2 + 0.1(1) = 0.449854
3. Parameter Update
Using the gradient-descent rule θ ← θ − η∇J:
- a1≈1−0.1(0.449854)=0.955015
- b1≈0−0.1(0.174927)=−0.017493
- c1≈1−0.1(0.239942)=0.976006
- d1≈0−0.1(0.034985)=−0.003499
The first iteration behaves exactly as expected. The loss creates a signal, it flows backwards, and the network parameters adjust to make p slightly closer to y=1.
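Iteration 1 end to end (forward, backward, update) can be reproduced with a short script (a sketch combining the formulas derived earlier):

```python
ETA, LAM, X, Y = 0.1, 0.1, 2.0, 1.0  # learning rate, regularization, sample
a, b, c, d = 1.0, 0.0, 1.0, 0.0      # initial parameters

# Forward pass.
u = a * X + b
h = u ** 2
v = c * h + u + d
p = 1.0 / (1.0 + v)

# Backward pass.
delta = 2.0 * (p - Y) * (-1.0 / (1.0 + v) ** 2)
g_a = delta * (2 * c * u + 1) * X + LAM * a
g_b = delta * (2 * c * u + 1)
g_c = delta * h + LAM * c
g_d = delta

# Gradient-descent update: θ ← θ − η∇J.
a, b, c, d = a - ETA * g_a, b - ETA * g_b, c - ETA * g_c, d - ETA * g_d
# a ≈ 0.955015, b ≈ -0.017493, c ≈ 0.976006, d ≈ -0.003499
```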
Iteration 2: The zero-input scenario
Let us perform a second update, but this time assume the training loop feeds in a new data point where x=0.
1. Forward Pass
- u = 0.955015(0) − 0.017493 = −0.017493
- h≈0.000306
- v≈−0.02069
- p≈1.0211
- J≈0.0936
2. Backward Pass
Because p ≈ 1.021 now overshoots the target y = 1, the error (p − y) changes sign, and δ reverses sign with it: δ ≈ −0.044064.
- ∂J/∂d ≈ −0.044064
- ∂J/∂c ≈ 0.097587
- ∂J/∂b ≈ −0.042560
- ∂J/∂a = δ·(2cu + 1)·x + λa = ⋯·0 + 0.1(0.955015) ≈ 0.0955
Look closely at the gradient for a. Because x=0, the entire error signal flowing back from the loss function is multiplied by zero. The data provides no information about how to update a. However, the gradient is not zero. The regularization path λa is still active, gently pulling the parameter back toward the origin.
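This can be confirmed numerically: with x = 0, a central finite difference of J with respect to a returns exactly the weight-decay term λa (a sketch using the iteration-2 parameters quoted above):

```python
def objective(a, b, c, d, x, y, lam):
    """J = (p - y)^2 + (λ/2)(a^2 + c^2) for the article's graph."""
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    return (p - y) ** 2 + (lam / 2) * (a ** 2 + c ** 2)

# Iteration-2 parameters and the zero-input sample.
a, b, c, d = 0.955015, -0.017493, 0.976006, -0.003499
x, y, lam, eps = 0.0, 1.0, 0.1, 1e-6

# Central finite difference of J in the a-direction.
fd = (objective(a + eps, b, c, d, x, y, lam)
      - objective(a - eps, b, c, d, x, y, lam)) / (2 * eps)
# fd ≈ λ·a ≈ 0.0955: with x = 0 the data loss is flat in a,
# so only the regularization gradient survives.
```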
3. Parameter Update
- a2≈0.955015−0.1(0.0955)=0.945464
- b2≈−0.017493−0.1(−0.04256)=−0.013237
- c2≈0.976006−0.1(0.097587)=0.966247
- d2≈−0.003499−0.1(−0.04406)=0.000908
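Both updates can be reproduced with a tiny training loop (a sketch, assuming the samples arrive as (x, y) pairs, one gradient step per sample):

```python
def train(params, samples, lam=0.1, eta=0.1):
    """Run one gradient-descent step per (x, y) sample."""
    a, b, c, d = params
    for x, y in samples:
        # Forward pass.
        u = a * x + b
        h = u ** 2
        v = c * h + u + d
        p = 1.0 / (1.0 + v)
        # Backward pass.
        delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
        g_a = delta * (2 * c * u + 1) * x + lam * a
        g_b = delta * (2 * c * u + 1)
        g_c = delta * h + lam * c
        g_d = delta
        # Simultaneous update of all four parameters.
        a, b, c, d = a - eta * g_a, b - eta * g_b, c - eta * g_c, d - eta * g_d
    return a, b, c, d

# The article's two steps: (x=2, y=1) then (x=0, y=1).
a2, b2, c2, d2 = train((1.0, 0.0, 1.0, 0.0), [(2.0, 1.0), (0.0, 1.0)])
# ≈ (0.945464, -0.013237, 0.966247, 0.000908)
```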
The main takeaway
Stepping through the computation graph manually grounds two extremely important concepts in deep learning:
- Branch and merge means sum the gradients. Because u flowed into v through two separate operational paths, finding the gradient required summing the partial derivatives across both branches. This is the structural heart of backpropagation.
- Regularization acts as an independent force. The weight decay path is disconnected from the data. Even when a training sample completely zeroes out the data gradient for a parameter, the regularization gradient continues to pull the weight toward zero.
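As a final sanity check, the hand-derived gradients, including the summed u-branches, can be compared against central finite differences of J (a sketch; a disagreement here would indicate a missed branch in the chain rule):

```python
def objective(params, x, y, lam):
    """Total objective J = L + R for the article's graph."""
    a, b, c, d = params
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    return (p - y) ** 2 + (lam / 2) * (a ** 2 + c ** 2)

def analytic_grads(params, x, y, lam):
    """Gradients from the formulas derived by hand above."""
    a, b, c, d = params
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
    return [delta * (2 * c * u + 1) * x + lam * a,  # ∂J/∂a
            delta * (2 * c * u + 1),                # ∂J/∂b: both u-paths summed
            delta * h + lam * c,                    # ∂J/∂c
            delta]                                  # ∂J/∂d

params, x, y, lam, eps = [1.0, 0.0, 1.0, 0.0], 2.0, 1.0, 0.1, 1e-6
for i, g in enumerate(analytic_grads(params, x, y, lam)):
    hi, lo = params.copy(), params.copy()
    hi[i] += eps
    lo[i] -= eps
    fd = (objective(hi, x, y, lam) - objective(lo, x, y, lam)) / (2 * eps)
    assert abs(fd - g) < 1e-6  # analytic and numerical gradients agree
```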