Many tutorials introduce backpropagation by diving straight into matrix calculus. But before memorizing the formulas for deep linear layers and cross-entropy, it helps to walk through a small, concrete computation graph by hand. Tracing a single forward and backward pass reveals the mechanics of branching gradients and regularization without losing them in the notation.
In this example, we will track four parameters a, b, c, d through a small artificial network across two parameter-update steps.
The computation graph and the constants
To keep the arithmetic clean, we use a simple scalar input and a set of elementary functions. The given constants for our training steps are:
- Input: x=2
- Target label: y=1
- Regularization coefficient: λ=0.1
- Learning rate: η=0.1
The forward pass is defined by the following sequence of operations:
- u = ax + b (linear transform)
- h = u² (nonlinear activation)
- v = ch + u + d (second linear branch)
- p = 1/(1 + v) (output function)
- L = (p − y)² (data loss)
- R = (λ/2)(a² + c²) (L2 regularization)
- J = L + R (total objective)

Notice the equation v = ch + u + d. The variable u flows into v through two paths: directly, and indirectly via h. This is a classic "branch and merge" in a computation graph. During backpropagation, the gradients along these two paths must be added together.
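Under these definitions, the forward pass can be sketched in a few lines of scalar Python (a minimal sketch; variable names mirror the article's symbols):

```python
def forward(a, b, c, d, x, y, lam):
    """One forward pass through the small computation graph."""
    u = a * x + b                       # linear transform
    h = u ** 2                          # nonlinear activation
    v = c * h + u + d                   # u enters twice: via h and directly
    p = 1.0 / (1.0 + v)                 # output function
    L = (p - y) ** 2                    # data loss
    R = (lam / 2) * (a ** 2 + c ** 2)   # L2 regularization
    return u, h, v, p, L, L + R         # J = L + R

# Constants from the article: x = 2, y = 1, λ = 0.1, initial a = c = 1, b = d = 0.
u, h, v, p, L, J = forward(1.0, 0.0, 1.0, 0.0, x=2.0, y=1.0, lam=0.1)
# u = 2, h = 4, v = 6, p = 1/7 ≈ 0.142857, J ≈ 0.8347
```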
Backpropagation: Mapping the paths
Before calculating derivatives, it is useful to map out the routes from the final objective J back to the parameters.
For parameter a, there are two sources of gradient:
- Data path: J→L→p→v→u→a
- Regularization path: J→R→a
For parameter c, we also have two sources:
- Data path: J→L→p→v→h→c
- Regularization path: J→R→c
The most critical detail is calculating how v depends on u. Because of the branching paths, the chain rule dictates that we sum the direct and indirect sensitivities:
∂v/∂u = ∂(ch + u + d)/∂u = c·∂h/∂u + 1 = 2cu + 1

Deriving the gradient formulas
To avoid writing the same sequence repeatedly, we can define an intermediate variable representing the sensitivity of the loss to the pre-output v:
δ = ∂L/∂v = 2(p − y) · (−1/(1 + v)²)

Using δ, the gradients for each parameter are:
- ∂J/∂d = δ
- ∂J/∂c = δ·h + λc
- ∂J/∂b = δ·(2cu + 1)
- ∂J/∂a = δ·(2cu + 1)·x + λa
Notice how the gradient for a combines the data gradient (which explicitly depends on the input x) and the regularization gradient.
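These four formulas can be collected into one backward-pass function (a sketch; it recomputes the forward quantities it needs rather than caching them):

```python
def gradients(a, b, c, d, x, y, lam):
    """Hand-derived gradients of J = L + R with respect to a, b, c, d."""
    # Forward quantities needed by the backward pass.
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    # δ = ∂L/∂v, the sensitivity of the data loss to v.
    delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
    dJ_dd = delta
    dJ_dc = delta * h + lam * c
    dJ_db = delta * (2 * c * u + 1)                # both u-branches summed: 2cu + 1
    dJ_da = delta * (2 * c * u + 1) * x + lam * a  # data path times x, plus weight decay
    return dJ_da, dJ_db, dJ_dc, dJ_dd

# Evaluated at the initial parameters with x = 2, y = 1, λ = 0.1:
g = gradients(1.0, 0.0, 1.0, 0.0, x=2.0, y=1.0, lam=0.1)
# ≈ (0.449854, 0.174927, 0.239942, 0.034985)
```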
Iteration 1: Setting the foundation
We start with initial parameters: a=1,b=0,c=1,d=0, and input x=2.
1. Forward Pass
- u = 1(2) + 0 = 2
- h = 2² = 4
- v = 1(4) + 2 + 0 = 6
- p = 1/(1 + 6) = 1/7 ≈ 0.142857
- L = (0.142857 − 1)² ≈ 0.734694, R = 0.05(1² + 1²) = 0.1
- J ≈ 0.8347
2. Backward Pass
Computing δ = 2(1/7 − 1)·(−1/49) = 12/343 gives δ ≈ 0.034985. We then calculate the parameter gradients:
- ∂J/∂d ≈ 0.034985
- ∂J/∂c ≈ 0.034985 × 4 + 0.1(1) = 0.239942
- ∂J/∂b ≈ 0.034985 × (2(1)(2) + 1) = 0.174927
- ∂J/∂a ≈ 0.174927 × 2 + 0.1(1) = 0.449854
3. Parameter Update
Using the gradient-descent rule θ ← θ − η∇J:
- a1≈1−0.1(0.449854)=0.955015
- b1≈0−0.1(0.174927)=−0.017493
- c1≈1−0.1(0.239942)=0.976006
- d1≈0−0.1(0.034985)=−0.003499
The first iteration behaves exactly as expected. The loss creates a signal, it flows backwards, and the network parameters adjust to make p slightly closer to y=1.
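Iteration 1 end to end (forward, backward, update) can be reproduced with a short script (a sketch combining the formulas derived earlier):

```python
ETA, LAM, X, Y = 0.1, 0.1, 2.0, 1.0  # learning rate, regularization, sample
a, b, c, d = 1.0, 0.0, 1.0, 0.0      # initial parameters

# Forward pass.
u = a * X + b
h = u ** 2
v = c * h + u + d
p = 1.0 / (1.0 + v)

# Backward pass.
delta = 2.0 * (p - Y) * (-1.0 / (1.0 + v) ** 2)
g_a = delta * (2 * c * u + 1) * X + LAM * a
g_b = delta * (2 * c * u + 1)
g_c = delta * h + LAM * c
g_d = delta

# Gradient-descent update: θ ← θ − η∇J.
a, b, c, d = a - ETA * g_a, b - ETA * g_b, c - ETA * g_c, d - ETA * g_d
# a ≈ 0.955015, b ≈ -0.017493, c ≈ 0.976006, d ≈ -0.003499
```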
Iteration 2: The zero-input scenario
Let us perform a second update, but this time assume the training loop feeds in a new data point where x=0.
1. Forward Pass
- u = 0.955015(0) − 0.017493 = −0.017493
- h≈0.000306
- v≈−0.02069
- p≈1.0211
- J≈0.0936
2. Backward Pass
Because p ≈ 1.021 now overshoots the target y = 1, the error (p − y) changes sign, and δ reverses sign with it: δ ≈ −0.044064.
- ∂J/∂d ≈ −0.044064
- ∂J/∂c ≈ 0.097587
- ∂J/∂b ≈ −0.042560
- ∂J/∂a = δ·(2cu + 1)·x + λa = ⋯·0 + 0.1(0.955015) ≈ 0.0955
Look closely at the gradient for a. Because x=0, the entire error signal flowing back from the loss function is multiplied by zero. The data provides no information about how to update a. However, the gradient is not zero. The regularization path λa is still active, gently pulling the parameter back toward the origin.
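This can be confirmed numerically: with x = 0, a central finite difference of J with respect to a returns exactly the weight-decay term λa (a sketch using the iteration-2 parameters quoted above):

```python
def objective(a, b, c, d, x, y, lam):
    """J = (p - y)^2 + (λ/2)(a^2 + c^2) for the article's graph."""
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    return (p - y) ** 2 + (lam / 2) * (a ** 2 + c ** 2)

# Iteration-2 parameters and the zero-input sample.
a, b, c, d = 0.955015, -0.017493, 0.976006, -0.003499
x, y, lam, eps = 0.0, 1.0, 0.1, 1e-6

# Central finite difference of J in the a-direction.
fd = (objective(a + eps, b, c, d, x, y, lam)
      - objective(a - eps, b, c, d, x, y, lam)) / (2 * eps)
# fd ≈ λ·a ≈ 0.0955: with x = 0 the data loss is flat in a,
# so only the regularization gradient survives.
```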
3. Parameter Update
- a2≈0.955015−0.1(0.0955)=0.945464
- b2≈−0.017493−0.1(−0.04256)=−0.013237
- c2≈0.976006−0.1(0.097587)=0.966247
- d2≈−0.003499−0.1(−0.04406)=0.000908
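Both updates can be reproduced with a tiny training loop (a sketch, assuming the samples arrive as (x, y) pairs, one gradient step per sample):

```python
def train(params, samples, lam=0.1, eta=0.1):
    """Run one gradient-descent step per (x, y) sample."""
    a, b, c, d = params
    for x, y in samples:
        # Forward pass.
        u = a * x + b
        h = u ** 2
        v = c * h + u + d
        p = 1.0 / (1.0 + v)
        # Backward pass.
        delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
        g_a = delta * (2 * c * u + 1) * x + lam * a
        g_b = delta * (2 * c * u + 1)
        g_c = delta * h + lam * c
        g_d = delta
        # Simultaneous update of all four parameters.
        a, b, c, d = a - eta * g_a, b - eta * g_b, c - eta * g_c, d - eta * g_d
    return a, b, c, d

# The article's two steps: (x=2, y=1) then (x=0, y=1).
a2, b2, c2, d2 = train((1.0, 0.0, 1.0, 0.0), [(2.0, 1.0), (0.0, 1.0)])
# ≈ (0.945464, -0.013237, 0.966247, 0.000908)
```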
The main takeaway
Stepping through the computation graph manually grounds two extremely important concepts in deep learning:
- Branch and merge means sum the gradients. Because u flowed into v through two separate operational paths, finding the gradient required summing the partial derivatives across both branches. This is the structural heart of backpropagation.
- Regularization acts as an independent force. The weight decay path is disconnected from the data. Even when a training sample completely zeroes out the data gradient for a parameter, the regularization gradient continues to pull the weight toward zero.
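As a final sanity check, the hand-derived gradients, including the summed u-branches, can be compared against central finite differences of J (a sketch; a disagreement here would indicate a missed branch in the chain rule):

```python
def objective(params, x, y, lam):
    """Total objective J = L + R for the article's graph."""
    a, b, c, d = params
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    return (p - y) ** 2 + (lam / 2) * (a ** 2 + c ** 2)

def analytic_grads(params, x, y, lam):
    """Gradients from the formulas derived by hand above."""
    a, b, c, d = params
    u = a * x + b
    h = u ** 2
    v = c * h + u + d
    p = 1.0 / (1.0 + v)
    delta = 2.0 * (p - y) * (-1.0 / (1.0 + v) ** 2)
    return [delta * (2 * c * u + 1) * x + lam * a,  # ∂J/∂a
            delta * (2 * c * u + 1),                # ∂J/∂b: both u-paths summed
            delta * h + lam * c,                    # ∂J/∂c
            delta]                                  # ∂J/∂d

params, x, y, lam, eps = [1.0, 0.0, 1.0, 0.0], 2.0, 1.0, 0.1, 1e-6
for i, g in enumerate(analytic_grads(params, x, y, lam)):
    hi, lo = params.copy(), params.copy()
    hi[i] += eps
    lo[i] -= eps
    fd = (objective(hi, x, y, lam) - objective(lo, x, y, lam)) / (2 * eps)
    assert abs(fd - g) < 1e-6  # analytic and numerical gradients agree
```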