MathIsimple
Deep Learning

Gradient Descent, Explained from First Principles


Imagine you're blindfolded and dropped into a valley. Your only goal is to reach the bottom.

You can't see a map. But you can feel the slope beneath your feet. Before each step, you probe the ground in every direction — north, south, east, west — and sense which way tilts downward the most. Then you step that way.

That process is the intuitive core of gradient descent. And the "gradient" is exactly that feeling under your feet — a mathematical tool that measures the slope of a function in every direction at once.

What a Gradient Actually Is

For a function with multiple input variables, the gradient is simply the vector you get when you line up all the partial derivatives side by side.

A partial derivative answers one focused question: if you hold every other input constant and nudge just one of them, how fast does the function's output change? Collect that answer for every input direction, stack them into a single vector, and you have the gradient.

That vector has two properties worth memorizing:

  • Direction — it points toward the steepest uphill route
  • Magnitude — it tells you how fast the function rises in that direction

To descend as quickly as possible, flip the sign. The negative gradient points straight downhill — the direction of steepest descent.

A Concrete Bowl-Shaped Valley

Let's work through the simplest possible example and compute everything from scratch.

Say the elevation of a bowl-shaped valley is given by:

f(x_1, x_2) = x_1^2 + x_2^2

  • x_1 — your east–west position
  • x_2 — your north–south position
  • f(x_1, x_2) — your elevation at that point

The valley floor sits at the origin (0, 0) with elevation zero. The further out you go, the higher the terrain.

Step 1 — Establish Your Starting Position

You're standing at \mathbf{x} = [2,\ 3]^\top. Your current elevation:

f(2, 3) = 2^2 + 3^2 = 4 + 9 = 13

Step 2 — Probe Each Direction (Compute Partial Derivatives)

East–west slope (partial derivative with respect to x_1):

Hold x_2 fixed and differentiate x_1^2 + x_2^2 with respect to x_1. The result is 2x_1. Evaluate at x_1 = 2:

\frac{\partial f}{\partial x_1}\bigg|_{(2,3)} = 2 \times 2 = 4

Moving one unit east raises your elevation by 4.

North–south slope (partial derivative with respect to x_2):

Hold x_1 fixed and differentiate with respect to x_2. The result is 2x_2. Evaluate at x_2 = 3:

\frac{\partial f}{\partial x_2}\bigg|_{(2,3)} = 2 \times 3 = 6

Moving one unit north raises your elevation by 6. The northbound slope is steeper.
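
These two slopes are easy to confirm numerically: probe each direction with a tiny central finite-difference step and compare against the analytic answers. A minimal sketch (the step size `h` is an arbitrary small choice):

```python
# Numerically verify the partial derivatives of f(x1, x2) = x1^2 + x2^2
# at the point (2, 3) via central finite differences.

def f(x1, x2):
    return x1**2 + x2**2

h = 1e-6  # tiny probe step

# Probe east-west: nudge x1 while holding x2 fixed
df_dx1 = (f(2 + h, 3) - f(2 - h, 3)) / (2 * h)

# Probe north-south: nudge x2 while holding x1 fixed
df_dx2 = (f(2, 3 + h) - f(2, 3 - h)) / (2 * h)

print(df_dx1, df_dx2)  # ≈ 4.0 and 6.0, matching the analytic results
```

This is exactly the blindfolded-hiker procedure from the opening analogy: feel the ground a tiny step away in each direction and record the tilt.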

Step 3 — Assemble the Gradient Vector

Stack the two partial derivatives into a single vector:

\nabla f(2, 3) = \begin{bmatrix} 4 \\ 6 \end{bmatrix}

This vector says: at position (2, 3), the fastest way uphill is to travel roughly northeast — 4 units east for every 6 units north. That's the direction of steepest ascent.

Step 4 — Descend Along the Negative Gradient

Flip the sign to point downhill instead:

-\nabla f(2, 3) = \begin{bmatrix} -4 \\ -6 \end{bmatrix}

Step in this direction from (2, 3) — west and south — and you're descending as fast as the terrain allows.
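
Repeating this step drives you to the valley floor. Here's a minimal sketch of the full descent loop on this bowl, assuming a hand-picked learning rate of 0.1:

```python
# Gradient descent on the bowl f(x1, x2) = x1^2 + x2^2, starting from (2, 3).
# The learning rate and iteration count are illustrative choices.

def grad(x1, x2):
    return 2 * x1, 2 * x2  # analytic gradient of the bowl

x1, x2 = 2.0, 3.0
lr = 0.1  # learning rate: how far to travel along the negative gradient

for _ in range(100):
    g1, g2 = grad(x1, x2)
    x1 -= lr * g1  # step opposite the gradient
    x2 -= lr * g2

print(x1, x2)  # both shrink toward 0: the valley floor
```

Each iteration multiplies both coordinates by 0.8, so the position decays geometrically toward the origin — the minimum we identified earlier.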

How This Maps onto Deep Learning

In machine learning, the "valley" is the loss function — a surface whose elevation represents how wrong your model's predictions are. Lower is better. The valley floor is the best possible model.

The catch: a real neural network's loss surface might have hundreds of millions of input dimensions — one per trainable parameter.

Training is a three-step loop, repeated thousands of times:

  1. Compute the gradient — evaluate the partial derivative of the loss with respect to every parameter simultaneously, forming a gradient vector of enormous size
  2. Step downhill — nudge every parameter slightly in the direction opposite to its partial derivative
  3. Repeat — keep iterating until the loss stops shrinking meaningfully

This algorithm is gradient descent. In practice, each iteration uses a small random batch of training examples rather than the full dataset — that variant is called stochastic gradient descent (SGD).
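
The three-step loop can be sketched in a few lines. The toy below fits a single weight w to data generated from y = 3x, using a "batch" of one randomly chosen example per step as a stand-in for SGD; every name and hyperparameter here is an illustrative choice:

```python
import random

# Toy stochastic gradient descent: fit y = w * x to data drawn from y = 3x,
# sampling one training example per update step.

data = [(x, 3.0 * x) for x in [-2.0, -1.0, 1.0, 2.0]]

w = 0.0
lr = 0.05  # learning rate
random.seed(0)

for _ in range(500):
    x, y = random.choice(data)       # random mini-batch of size one
    pred = w * x
    grad_w = 2 * (pred - y) * x      # d/dw of the squared error (pred - y)^2
    w -= lr * grad_w                 # step opposite the gradient

print(w)  # ≈ 3.0, the true slope
```

The noisy, one-example-at-a-time gradients still point downhill on average, which is why SGD converges despite never seeing the whole dataset in one step.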

One important detail: in deep networks, computing the gradient of the loss with respect to the parameters in early layers isn't trivial. The loss signal originates at the output and needs to travel backward through every layer. The math that makes this efficient — applying the chain rule layer by layer from output to input — is called backpropagation. It's the subject of the next article in this series.

Matrix Calculus: Gradient Rules at Scale

Once inputs are vectors or matrices instead of scalars, the same differentiation rules apply — just in higher-dimensional form. Here's the key correspondence worth knowing.

Linear Terms

The scalar rule \frac{d}{dx}(ax) = a generalizes to:

\nabla_{\mathbf{x}}\, \mathbf{A}\mathbf{x} = \mathbf{A}^\top
\nabla_{\mathbf{x}}\, \mathbf{x}^\top\mathbf{A} = \mathbf{A}

\mathbf{A} is a constant matrix; \mathbf{x} is the variable vector. Differentiating a linear expression eliminates the variable and leaves the coefficient — the transpose just keeps the dimensions consistent.
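
A quick numerical sanity check of the linear rule in its simplest scalar-valued case — f(x) = a·x, whose gradient is the constant vector a. The vectors below are arbitrary illustrative values:

```python
import numpy as np

# Finite-difference check: for f(x) = a . x with constant a, the gradient is a.

a = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 1.5, -1.0])

f = lambda v: a @ v

h = 1e-6
# Probe each coordinate direction with a central difference
num_grad = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(3)
])

print(np.allclose(num_grad, a))  # True: the coefficient is all that remains
```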

Quadratic Forms

The scalar rule \frac{d}{dx}(ax^2) = 2ax generalizes to:

\nabla_{\mathbf{x}}\, \mathbf{x}^\top\mathbf{A}\mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}

The expression \mathbf{x}^\top\mathbf{A}\mathbf{x} is a quadratic form — \mathbf{x} appears twice. When \mathbf{A} is symmetric (\mathbf{A} = \mathbf{A}^\top), this simplifies to 2\mathbf{A}\mathbf{x}, a perfect match for the scalar analog.
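
A finite-difference probe makes this rule easy to verify. The sketch below deliberately uses a non-symmetric A (an arbitrary illustrative choice) to show that the gradient really is (A + Aᵀ)x rather than 2Ax:

```python
import numpy as np

# Finite-difference check of the quadratic-form rule:
# f(x) = x^T A x has gradient (A + A^T) x.

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])  # non-symmetric on purpose
x = np.array([2.0, -1.0])

f = lambda v: v @ A @ v

h = 1e-6
num_grad = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(2)
])
analytic = (A + A.T) @ x

print(np.allclose(num_grad, analytic))  # True
```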

Squared Norms

The scalar rule \frac{d}{dx}(x^2) = 2x generalizes to:

\nabla_{\mathbf{x}}\, \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}}\, \mathbf{x}^\top\mathbf{x} = 2\mathbf{x}
\nabla_{\mathbf{X}}\, \|\mathbf{X}\|_F^2 = 2\mathbf{X}

\|\mathbf{x}\|^2 is the squared Euclidean length of a vector; \|\mathbf{X}\|_F^2 is the squared Frobenius norm — the sum of squares of all matrix entries. Both differentiate to exactly twice the input, just as in the scalar case.
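
Both rules can be checked the same way, probing each entry with a central finite difference. The values below are arbitrary illustrative choices:

```python
import numpy as np

# Finite-difference checks of the squared-norm rules.

h = 1e-6

# Vector case: gradient of ||x||^2 is 2x
x = np.array([1.0, -2.0, 0.5])
f = lambda v: v @ v
num_grad_x = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(3)
])
print(np.allclose(num_grad_x, 2 * x))  # True

# Matrix case: gradient of ||X||_F^2 (sum of squared entries) is 2X
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
g = lambda M: np.sum(M**2)
num_grad_X = np.zeros_like(X)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(X)
        E[i, j] = h  # perturb one entry at a time
        num_grad_X[i, j] = (g(X + E) - g(X - E)) / (2 * h)
print(np.allclose(num_grad_X, 2 * X))  # True
```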

The Takeaway

Gradient descent reduces to a single sentence: at your current position, take a small step in the direction opposite the gradient.

Whether you're navigating a two-dimensional bowl or a hundred-million-dimensional loss surface, that rule doesn't change. Compute the partial derivatives, form the gradient, negate it, update the parameters — and repeat until you've reached the valley floor.

Next up: backpropagation — how a deep network's error signal flows backward from the output layer all the way to the first layer, giving every parameter in the network a gradient it can act on.
