
A Practical Guide to Deep Learning Optimizers: SGD to AdamW

Momentum, AdaGrad, RMSProp, Adam, AdamW — the math behind every optimizer you might pick.


Choosing an optimizer in deep learning is choosing a hypothesis about which directions in parameter space deserve big steps and which deserve small ones. SGD treats all directions equally. Momentum methods use history to smooth noisy gradients. Adaptive methods like AdaGrad, RMSProp, and Adam scale steps differently for different parameters. Understanding the math behind each one explains why they behave so differently in practice.

Vanilla SGD: the baseline

Stochastic gradient descent computes the gradient on a minibatch and takes a step in the negative gradient direction:

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_{\text{batch}}(\theta_t)

The learning rate η is the only hyperparameter. Step size is the same in every direction, and there is no memory of past gradients. SGD's simplicity is also its weakness: it struggles when the loss surface has very different curvatures in different directions, because the learning rate that works for one direction is wrong for another.

Two well-known SGD pathologies:

  • Zigzag in narrow valleys. When one direction is much steeper than the perpendicular direction, SGD bounces back and forth across the valley walls instead of walking down the floor.
  • Slow convergence in flat regions. Small gradients produce small steps, so progress stalls in basins where the loss surface is nearly flat.
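Both pathologies show up on a toy quadratic. The sketch below is illustrative, not a recipe: the curvatures and learning rate are hypothetical values chosen to exaggerate the effect, with one axis 100× steeper than the other.

```python
import numpy as np

# Plain SGD on an ill-conditioned quadratic L(θ) = 0.5 * θᵀ diag(a) θ.
# Hypothetical curvatures: axis 0 is 100x steeper than axis 1.
a = np.array([100.0, 1.0])
grad = lambda theta: a * theta          # ∇L = diag(a) θ

theta = np.array([1.0, 1.0])
eta = 0.015                             # near the stability limit for the steep axis
for _ in range(50):
    theta = theta - eta * grad(theta)

print(theta)   # steep axis zigzags its way down; flat axis is still far from 0
```

Any learning rate large enough to move the flat axis quickly would make the steep axis diverge, which is exactly the dilemma described above.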

Momentum: smoothing noisy gradients with history

Momentum maintains a running "velocity" vector that is an exponential moving average of past gradients:

v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t)
\theta_{t+1} = \theta_t - \eta v_{t+1}

The momentum coefficient β is typically 0.9. Each step is influenced by the long history of gradients, not just the current minibatch.

Two effects matter. First, in directions where gradients consistently point the same way, the velocity accumulates and steps grow larger. Second, in directions where gradients oscillate (zigzag), the oscillations partially cancel in the velocity, so the effective step size in that direction decreases. Both effects fix the SGD pathologies above.

A useful interpretation: momentum is like adding mass to the optimizer. A heavy ball rolling down a valley does not respond to every small bump — it averages over the recent gradient signal and keeps moving in the dominant direction. The update rule is precisely a discretization of a heavy ball moving under friction.
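On the same kind of ill-conditioned quadratic that defeats plain SGD, a velocity term tames both axes. A minimal sketch, with hypothetical curvatures:

```python
import numpy as np

# Heavy-ball momentum on an ill-conditioned quadratic (hypothetical curvatures).
a = np.array([100.0, 1.0])          # steep axis vs. flat axis
grad = lambda theta: a * theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.01, 0.9
for _ in range(200):
    v = beta * v + grad(theta)      # oscillating components cancel in v,
    theta = theta - eta * v         # consistent components accumulate

print(theta)                        # both axes end up near zero
```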

Nesterov momentum: a small correction with real benefits

Nesterov accelerated gradient (NAG) makes one change: compute the gradient at the "look-ahead" position rather than the current position:

v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t - \eta \beta v_t)
\theta_{t+1} = \theta_t - \eta v_{t+1}

The intuition: instead of taking a momentum step and then correcting, peek ahead to where momentum will take you, then compute the gradient there. This gives the optimizer a slight foresight, which can prevent overshoot and yields a provably faster convergence rate on smooth convex problems. In deep learning, the difference is usually small but consistent.
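In code, only one line changes relative to plain momentum: the gradient is evaluated at the look-ahead point. A sketch on a hypothetical toy quadratic:

```python
import numpy as np

# Nesterov momentum: gradient is taken at the look-ahead position θ - ηβv.
a = np.array([100.0, 1.0])               # hypothetical curvatures
grad = lambda theta: a * theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.01, 0.9
for _ in range(200):
    lookahead = theta - eta * beta * v   # where momentum is about to carry us
    v = beta * v + grad(lookahead)       # the only change vs. plain momentum
    theta = theta - eta * v

print(theta)
```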

AdaGrad: per-parameter learning rates

AdaGrad introduces a critical idea: different parameters should have different effective learning rates based on their gradient history. The update rule:

G_{t+1} = G_t + g_t \odot g_t
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_{t+1} + \epsilon}} \odot g_t

Here G_t accumulates the element-wise sum of squared gradients, ⊙ is element-wise multiplication, and ε (typically 10⁻⁸) prevents division by zero. The effective learning rate for each parameter is η / √(G_t + ε).

Parameters that have received large gradients are stepped slowly because their G is large. Parameters with small gradient history get larger steps. This is especially useful for sparse features in NLP — rare words get more aggressive updates than common ones.

AdaGrad's weakness is also structural: G_t only ever grows. Once a parameter has accumulated enough squared gradient, its effective learning rate becomes vanishingly small, and learning effectively stops for that parameter. For long deep learning training runs, this is fatal.
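Both the per-parameter scaling and the monotone decay fit in a few lines. In this sketch the gradient values are hypothetical — one frequently updated coordinate and one rare one — and the rare coordinate ends up with a roughly 100× larger effective step:

```python
import numpy as np

# AdaGrad: accumulate squared gradients, scale each parameter's step by 1/sqrt(G).
eta, eps = 0.1, 1e-8
theta = np.array([1.0, 1.0])
G = np.zeros(2)

for t in range(1000):
    g = np.array([1.0, 0.01])        # hypothetical: frequent vs. rare feature
    G += g * g                        # only ever grows -- AdaGrad's weakness
    theta -= eta / np.sqrt(G + eps) * g

effective_lr = eta / np.sqrt(G + eps)
print(effective_lr)                   # the rare feature keeps a ~100x larger step
```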

RMSProp: AdaGrad with forgetting

RMSProp fixes AdaGrad's monotonic accumulation by replacing the running sum with an exponential moving average:

E[g^2]_{t+1} = \rho E[g^2]_t + (1 - \rho)\, g_t \odot g_t
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_{t+1} + \epsilon}} \odot g_t

The decay rate ρ (typically 0.9) means the running estimate of g² reflects recent gradients rather than all-time history. Effective learning rates can grow back if gradients shrink, so learning never permanently stops.
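The forgetting behavior is easy to see numerically. In this hypothetical trace, gradients are large for 100 steps and then shrink; RMSProp's step scale recovers, while an AdaGrad-style running sum would stay pinned near zero:

```python
import numpy as np

# RMSProp's EMA of squared gradients vs. AdaGrad's running sum.
eta, rho, eps = 0.01, 0.9, 1e-8
Eg2, G = 0.0, 0.0

for t in range(200):
    g = 10.0 if t < 100 else 0.1          # hypothetical: gradients shrink halfway
    Eg2 = rho * Eg2 + (1 - rho) * g * g   # RMSProp: old history decays away
    G += g * g                             # AdaGrad: history is never forgotten

print(eta / np.sqrt(Eg2 + eps))            # recovered toward eta / 0.1
print(eta / np.sqrt(G + eps))              # still tiny, dominated by the old 10s
```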

RMSProp was the dominant adaptive method before Adam. Geoff Hinton introduced it in a Coursera lecture without ever publishing a formal paper, which is unusual but does not detract from its empirical success.

Adam: momentum + RMSProp + bias correction

Adam combines the two ideas: maintain both a first moment (running mean of gradients, like momentum) and a second moment (running mean of squared gradients, like RMSProp), then add a bias correction step.

First moment (momentum-like):

m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t

Second moment (RMSProp-like):

v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, g_t \odot g_t

Bias correction (the moment estimates start at 0 and are biased toward 0 in early steps):

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t

Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. These rarely need tuning. The learning rate η typically starts at 10⁻³ for fresh training or 10⁻⁴ for fine-tuning.

The bias correction is crucial early in training. Without it, m_t and v_t are biased toward zero in the first dozen iterations, producing small effective learning rates and a slow start. The correction factor 1/(1 − βᵗ) is large early and approaches 1 as t grows.
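Putting the four equations together, one Adam step is a few lines. This is a sketch, not the full optimizer (no parameter groups, no weight decay), and `adam_step` is a name chosen here for illustration:

```python
import numpy as np

# One Adam step, following the update rules above. t starts at 1.
def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g * g         # second moment (EMA of squared gradients)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([0.5]), m, v, t=1)
print(theta)   # the first step has magnitude ≈ eta, whatever the gradient's scale
```

The last comment is a consequence of the bias correction: at t = 1, m̂ equals g and v̂ equals g², so their ratio is ±1 and the step is exactly ±η (up to ε).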

AdamW: decoupled weight decay

A subtle but important fix to Adam is decoupling weight decay from the gradient. In plain Adam, weight decay is implemented by adding λθ to the gradient before the moment updates:

g_t \leftarrow g_t + \lambda \theta_t

This couples decay with the adaptive scaling. Parameters with historically large gradients have small effective learning rates, so they also receive small effective weight decay — defeating the purpose of regularization for the parameters that matter most.

AdamW applies weight decay directly to the parameters, after the Adam step:

\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)

AdamW typically generalizes better than Adam at the same nominal weight decay setting. For Transformer training and modern fine-tuning recipes, AdamW is the standard.
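The whole difference from Adam is where λθ enters the update. A minimal sketch of one AdamW-style step; the function name and default values are illustrative, not an official API:

```python
import numpy as np

# One AdamW step (sketch): the decay term wd * theta is applied outside the
# adaptive scaling, so every parameter is decayed at the same relative rate.
def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    # wd * theta is NOT divided by sqrt(v_hat) -- that is the decoupling.
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adamw_step(theta, np.array([0.5]), m, v, t=1)
print(theta)
```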

A concise comparison

Optimizer        Per-param LR   Momentum   Memory      Best for
SGD              No             No         1× params   Convex baselines
SGD + Momentum   No             Yes        2× params   CNN training (often best for ResNet/ImageNet)
AdaGrad          Yes            No         2× params   Sparse features (NLP, recommenders)
RMSProp          Yes            No         2× params   RNN training
Adam             Yes            Yes        3× params   General default
AdamW            Yes            Yes        3× params   Transformers, fine-tuning

When SGD with momentum still wins

Adam is the modern default, but SGD with momentum is not obsolete. Some patterns where it still wins:

  • ResNet / ImageNet training. The original ResNet recipe uses SGD with momentum 0.9, weight decay 10⁻⁴, and a step learning rate schedule. Adam often produces slightly worse final accuracy on this benchmark.
  • When test accuracy matters most. SGD's implicit regularization toward flat minima can produce models that generalize better, even when training loss is higher than Adam's.
  • When Adam's adaptive scaling hurts. Some convex problems and well-conditioned losses do not benefit from per-parameter learning rates and can be hurt by Adam's noisy second-moment estimates.

A practical heuristic: start with AdamW or Adam for fast iteration, but try SGD with momentum if final test performance matters and you have time to tune.

Using these optimizers in PyTorch

import torch.optim as optim

# Vanilla SGD
optimizer = optim.SGD(model.parameters(), lr=0.01)

# SGD with momentum + weight decay (ResNet-style recipe)
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)

# Adam (default for general deep learning)
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
)

# AdamW (the modern default for Transformers and fine-tuning)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

Note that optim.Adam's weight_decay parameter is not the AdamW formulation — it adds the L2 term to the gradient, which is the coupled version. To get true decoupled weight decay, use optim.AdamW.

The main takeaway

The progression from SGD to Adam is a series of small, principled additions: momentum to smooth noisy gradients, per-parameter rate scaling to handle different curvatures, exponential moving averages to forget old history, and bias correction to start training cleanly. Each addition responds to a specific weakness of the previous version.

In practice, two optimizers cover most situations. SGD with momentum remains a strong choice for image classification benchmarks where careful tuning matters and generalization is paramount. AdamW is the modern default for Transformer training, language models, and most fine-tuning scenarios. Adaptive methods like RMSProp and AdaGrad are mostly of historical interest at this point — Adam subsumes their benefits while fixing their weaknesses.

Understanding the math means you do not have to treat optimizer choice as guesswork. When training stalls, when convergence is too noisy, when the model overfits faster than expected, the equations above tell you what each knob is actually doing and why a different optimizer might help.
