
Learning Rate Schedules: Warmup, Cosine, Step Decay, and Plateau

Why a constant learning rate is almost never right, and which schedule fits which task.


A constant learning rate is almost never the right choice for deep learning. Early in training, the model needs large steps to escape its random initialization and reach a useful basin. Late in training, large steps overshoot the minimum and prevent convergence. Learning rate schedules manage this transition automatically — and they are often the difference between a model that reaches 95% accuracy and one that plateaus at 90%.

Why a constant learning rate fails

Pick a single learning rate and you have to compromise. A rate large enough for fast early progress is too large for fine convergence. A rate small enough for fine convergence is too small for early training, and the model takes forever to escape its initialization.

The empirical pattern across nearly every deep learning task is the same:

  • Phase 1 (Warmup): small steps to stabilize early training, especially with adaptive optimizers whose statistics are noisy in the first few hundred iterations.
  • Phase 2 (Main training): large steps to make rapid progress through the bulk of the loss landscape.
  • Phase 3 (Annealing): progressively smaller steps to converge precisely to a low-loss point.

Schedules implement this three-phase pattern. The differences between them come down to exactly how each phase is shaped.

Step decay: the classical schedule

Step decay multiplies the learning rate by a constant factor at predetermined epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Typical values: $\eta_0 = 0.1$, $\gamma = 0.1$, step size $s = 30$ epochs. The learning rate stays constant for 30 epochs, then drops by 10×, stays at the new value for 30 more epochs, and so on. ResNet's original ImageNet recipe uses this exact schedule with drops at epochs 30, 60, and 90.

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
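
As a quick sanity check, the schedule can also be evaluated in closed form. The snippet below assumes the ResNet-style settings above ($\eta_0 = 0.1$, $\gamma = 0.1$, $s = 30$):

eta_0, gamma, s = 0.1, 0.1, 30            # ResNet-style step-decay settings

for epoch in (0, 29, 30, 60, 90):
    lr = eta_0 * gamma ** (epoch // s)    # closed form of the schedule
    print(f"epoch {epoch:3d}: lr = {lr:g}")

# epoch   0: lr = 0.1
# epoch  29: lr = 0.1
# epoch  30: lr = 0.01
# epoch  60: lr = 0.001
# epoch  90: lr = 0.0001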

Step decay is simple and reliable but produces visible jumps in training loss at each step. Those jumps are not bugs — they reflect the optimizer suddenly taking smaller steps and finding a slightly lower point in the basin. Loss curves with step decay show characteristic stair-step patterns that signal the schedule is doing its job.

Exponential decay: smooth instead of stepped

Exponential decay multiplies the learning rate by a fixed factor every epoch:

$$\eta_t = \eta_0 \cdot \gamma^t$$

With $\gamma = 0.95$, the learning rate drops by 5% every epoch, reaching about 5% of its initial value after 60 epochs. This produces a smooth decay curve without the discontinuities of step decay.
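
PyTorch provides this schedule as ExponentialLR; a minimal sketch with $\gamma = 0.95$, reusing the training-loop helpers from the step-decay example:

from torch.optim.lr_scheduler import ExponentialLR

scheduler = ExponentialLR(optimizer, gamma=0.95)   # multiply the LR by 0.95 after every epoch

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()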

Exponential decay is rarely the best choice in practice — it decays too aggressively early and too gently late. Cosine annealing, described next, fixes both problems.

Cosine annealing: the modern default

Cosine annealing follows the shape of a cosine curve from $\eta_0$ down to $\eta_{\min}$ over $T_{\max}$ epochs:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min}) \left(1 + \cos\left(\frac{t \pi}{T_{\max}}\right)\right)$$

The shape is smooth, decays slowly at the start (preserving large steps where they matter), accelerates through the middle, and decays slowly at the end (allowing precise convergence). It produces visibly smoother loss curves than step decay, often with better final accuracy.

from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Cosine annealing has become the default for fine-tuning Transformers and large-scale image classification. Setting $T_{\max}$ equal to the total number of training epochs is the standard recipe, so the learning rate reaches its minimum at the last epoch.

Warmup: linear ramp-up at the start

Most modern recipes prepend a warmup phase: linearly increase the learning rate from a small value to $\eta_0$ over the first $T_{\text{warmup}}$ steps, then apply the main schedule.

$$\eta_t = \begin{cases} \eta_0 \cdot \frac{t}{T_{\text{warmup}}} & t \le T_{\text{warmup}} \\ \text{schedule}(t - T_{\text{warmup}}) & t > T_{\text{warmup}} \end{cases}$$

Warmup matters most for two reasons. First, large-batch training can produce very large initial gradients that destabilize training; warmup softens this. Second, adaptive optimizers like Adam have noisy second-moment estimates in the first few hundred iterations — the bias correction handles part of this, but warmup adds another layer of safety.

Typical warmup: 1000 steps for medium-sized models, 4000-10000 for very large Transformer training. After warmup, the learning rate transitions to whatever main schedule (cosine, step, linear) is being used.
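
One way to express this in PyTorch is to chain a linear ramp into the main schedule with SequentialLR. The sketch below assumes a 1,000-step warmup followed by cosine decay and a hypothetical train_one_batch helper; the scheduler is stepped once per optimizer update, not per epoch, and LinearLR starts from a small nonzero factor rather than exactly zero:

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup_steps = 1_000     # assumed warmup length
total_steps = 100_000    # assumed total number of optimizer updates

warmup = LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
main = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, main], milestones=[warmup_steps])

for step in range(total_steps):
    train_one_batch(model, train_loader, optimizer)   # hypothetical per-batch training call
    scheduler.step()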

Cosine with warmup: the Transformer recipe

Combining warmup and cosine annealing produces the schedule used in nearly every modern Transformer training recipe:

from torch.optim.lr_scheduler import LambdaLR
import math

warmup_steps = 4_000      # assumed warmup length; set to whatever your recipe calls for
total_steps = 100_000     # assumed total number of optimizer updates

def cosine_with_warmup(current_step):
    # Returns a multiplier applied to the optimizer's base (peak) learning rate
    if current_step < warmup_steps:
        return current_step / max(1, warmup_steps)            # linear ramp from 0 to 1
    progress = (current_step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay from 1 to 0

scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)  # step once per optimizer update

The shape: linear ramp from 0 to peak over the first warmup_steps updates, then cosine decay from peak to 0 over the remaining steps. BERT, GPT, T5, and most subsequent large-scale Transformer models use this schedule with minor variations.

ReduceLROnPlateau: adaptive based on validation loss

A different approach: instead of a predetermined schedule, monitor validation loss and reduce the learning rate when validation stops improving:

from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',         # we want validation loss to decrease
    factor=0.5,         # halve the LR when triggered
    patience=3,         # wait 3 epochs before reducing
    threshold=1e-4,     # consider an improvement only if it's at least this much
)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    scheduler.step(val_loss)  # note: pass the metric to step()

ReduceLROnPlateau is reactive rather than predetermined. It works well when training duration is uncertain or when different stages of training take different amounts of time. The downside is that it requires a validation set to monitor, and the patience parameter introduces a hyperparameter that interacts with how the loss curve actually evolves.

Combining ReduceLROnPlateau with cosine annealing is sometimes useful: cosine for the smooth main descent, plateau-based reduction for cases where the model is still making progress at the end of the cosine cycle.
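
A sketch of one such combination, under the assumption that a fixed cosine phase is followed by a plateau-controlled extension (the epoch counts and helper functions are illustrative):

from torch.optim.lr_scheduler import CosineAnnealingLR, ReduceLROnPlateau

cosine_epochs = 90       # planned length of the cosine phase (assumed)
extra_epochs = 30        # optional extension governed by the plateau scheduler (assumed)

cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5)
plateau = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

for epoch in range(cosine_epochs + extra_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    if epoch < cosine_epochs:
        cosine.step()               # smooth, predetermined descent
    else:
        plateau.step(val_loss)      # afterwards, cut the LR only when validation stalls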

Cyclical and restart schedules

Cyclical learning rates repeatedly oscillate between a minimum and a maximum value. The motivation: the high-LR phase can escape sharp local minima, while the low-LR phase fine-tunes precision.
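
PyTorch implements the triangular version as CyclicLR (designed for momentum-based optimizers such as SGD); a minimal sketch with illustrative bounds that would normally come from a range test:

from torch.optim.lr_scheduler import CyclicLR

scheduler = CyclicLR(
    optimizer,
    base_lr=1e-4,         # lower bound of each cycle
    max_lr=1e-2,          # upper bound of each cycle
    step_size_up=2000,    # iterations spent climbing from base_lr to max_lr
    mode='triangular',
)

for batch in train_loader:
    train_step(model, batch, optimizer)   # hypothetical per-batch training call
    scheduler.step()                      # CyclicLR is stepped once per batch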

Cosine annealing with warm restarts (SGDR) periodically resets the learning rate back to its peak and starts a new cosine cycle, often with a longer period each time:

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,        # length of first cycle
    T_mult=2,      # double cycle length each restart
    eta_min=1e-6,
)

Each restart can knock the optimizer out of a local minimum into a different basin, sometimes finding better generalization. SGDR pairs naturally with snapshot ensembling: train one model with restarts, save a snapshot at the end of each cycle, and combine the snapshots (typically by averaging their predictions).
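
A sketch of that snapshot usage, assuming the warm-restart scheduler above (with T_0=10 and T_mult=2, cycles end after epochs 10, 30, and 70):

import copy

snapshots = []
cycle_ends = {10, 30, 70}   # cycle boundaries for T_0=10, T_mult=2

for epoch in range(70):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
    if (epoch + 1) in cycle_ends:
        # The LR has just reached its minimum: save a snapshot for later ensembling
        snapshots.append(copy.deepcopy(model.state_dict()))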

How to pick a schedule

Three rules cover most situations:

For Transformer training and fine-tuning: linear warmup + cosine annealing. This is the modern default. Total steps and warmup ratio (typically 6-10% of total) are the only knobs to tune.

For CNN training (e.g., ResNet on ImageNet): step decay or cosine annealing both work. Step decay produces stair-step loss curves but is well-tested. Cosine annealing produces smoother curves and slightly better final accuracy in modern recipes.

For uncertain training duration or short experiments: ReduceLROnPlateau. The reactivity to validation loss handles cases where you do not know in advance how long training will take.

The 1-cycle policy: an aggressive variant

Leslie Smith's 1-cycle policy combines warmup, peak, and annealing into a single asymmetric cycle:

  • Phase A: linearly increase LR from $\eta_{\min}$ to $\eta_{\max}$ over the first 45% of training.
  • Phase B: linearly decrease LR from $\eta_{\max}$ to $\eta_{\min}$ over the next 45% of training.
  • Phase C: anneal LR from $\eta_{\min}$ down to $\eta_{\min}/100$ over the last 10%.

Combined with momentum that varies inversely (high when LR is low, low when LR is high), 1-cycle can dramatically accelerate training in some settings. It is more aggressive than cosine annealing and benefits from a learning rate range test (a quick scan to find safe upper and lower bounds) before training.
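
PyTorch's built-in OneCycleLR is a close relative; its defaults differ from the description above (a 30% ramp-up and cosine annealing), so the sketch below overrides them to approximate the 45/45/10 linear shape. The peak learning rate here is illustrative and would normally come from a range test:

from torch.optim.lr_scheduler import OneCycleLR

scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,                                   # assumed peak LR; find it with a range test
    total_steps=num_epochs * len(train_loader),   # one step per optimizer update
    pct_start=0.45,                               # 45% ramp up, 45% ramp down
    anneal_strategy='linear',
    three_phase=True,                             # adds the final low-LR annealing tail
)

for epoch in range(num_epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)   # hypothetical per-batch training call
        scheduler.step()                      # OneCycleLR is stepped once per batch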

The main takeaway

Learning rate schedules implement a simple observation: no single learning rate is right for the whole of training. Warmup stabilizes the early phase. The middle phase needs large steps to make rapid progress. The end phase needs small steps to converge precisely.

In practice, three schedules cover almost everything:

  • Linear warmup + cosine annealing for Transformer training and most fine-tuning. Smooth, well-tested, the modern default.
  • Step decay for classical CNN training following the ResNet recipe. Simple, reliable, and battle-tested on ImageNet.
  • ReduceLROnPlateau for shorter or experimental training where the duration is uncertain. Adapts to whatever loss curve actually appears.

The optimizer is one decision, the schedule is another, and they interact. Adam with cosine annealing behaves quite differently from SGD with step decay, even on the same model and dataset. Treating both as part of a single training recipe — chosen together, tuned together — is what mature deep learning practice looks like.
