A model that drives training error almost to zero can still be useless. If its accuracy collapses on new data, it did not learn the underlying pattern. It learned the training set too literally.
Regularization is the set of techniques we use to push the model away from that behavior. Two of the most common tools are weight decay and dropout. They act very differently, but both aim at the same target: better generalization.
## Training error is not the quantity we actually care about
Let $\mathcal{D}_{\text{train}}$ denote the training set and $\mathcal{D}_{\text{test}}$ a separate test distribution.

The training error $E_{\text{train}}$ is the model's average loss or misclassification rate on the training set. The generalization error $E_{\text{test}}$ is the same quantity on previously unseen data. Their gap matters:

$$\text{gap} = E_{\text{test}} - E_{\text{train}}.$$
When that gap is small, the model is transferring what it learned. When the gap is large, the model has started fitting noise, accidental quirks, or sample-specific structure.
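As a tiny concrete sketch (the helper name and the prediction values here are made up for illustration), the gap can be computed directly from predictions on held-out data:

```python
def error_rate(preds, labels):
    """Fraction of misclassified examples."""
    wrong = sum(p != y for p, y in zip(preds, labels))
    return wrong / len(labels)

# Hypothetical predictions from an overfit model: perfect on train, poor on test.
train_preds, train_labels = [0, 1, 1, 0], [0, 1, 1, 0]
test_preds,  test_labels  = [0, 1, 0, 0], [1, 1, 1, 0]

gap = error_rate(test_preds, test_labels) - error_rate(train_preds, train_labels)
# A gap this large is the signature of memorization rather than generalization.
```
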
## Overfitting and underfitting are bias-variance trade-offs in disguise
Underfitting happens when the model is too constrained to represent the signal in the data. Both training and test performance remain poor. Overfitting happens when the model is so flexible that it can drive training error down by memorizing peculiarities of the sample instead of the generative pattern.
In classical language, underfitting corresponds to high bias and lower variance. Overfitting corresponds to lower bias and higher variance. Regularization is one of the main ways we deliberately move the model back toward a better trade-off.
## Weight decay adds an L2 penalty to the objective
For parameters $w$, the weight-decayed objective is

$$\tilde{L}(w) = L(w) + \frac{\lambda}{2}\lVert w \rVert_2^2.$$

The extra term penalizes large parameter magnitudes. Here $\lambda$ controls the strength of the penalty. Larger $\lambda$ means stronger shrinkage.

Differentiate with respect to $w$ and the gradient becomes

$$\nabla_w \tilde{L}(w) = \nabla_w L(w) + \lambda w.$$

Under SGD with learning rate $\eta$, one update step is

$$w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla_w L(w).$$

That multiplicative factor $(1 - \eta\lambda)$ on the weights is why the method is called weight decay. Even before the data gradient acts, the weights are being pulled toward zero.
Intuitively, smaller weights often imply smoother functions and less sensitivity to small input changes. In a Bayesian interpretation, L2 regularization corresponds to a Gaussian prior centered at zero over the weights.
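The equivalence between the penalized-gradient view and the multiplicative-shrinkage view can be checked numerically. A minimal plain-Python sketch with assumed scalar values:

```python
# One SGD step on a single weight; the weight, gradient, learning rate,
# and lambda are all assumed values for illustration.
w, grad, eta, lam = 2.0, 0.5, 0.1, 0.01

# Form 1: step along the gradient of the L2-penalized objective.
w_penalized = w - eta * (grad + lam * w)

# Form 2: decay the weight first, then apply the data gradient.
w_decayed = (1 - eta * lam) * w - eta * grad

assert abs(w_penalized - w_decayed) < 1e-12  # identical under plain SGD
```

The two forms only coincide like this for vanilla SGD, which is exactly the point the AdamW discussion below turns on.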
## Dropout regularizes by injecting multiplicative noise
Dropout works at the activation level rather than the objective level. During training, each hidden activation is randomly masked.
In the usual inverted-dropout form, for activation $h$ and drop probability $p$,

$$\tilde{h} = \begin{cases} 0 & \text{with probability } p, \\[4pt] \dfrac{h}{1-p} & \text{with probability } 1 - p. \end{cases}$$

The scaling factor $\frac{1}{1-p}$ is chosen so the expectation stays unchanged:

$$\mathbb{E}[\tilde{h}] = p \cdot 0 + (1 - p) \cdot \frac{h}{1-p} = h.$$

That means test-time inference does not need any additional scaling. You simply turn dropout off with `model.eval()`.
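A quick Monte Carlo check of that expectation, written in plain Python (the function name and values are illustrative):

```python
import random

def inverted_dropout(h, p, rng):
    """Zero the activation with probability p, else scale it up by 1/(1-p)."""
    return 0.0 if rng.random() < p else h / (1 - p)

rng = random.Random(0)  # fixed seed so the check is reproducible
h, p, n = 2.0, 0.5, 100_000

avg = sum(inverted_dropout(h, p, rng) for _ in range(n)) / n
assert abs(avg - h) < 0.05  # empirical mean stays close to the original activation
```
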
## Why dropout often improves generalization
Without dropout, hidden units can develop brittle co-adaptations. One unit learns to rely on a small set of other units always being present. That can work extremely well on the training set and generalize poorly.
Dropout breaks that dependency pattern by making the surrounding context unreliable. Each unit has to learn features that remain useful even when some of its neighbors disappear.
There is also an ensemble interpretation. Every dropout mask defines a slightly different subnetwork. Training with dropout is therefore a noisy approximation to training and averaging many related subnetworks.
Weight decay changes what the model prefers. Dropout changes what the model must survive during training.
## A minimal PyTorch implementation
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, dropout_p=0.5):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.model(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(dropout_p=0.5).to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,
)
```

Two points matter here:
- `nn.Dropout` only masks activations during training mode.
- `AdamW` applies decoupled weight decay, which is usually preferable to mixing the penalty directly into Adam's adaptive moments.
That distinction matters. In plain SGD, the L2-penalized objective and the weight-decay update rule line up exactly. With Adam-style adaptive updates, the two ideas are no longer the same if you implement the penalty by simply adding to the gradient. AdamW keeps the shrinkage term separate from Adam's moving-average machinery, so the practical update behaves much more like true weight decay than a naive L2 penalty inside standard Adam.
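To make the difference concrete, here is a single-step sketch in plain Python: one scalar weight, the very first Adam step (fresh moment estimates), with all values assumed for illustration. It is not the full optimizer, just enough to show that the two updates diverge.

```python
import math

# Assumed scalar state for one Adam step at t = 1.
w, grad = 2.0, 0.5
lr, lam = 1e-3, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8

def adam_step(w, g, decoupled):
    """One Adam update; decay either enters the gradient (L2) or stays outside (AdamW-style)."""
    if not decoupled:
        g = g + lam * w          # L2 penalty folded into the adaptive moments
    m = (1 - beta1) * g          # first-moment estimate (m starts at 0)
    v = (1 - beta2) * g * g      # second-moment estimate (v starts at 0)
    m_hat = m / (1 - beta1)      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * lam * w    # AdamW-style: shrink the weight outside the moments
    return w_new

w_l2    = adam_step(w, grad, decoupled=False)
w_adamw = adam_step(w, grad, decoupled=True)
assert abs(w_l2 - w_adamw) > 1e-6  # the two updates genuinely differ
```

Notice that in the L2 case the penalty term gets divided by the same second-moment estimate as the data gradient, so its effective strength is rescaled away; the decoupled version shrinks the weight by a fixed proportion regardless of the adaptive scaling.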
## Hyperparameters still have to be tuned
Neither regularizer is magic. If $\lambda$ is too small, weight decay does very little. If it is too large, the model becomes oversmoothed and underfits. If the dropout probability $p$ is too small, co-adaptation remains. If it is too large, the signal becomes too noisy to learn efficiently.
That is why you usually choose these values through validation or cross-validation rather than by theory alone. The mathematics explains the direction of the effect, not the perfect number for every dataset.
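As a toy sketch of that selection loop (one-parameter ridge regression with a closed-form solution; the data and candidate values are invented), sweep a few candidate strengths and keep the one with the lowest validation loss:

```python
# Toy model y = w * x fit with an L2 penalty on w; the scalar minimizer
# has a closed form, so the effect of lam on validation loss is easy to see.
train = [(1.0, 2.6), (2.0, 3.8), (3.0, 6.9)]  # noisy samples of roughly y = 2x
val   = [(1.5, 3.0), (2.5, 5.0)]              # cleaner held-out points

def fit_ridge(data, lam):
    """Minimize sum (y - w*x)^2 + lam * w^2 over the scalar w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / (sxx + lam)

def val_loss(w, data):
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: val_loss(fit_ridge(train, lam), val))
```

On this particular noisy sample the unpenalized fit overshoots, so a nonzero penalty wins on validation, which is exactly the behavior the theory predicts without telling you the winning number in advance.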
## The main takeaway
Overfitting is not just “the model did too well on training data.” It is the specific failure mode where low training error does not transfer to unseen data.
- Weight decay combats overfitting by penalizing large weights and shrinking the parameter vector.
- Dropout combats overfitting by randomly masking hidden units and discouraging fragile feature co-adaptation.
- Both methods work by lowering effective model complexity, but they do so through different mechanisms.
If you keep that distinction in mind, the implementation choices in PyTorch stop looking like recipes and start looking like direct consequences of the mathematics.