A model that drives training error almost to zero can still be useless. If its accuracy collapses on new data, it did not learn the underlying pattern. It learned the training set too literally.
Regularization is the set of techniques we use to push the model away from that behavior. Two of the most common tools are weight decay and dropout. They act very differently, but both aim at the same target: better generalization.
## Training error is not the quantity we actually care about
Let $\mathcal{D}_{\text{train}}$ denote the training set and $\mathcal{D}_{\text{test}}$ a separate test distribution.

The training error $E_{\text{train}}$ is the model's average loss or misclassification rate on the training set. The generalization error $E_{\text{test}}$ is the same quantity on previously unseen data. Their gap matters:

$$\text{gap} = E_{\text{test}} - E_{\text{train}}.$$
When that gap is small, the model is transferring what it learned. When the gap is large, the model has started fitting noise, accidental quirks, or sample-specific structure.
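As a tiny concrete sketch (the helper name and the prediction values here are made up for illustration), the gap can be computed directly from predictions on held-out data:

```python
def error_rate(preds, labels):
    """Fraction of misclassified examples."""
    wrong = sum(p != y for p, y in zip(preds, labels))
    return wrong / len(labels)

# Hypothetical predictions from an overfit model: perfect on train, poor on test.
train_preds, train_labels = [0, 1, 1, 0], [0, 1, 1, 0]
test_preds,  test_labels  = [0, 1, 0, 0], [1, 1, 1, 0]

gap = error_rate(test_preds, test_labels) - error_rate(train_preds, train_labels)
# A gap this large is the signature of memorization rather than generalization.
```
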
## Overfitting and underfitting are bias-variance trade-offs in disguise
Underfitting happens when the model is too constrained to represent the signal in the data. Both training and test performance remain poor. Overfitting happens when the model is so flexible that it can drive training error down by memorizing peculiarities of the sample instead of the generative pattern.
In classical language, underfitting corresponds to high bias and lower variance. Overfitting corresponds to lower bias and higher variance. Regularization is one of the main ways we deliberately move the model back toward a better trade-off.
## Weight decay adds an L2 penalty to the objective
For parameters $w$, the weight-decayed objective is

$$\tilde{L}(w) = L(w) + \frac{\lambda}{2}\lVert w \rVert_2^2.$$

The extra term penalizes large parameter magnitudes. Here $\lambda$ controls the strength of the penalty. Larger $\lambda$ means stronger shrinkage.

Differentiate with respect to $w$ and the gradient becomes

$$\nabla_w \tilde{L}(w) = \nabla_w L(w) + \lambda w.$$

Under SGD with learning rate $\eta$, one update step is

$$w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla_w L(w).$$

That multiplicative factor $(1 - \eta\lambda)$ on the weights is why the method is called weight decay. Even before the data gradient acts, the weights are being pulled toward zero.
Intuitively, smaller weights often imply smoother functions and less sensitivity to small input changes. In a Bayesian interpretation, L2 regularization corresponds to a Gaussian prior centered at zero over the weights.
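The equivalence between the penalized-gradient view and the multiplicative-shrinkage view can be checked numerically. A minimal plain-Python sketch with assumed scalar values:

```python
# One SGD step on a single weight; the weight, gradient, learning rate,
# and lambda are all assumed values for illustration.
w, grad, eta, lam = 2.0, 0.5, 0.1, 0.01

# Form 1: step along the gradient of the L2-penalized objective.
w_penalized = w - eta * (grad + lam * w)

# Form 2: decay the weight first, then apply the data gradient.
w_decayed = (1 - eta * lam) * w - eta * grad

assert abs(w_penalized - w_decayed) < 1e-12  # identical under plain SGD
```

The two forms only coincide like this for vanilla SGD, which is exactly the point the AdamW discussion below turns on.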
## Dropout regularizes by injecting multiplicative noise
Dropout works at the activation level rather than the objective level. During training, each hidden activation is randomly masked.
In the usual inverted-dropout form, for activation $h$ and drop probability $p$,

$$\tilde{h} = \begin{cases} 0 & \text{with probability } p, \\[4pt] \dfrac{h}{1-p} & \text{with probability } 1 - p. \end{cases}$$

The scaling factor $\frac{1}{1-p}$ is chosen so the expectation stays unchanged:

$$\mathbb{E}[\tilde{h}] = p \cdot 0 + (1 - p) \cdot \frac{h}{1-p} = h.$$

That means test-time inference does not need any additional scaling. You simply turn dropout off with `model.eval()`.
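A quick Monte Carlo check of that expectation, written in plain Python (the function name and values are illustrative):

```python
import random

def inverted_dropout(h, p, rng):
    """Zero the activation with probability p, else scale it up by 1/(1-p)."""
    return 0.0 if rng.random() < p else h / (1 - p)

rng = random.Random(0)  # fixed seed so the check is reproducible
h, p, n = 2.0, 0.5, 100_000

avg = sum(inverted_dropout(h, p, rng) for _ in range(n)) / n
assert abs(avg - h) < 0.05  # empirical mean stays close to the original activation
```
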
## Why dropout often improves generalization
Without dropout, hidden units can develop brittle co-adaptations. One unit learns to rely on a small set of other units always being present. That can work extremely well on the training set and generalize poorly.
Dropout breaks that dependency pattern by making the surrounding context unreliable. Each unit has to learn features that remain useful even when some of its neighbors disappear.
There is also an ensemble interpretation. Every dropout mask defines a slightly different subnetwork. Training with dropout is therefore a noisy approximation to training and averaging many related subnetworks.
Weight decay changes what the model prefers. Dropout changes what the model must survive during training.
## A minimal PyTorch implementation
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, dropout_p=0.5):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.model(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(dropout_p=0.5).to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,
)
```

Two points matter here:
- `nn.Dropout` only masks activations during training mode.
- `AdamW` applies decoupled weight decay, which is usually preferable to mixing the penalty directly into Adam's adaptive moments.
That distinction matters. In plain SGD, the L2-penalized objective and the weight-decay update rule line up exactly. With Adam-style adaptive updates, the two ideas are no longer the same if you implement the penalty by simply adding to the gradient. AdamW keeps the shrinkage term separate from Adam's moving-average machinery, so the practical update behaves much more like true weight decay than a naive L2 penalty inside standard Adam.
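To make the difference concrete, here is a single-step sketch in plain Python: one scalar weight, the very first Adam step (fresh moment estimates), with all values assumed for illustration. It is not the full optimizer, just enough to show that the two updates diverge.

```python
import math

# Assumed scalar state for one Adam step at t = 1.
w, grad = 2.0, 0.5
lr, lam = 1e-3, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8

def adam_step(w, g, decoupled):
    """One Adam update; decay either enters the gradient (L2) or stays outside (AdamW-style)."""
    if not decoupled:
        g = g + lam * w          # L2 penalty folded into the adaptive moments
    m = (1 - beta1) * g          # first-moment estimate (m starts at 0)
    v = (1 - beta2) * g * g      # second-moment estimate (v starts at 0)
    m_hat = m / (1 - beta1)      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * lam * w    # AdamW-style: shrink the weight outside the moments
    return w_new

w_l2    = adam_step(w, grad, decoupled=False)
w_adamw = adam_step(w, grad, decoupled=True)
assert abs(w_l2 - w_adamw) > 1e-6  # the two updates genuinely differ
```

Notice that in the L2 case the penalty term gets divided by the same second-moment estimate as the data gradient, so its effective strength is rescaled away; the decoupled version shrinks the weight by a fixed proportion regardless of the adaptive scaling.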
## Hyperparameters still have to be tuned
Neither regularizer is magic. If $\lambda$ is too small, weight decay does very little. If it is too large, the model becomes oversmoothed and underfits. If the dropout probability $p$ is too small, co-adaptation remains. If it is too large, the signal becomes too noisy to learn efficiently.
That is why you usually choose these values through validation or cross-validation rather than by theory alone. The mathematics explains the direction of the effect, not the perfect number for every dataset.
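As a toy sketch of that selection loop (one-parameter ridge regression with a closed-form solution; the data and candidate values are invented), sweep a few candidate strengths and keep the one with the lowest validation loss:

```python
# Toy model y = w * x fit with an L2 penalty on w; the scalar minimizer
# has a closed form, so the effect of lam on validation loss is easy to see.
train = [(1.0, 2.6), (2.0, 3.8), (3.0, 6.9)]  # noisy samples of roughly y = 2x
val   = [(1.5, 3.0), (2.5, 5.0)]              # cleaner held-out points

def fit_ridge(data, lam):
    """Minimize sum (y - w*x)^2 + lam * w^2 over the scalar w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / (sxx + lam)

def val_loss(w, data):
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: val_loss(fit_ridge(train, lam), val))
```

On this particular noisy sample the unpenalized fit overshoots, so a nonzero penalty wins on validation, which is exactly the behavior the theory predicts without telling you the winning number in advance.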
## The main takeaway
Overfitting is not just “the model did too well on training data.” It is the specific failure mode where low training error does not transfer to unseen data.
- Weight decay combats overfitting by penalizing large weights and shrinking the parameter vector.
- Dropout combats overfitting by randomly masking hidden units and discouraging fragile feature co-adaptation.
- Both methods work by lowering effective model complexity, but they do so through different mechanisms.
If you keep that distinction in mind, the implementation choices in PyTorch stop looking like recipes and start looking like direct consequences of the mathematics.