
L1 Regularization and Sparsity: Why It Auto-Deletes Features

The taxation analogy that explains LASSO's secret weapon

2026-01-24
L1 Regularization
Sparsity
LASSO
Feature Selection
Regularization

The "Lazy" Way to Select Features While Training

Ever faced this dilemma? You've got 100 features, but no clue which ones actually matter and which are just noise.

Tried filter methods? You evaluate features separately, then train — two-step hassle. Wrapper methods? Every feature combo tweak means retraining the entire model. Painfully slow.

What if there's a "lazy" approach — one that automatically kicks out useless features during training?

There is. It's called embedded feature selection. And its secret weapon? L1 regularization.

This article tackles one crucial question: Why does L1 regularization produce sparse solutions (coefficients equal to zero), while L2 regularization doesn't? Nail this, and you've unlocked the essence of LASSO regression, Elastic Net, and a whole class of algorithms.

Embedded Selection: Training and Feature Selection in One Shot

Let's clarify what we mean. Suppose you're training a "good watermelon classifier" with 4 features:

  • Stem (fresh/withered)
  • Sound (crisp/dull)
  • Color (red/green)
  • Texture (clear/blurry)

Three feature selection approaches:

Filter: Evaluate each feature's contribution separately, pick "Stem + Sound," then train.

Wrapper: Try random feature combos, retrain, adjust based on results. Rinse and repeat.

Embedded: Bake "feature selection" into training — the model judges which features are useless and shrinks their influence to zero, effectively auto-excluding them.

After training, you might get a formula like this:

$$\text{Good Melon Score} = 0.8 \times \text{Stem} + 0.7 \times \text{Sound} + 0 \times \text{Color} + 0 \times \text{Texture}$$

Color and Texture coefficients = 0. The model completely ignored those two features. That's embedded selection: no separate feature-picking step, it happened automatically during training.

So how does the model know which coefficients to zero out?

Answer: L1 regularization.

Why L1 Produces Sparse Solutions

Embedded selection "kicks features" by adding a constraint to the training objective — regularization.

L1 regularization can drive some coefficients to exactly zero (sparse solution). L2 can only shrink coefficients, never zero them out. Why?

Let's use a "taxation" analogy.

1. The Model's Dual Goal: Accurate Yet Simple

Model training balances two objectives:

Objective 1: Predict accurately (low loss)
The score calculated from "Stem + Sound" should closely match the true "good/bad melon" labels.

Objective 2: Keep it simple (few features)
Don't use too many useless features (like Color, Texture), or the model gets complex and overfits.

Regularization enforces "Objective 2." L1 and L2 both add constraints, but their methods differ drastically, leading to vastly different outcomes.
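
In symbols: writing $L(w)$ for the prediction loss and $\lambda$ for the regularization strength (two symbols that don't appear in the analogy itself), the two regularized training objectives are

$$\min_{w} \; L(w) + \lambda \sum_{j} |w_j| \quad \text{(L1)} \qquad\qquad \min_{w} \; L(w) + \lambda \sum_{j} w_j^2 \quad \text{(L2)}$$

The larger $\lambda$ is, the heavier the "tax" described below and the stronger the push toward Objective 2.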

2. L1 Regularization: "Tax by Absolute Value"

Think of feature coefficients as "each feature's budget." L1 regularization taxes the absolute value of budgets:

$$\text{Total Tax} = |w_{\text{Stem}}| + |w_{\text{Sound}}| + |w_{\text{Color}}| + |w_{\text{Texture}}|$$

To pay less tax, the model has two options:

Option A: Shrink all coefficients a bit
E.g., Stem from 0.8 to 0.7, Sound from 0.7 to 0.6.
Problem: This hurts prediction accuracy (features lose influence).

Option B: Zero out useless features
E.g., Color and Texture from 0.2 to 0.
Benefit: Total tax drops by 0.2 + 0.2 = 0.4, and prediction accuracy doesn't suffer (since those features were useless anyway).

So L1 forces the model to delete features: useless feature coefficients go straight to zero, while useful ones remain intact. That's a sparse solution.
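
A quick way to check the arithmetic is to compute the L1 tax for both options directly. The sketch below just replays the hypothetical watermelon coefficients from the text; the helper function is illustrative, not from any library:

```python
# Hypothetical coefficients from the watermelon example above
original = {"Stem": 0.8, "Sound": 0.7, "Color": 0.2, "Texture": 0.2}
option_a = {"Stem": 0.7, "Sound": 0.6, "Color": 0.2, "Texture": 0.2}  # shrink everything a bit
option_b = {"Stem": 0.8, "Sound": 0.7, "Color": 0.0, "Texture": 0.0}  # zero out the useless features

def l1_tax(weights):
    """L1 'tax': the sum of absolute coefficient values."""
    return sum(abs(w) for w in weights.values())

print(round(l1_tax(original), 2))  # 1.9
print(round(l1_tax(option_a), 2))  # 1.7 -- cheaper, but prediction accuracy suffers
print(round(l1_tax(option_b), 2))  # 1.5 -- tax drops by 0.4 and accuracy is untouched
```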

3. L2 Regularization: "Tax by Squared Value"

L2 regularization taxes the square of coefficients:

$$\text{Total Tax} = w_{\text{Stem}}^2 + w_{\text{Sound}}^2 + w_{\text{Color}}^2 + w_{\text{Texture}}^2$$

Say Color's coefficient is 0.2. Squared: 0.04. Drop it to 0.1, squared: 0.01. Tax saved: 0.03.

Now drop further to 0.01. Squared: 0.0001. Tax saved: only 0.0099. Diminishing returns.

In this scenario, the model prefers to "shrink all useless feature coefficients" rather than "zero them out completely" — because even at 0.01, the squared tax is tiny. Setting it to 0 only saves another 0.0001 in tax. Not worth the loss.

So L2 merely makes useless feature coefficients very small (like 0.001) but never zero. Features stay in the model, just with minimal influence. Not a sparse solution.
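
The same asymmetry shows up in a one-coefficient calculation. As a minimal sketch, assume the loss around a single coefficient is the simple quadratic $\tfrac{1}{2}(w - z)^2$ for some target value $z$ (a setup not used elsewhere in this article). The L1-penalized minimizer snaps small coefficients to exactly zero, while the L2-penalized one only rescales them:

```python
import numpy as np

def l1_solution(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*|w|: soft-thresholding.
    # Any |z| <= lam lands exactly on zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_solution(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*w**2: pure shrinkage.
    # Nonzero inputs stay nonzero, only smaller.
    return z / (1.0 + 2.0 * lam)

z = np.array([0.8, 0.7, 0.2, 0.2])  # "useful, useful, useless, useless" (hypothetical values)
print(l1_solution(z, 0.3))          # [0.5 0.4 0.  0. ]         -> exact zeros appear
print(l2_solution(z, 0.3))          # [0.5 0.4375 0.125 0.125]  -> small but never zero
```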

4. Geometric View: Diamond vs Circle

Suppose the model has two features: Stem (coefficient $w_1$) and Color (coefficient $w_2$). Training seeks the $w_1$ and $w_2$ that balance "accuracy" and "low tax."

Geometrically, this translates to finding the point where the loss contours (ellipses; the closer to their center, the better the prediction) first touch the constraint region defined by the regularizer.

L1's regularization contour is a diamond:

$$|w_1| + |w_2| = \text{constant}$$

Diamonds have sharp corners, located on the axes (e.g., $w_1 = 1, w_2 = 0$). The ellipse easily hits a corner, meaning one coefficient goes to zero: a sparse solution.

L2's regularization contour is a circle:

$$w_1^2 + w_2^2 = \text{constant}$$

Circles are smooth, with no corners. When the ellipse meets the circle, it rarely touches an axis (instead landing somewhere like $w_1 = 0.8, w_2 = 0.6$), so both coefficients stay nonzero: not sparse.

Intuition

The diamond's sharp corners "actively push" the intersection toward axes (coefficient = 0). The circle lacks this structure; intersections rarely land on axes.
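
If you want to draw this picture yourself, here's a small matplotlib sketch. The loss ellipse's center and shape are made-up choices for illustration; only the diamond and circle come from the formulas above:

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = (w1 - 1.2) ** 2 + 3 * (w2 - 0.8) ** 2  # hypothetical loss contours (an off-center ellipse)

fig, (ax_l1, ax_l2) = plt.subplots(1, 2, figsize=(10, 5))
for ax, title in [(ax_l1, "L1: |w1| + |w2| = c"), (ax_l2, "L2: w1^2 + w2^2 = c")]:
    ax.contour(w1, w2, loss, levels=8, colors="gray")  # loss ellipses
    ax.axhline(0, lw=0.5, color="black")
    ax.axvline(0, lw=0.5, color="black")
    ax.set_title(title)
    ax.set_aspect("equal")

ax_l1.contour(w1, w2, np.abs(w1) + np.abs(w2), levels=[1.0], colors="red")  # diamond constraint
ax_l2.contour(w1, w2, w1 ** 2 + w2 ** 2, levels=[1.0], colors="blue")       # circle constraint
plt.show()
```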

Real Example: LASSO Regression's Sparse Effect

LASSO (Least Absolute Shrinkage and Selection Operator) = Linear Regression + L1 Regularization.

Suppose 10 features predict house price. After LASSO training:

$$\text{Price} = 50000 \times \text{Location} + 30000 \times \text{Area} + 0 \times \text{Floor} + 0 \times \text{Orientation} + \ldots + 0 \times \text{Guard Height}$$

Out of 10 features, only 2 have nonzero coefficients (Location, Area). The other 8 are all zero.

This means:

  1. Automatic feature selection: LASSO auto-identified the 2 most important features
  2. Simplified model: Final model uses only 2 features, easy to interpret
  3. Avoid overfitting: Fewer redundant features, stronger generalization

If you used ordinary linear regression (no regularization) or Ridge (L2), all 10 features would have coefficients. Some might be tiny (like 0.001), but none would be zero.
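
Here is a hypothetical version of that comparison in scikit-learn (the synthetic data, the true coefficients 5 and 3, and the alpha value are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 10 stand-in "house" features
true_w = np.array([5.0, 3.0] + [0.0] * 8)   # only the first two actually matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ols = LinearRegression().fit(X, y)

print("LASSO:", np.round(lasso.coef_, 3))   # the 8 useless features come out exactly 0
print("OLS:  ", np.round(ols.coef_, 3))     # tiny but nonzero weights on every feature
```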

L1 vs L2: Summary Table

| Aspect | L1 Regularization | L2 Regularization |
| --- | --- | --- |
| Tax Method | Absolute value: \|w₁\| + \|w₂\| | Squared: w₁² + w₂² |
| Coefficient Behavior | Some go to exactly 0 | All shrink, but never to 0 |
| Geometric Shape | Diamond (sharp corners) | Circle (smooth) |
| Sparsity | Produces sparse solutions | Cannot produce sparse solutions |
| Feature Selection | Auto-removes useless features | Only reduces feature weights |
| Typical Use | LASSO, Elastic Net | Ridge Regression |

Why Sparsity Matters

1. Improves Interpretability

Sparse models use few features, easy to explain to stakeholders:

"House price mainly depends on location and area" (2 features)
vs
"House price depends on location, area, floor, orientation, renovation, complex, school district, subway, greenery, parking — all combined" (10 features)

Former: crystal clear. Latter: says nothing useful.

2. Avoids Overfitting

Fewer features = simpler model = less likely to memorize training noise = better generalization.

3. Speeds Up Inference

Prediction requires computing fewer features, so it's faster. For mobile deployment, going from 10 features to 2 cuts the per-prediction computation of a linear model by roughly 5x.

Real-World Applications

1. High-Dimensional Sparse Data: Text Classification

Problem: Text classification vocabularies may have 10,000 words, but each document uses only dozens.

Solution: Use LASSO or L1-regularized logistic regression to auto-select a few hundred key words (like "winner," "free" for spam), drop 9,000+ irrelevant ones.

Result: Faster, more accurate, more interpretable model.
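
A toy sketch of that idea with scikit-learn (the four documents, the labels, and C=10.0 are invented; a real corpus would have thousands of documents and words):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["free prize winner click now", "meeting agenda attached",
        "winner winner free gift card", "project status update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
clf.fit(docs, labels)

vocab = clf.named_steps["tfidfvectorizer"].get_feature_names_out()
weights = clf.named_steps["logisticregression"].coef_[0]
kept = {word: round(w, 2) for word, w in zip(vocab, weights) if w != 0}
print(kept)  # typically only a few words keep a nonzero weight
```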

2. Gene Expression Analysis

Problem: Gene chips have 20,000 gene features but only 100 samples (features vastly outnumber samples).

Solution: Use L1 regularization to pick 10-20 disease-related genes, ignore the other 19,000+.

Result: Avoid overfitting, help doctors find causal genes.
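
And a sketch of the "far more features than samples" setting, with synthetic data standing in for real expression values (the 10 informative features and alpha=1.0 are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 samples, 20,000 "gene" features, only 10 of which carry signal
X, y = make_regression(n_samples=100, n_features=20000,
                       n_informative=10, noise=1.0, random_state=0)

model = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"{len(selected)} of 20000 features kept")  # only a small subset survives
```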

3. Recommendation Systems

Problem: User features run into hundreds of dimensions (age, gender, browsing history, purchase history...), many redundant.

Solution: Use L1 regularization to auto-remove redundancy, keep only 20-30 core features.

Result: Fast recommendations, lean model.

Key Takeaways

  1. Embedded feature selection = training + feature selection in one step, no separate operations.
  2. L1 regularization produces sparse solutions:
    • Tax method: absolute value (|w|)
    • Effect: forces model to zero out useless feature coefficients
    • Geometry: diamond has sharp corners, intersection lands on axes easily
  3. L2 regularization cannot produce sparse solutions:
    • Tax method: squared (w²)
    • Effect: only shrinks coefficients, never zeros them
    • Geometry: circle is smooth, intersection rarely touches axes
  4. Value of sparsity:
    • Automatic feature selection
    • Improved interpretability
    • Avoid overfitting
    • Faster inference
  5. Typical applications: LASSO regression, L1-regularized logistic regression, Elastic Net

Want to master regularization techniques?

Dive deeper into machine learning fundamentals, from feature engineering to model optimization. Learn when to use L1, L2, or Elastic Net regularization for your specific problem.
