
L1 Regularization and Sparsity: Why It Auto-Deletes Features

The taxation analogy that explains LASSO's secret weapon

2026-01-24
L1 Regularization
Sparsity
LASSO
Feature Selection
Regularization

The "Lazy" Way to Select Features While Training

Ever faced this dilemma? You've got 100 features, but no clue which ones actually matter and which are just noise.

Tried filter methods? You evaluate features separately, then train — two-step hassle. Wrapper methods? Every feature combo tweak means retraining the entire model. Painfully slow.

What if there's a "lazy" approach — one that automatically kicks out useless features during training?

There is. It's called embedded feature selection. And its secret weapon? L1 regularization.

This article tackles one crucial question: Why does L1 regularization produce sparse solutions (coefficients equal to zero), while L2 regularization doesn't? Nail this, and you've unlocked the essence of LASSO regression, Elastic Net, and a whole class of algorithms.

Embedded Selection: Training and Feature Selection in One Shot

Let's clarify what we mean. Suppose you're training a "good watermelon classifier" with 4 features:

  • Stem (fresh/withered)
  • Sound (crisp/dull)
  • Color (red/green)
  • Texture (clear/blurry)

Three feature selection approaches:

Filter: Evaluate each feature's contribution separately, pick "Stem + Sound," then train.

Wrapper: Try random feature combos, retrain, adjust based on results. Rinse and repeat.

Embedded: Bake "feature selection" into training — the model judges which features are useless and shrinks their influence to zero, effectively auto-excluding them.

After training, you might get a formula like this:

$$\text{Good Melon Score} = 0.8 \times \text{Stem} + 0.7 \times \text{Sound} + 0 \times \text{Color} + 0 \times \text{Texture}$$

Color and Texture coefficients = 0. The model completely ignored those two features. That's embedded selection: no separate feature-picking step, it happened automatically during training.

So how does the model know which coefficients to zero out?

Answer: L1 regularization.

Why L1 Produces Sparse Solutions

Embedded selection "kicks features" by adding a constraint to the training objective — regularization.

L1 regularization can drive some coefficients to exactly zero (sparse solution). L2 can only shrink coefficients, never zero them out. Why?

Let's use a "taxation" analogy.

1. The Model's Dual Goal: Accurate Yet Simple

Model training balances two objectives:

Objective 1: Predict accurately (low loss)
The score calculated from "Stem + Sound" should closely match the true "good/bad melon" labels.

Objective 2: Keep it simple (few features)
Don't use too many useless features (like Color, Texture), or the model gets complex and overfits.

Regularization enforces "Objective 2." L1 and L2 both add constraints, but their methods differ drastically, leading to vastly different outcomes.
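
In symbols: writing $L(w)$ for the prediction loss and $\lambda$ for the regularization strength (two symbols that don't appear in the analogy itself), the two regularized training objectives are

$$\min_{w} \; L(w) + \lambda \sum_{j} |w_j| \quad \text{(L1)} \qquad\qquad \min_{w} \; L(w) + \lambda \sum_{j} w_j^2 \quad \text{(L2)}$$

The larger $\lambda$ is, the heavier the "tax" described below and the stronger the push toward Objective 2.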

2. L1 Regularization: "Tax by Absolute Value"

Think of feature coefficients as "each feature's budget." L1 regularization taxes the absolute value of budgets:

$$\text{Total Tax} = |w_{\text{Stem}}| + |w_{\text{Sound}}| + |w_{\text{Color}}| + |w_{\text{Texture}}|$$

To pay less tax, the model has two options:

Option A: Shrink all coefficients a bit
E.g., Stem from 0.8 to 0.7, Sound from 0.7 to 0.6.
Problem: This hurts prediction accuracy (features lose influence).

Option B: Zero out useless features
E.g., Color and Texture from 0.2 to 0.
Benefit: Total tax drops by 0.2 + 0.2 = 0.4, and prediction accuracy doesn't suffer (since those features were useless anyway).

So L1 forces the model to delete features: useless feature coefficients go straight to zero, while useful ones remain intact. That's a sparse solution.
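
A quick way to check the arithmetic is to compute the L1 tax for both options directly. The sketch below just replays the hypothetical watermelon coefficients from the text; the helper function is illustrative, not from any library:

```python
# Hypothetical coefficients from the watermelon example above
original = {"Stem": 0.8, "Sound": 0.7, "Color": 0.2, "Texture": 0.2}
option_a = {"Stem": 0.7, "Sound": 0.6, "Color": 0.2, "Texture": 0.2}  # shrink everything a bit
option_b = {"Stem": 0.8, "Sound": 0.7, "Color": 0.0, "Texture": 0.0}  # zero out the useless features

def l1_tax(weights):
    """L1 'tax': the sum of absolute coefficient values."""
    return sum(abs(w) for w in weights.values())

print(round(l1_tax(original), 2))  # 1.9
print(round(l1_tax(option_a), 2))  # 1.7 -- cheaper, but prediction accuracy suffers
print(round(l1_tax(option_b), 2))  # 1.5 -- tax drops by 0.4 and accuracy is untouched
```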

3. L2 Regularization: "Tax by Squared Value"

L2 regularization taxes the square of coefficients:

$$\text{Total Tax} = w_{\text{Stem}}^2 + w_{\text{Sound}}^2 + w_{\text{Color}}^2 + w_{\text{Texture}}^2$$

Say Color's coefficient is 0.2. Squared: 0.04. Drop it to 0.1, squared: 0.01. Tax saved: 0.03.

Now drop further to 0.01. Squared: 0.0001. Tax saved: only 0.0099. Diminishing returns.

In this scenario, the model prefers to "shrink all useless feature coefficients" rather than "zero them out completely" — because even at 0.01, the squared tax is tiny. Setting it to 0 only saves another 0.0001 in tax. Not worth the loss.

So L2 merely makes useless feature coefficients very small (like 0.001) but never zero. Features stay in the model, just with minimal influence. Not a sparse solution.
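
The same asymmetry shows up in a one-coefficient calculation. As a minimal sketch, assume the loss around a single coefficient is the simple quadratic $\tfrac{1}{2}(w - z)^2$ for some target value $z$ (a setup not used elsewhere in this article). The L1-penalized minimizer snaps small coefficients to exactly zero, while the L2-penalized one only rescales them:

```python
import numpy as np

def l1_solution(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*|w|: soft-thresholding.
    # Any |z| <= lam lands exactly on zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_solution(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*w**2: pure shrinkage.
    # Nonzero inputs stay nonzero, only smaller.
    return z / (1.0 + 2.0 * lam)

z = np.array([0.8, 0.7, 0.2, 0.2])  # "useful, useful, useless, useless" (hypothetical values)
print(l1_solution(z, 0.3))          # [0.5 0.4 0.  0. ]         -> exact zeros appear
print(l2_solution(z, 0.3))          # [0.5 0.4375 0.125 0.125]  -> small but never zero
```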

4. Geometric View: Diamond vs Circle

Suppose the model has two features: Stem (coefficient $w_1$) and Color (coefficient $w_2$). Training seeks the $w_1$ and $w_2$ that balance "accuracy" and "low tax."

Geometrically, this translates to finding the point where the loss contours (ellipses; the closer to their center, the better the prediction) first touch the constraint region defined by the regularizer.

L1's regularization contour is a diamond:

$$|w_1| + |w_2| = \text{constant}$$

Diamonds have sharp corners, located on the axes (e.g., $w_1 = 1, w_2 = 0$). The ellipse easily hits a corner, meaning one coefficient goes to zero: a sparse solution.

L2's regularization contour is a circle:

$$w_1^2 + w_2^2 = \text{constant}$$

Circles are smooth, with no corners. When the ellipse meets the circle, it rarely touches an axis (instead landing somewhere like $w_1 = 0.8, w_2 = 0.6$), so both coefficients stay nonzero: not sparse.

Intuition

The diamond's sharp corners "actively push" the intersection toward axes (coefficient = 0). The circle lacks this structure; intersections rarely land on axes.
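
If you want to draw this picture yourself, here's a small matplotlib sketch. The loss ellipse's center and shape are made-up choices for illustration; only the diamond and circle come from the formulas above:

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = (w1 - 1.2) ** 2 + 3 * (w2 - 0.8) ** 2  # hypothetical loss contours (an off-center ellipse)

fig, (ax_l1, ax_l2) = plt.subplots(1, 2, figsize=(10, 5))
for ax, title in [(ax_l1, "L1: |w1| + |w2| = c"), (ax_l2, "L2: w1^2 + w2^2 = c")]:
    ax.contour(w1, w2, loss, levels=8, colors="gray")  # loss ellipses
    ax.axhline(0, lw=0.5, color="black")
    ax.axvline(0, lw=0.5, color="black")
    ax.set_title(title)
    ax.set_aspect("equal")

ax_l1.contour(w1, w2, np.abs(w1) + np.abs(w2), levels=[1.0], colors="red")  # diamond constraint
ax_l2.contour(w1, w2, w1 ** 2 + w2 ** 2, levels=[1.0], colors="blue")       # circle constraint
plt.show()
```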

Real Example: LASSO Regression's Sparse Effect

LASSO (Least Absolute Shrinkage and Selection Operator) = Linear Regression + L1 Regularization.

Suppose 10 features predict house price. After LASSO training:

$$\text{Price} = 50000 \times \text{Location} + 30000 \times \text{Area} + 0 \times \text{Floor} + 0 \times \text{Orientation} + \ldots + 0 \times \text{Guard Height}$$

Out of 10 features, only 2 have nonzero coefficients (Location, Area). The other 8 are all zero.

This means:

  1. Automatic feature selection: LASSO auto-identified the 2 most important features
  2. Simplified model: Final model uses only 2 features, easy to interpret
  3. Avoid overfitting: Fewer redundant features, stronger generalization

If you used ordinary linear regression (no regularization) or Ridge (L2), all 10 features would have coefficients. Some might be tiny (like 0.001), but none would be zero.
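
Here is a hypothetical version of that comparison in scikit-learn (the synthetic data, the true coefficients 5 and 3, and the alpha value are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 10 stand-in "house" features
true_w = np.array([5.0, 3.0] + [0.0] * 8)   # only the first two actually matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ols = LinearRegression().fit(X, y)

print("LASSO:", np.round(lasso.coef_, 3))   # the 8 useless features come out exactly 0
print("OLS:  ", np.round(ols.coef_, 3))     # tiny but nonzero weights on every feature
```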

L1 vs L2: Summary Table

| Aspect | L1 Regularization | L2 Regularization |
| --- | --- | --- |
| Tax Method | Absolute value: \|w₁\| + \|w₂\| | Squared: w₁² + w₂² |
| Coefficient Behavior | Some go to exactly 0 | All shrink, but never to 0 |
| Geometric Shape | Diamond (sharp corners) | Circle (smooth) |
| Sparsity | Produces sparse solutions | Cannot produce sparse solutions |
| Feature Selection | Auto-removes useless features | Only reduces feature weights |
| Typical Use | LASSO, Elastic Net | Ridge Regression |

Why Sparsity Matters

1. Improves Interpretability

Sparse models use few features, easy to explain to stakeholders:

"House price mainly depends on location and area" (2 features)
vs
"House price depends on location, area, floor, orientation, renovation, complex, school district, subway, greenery, parking — all combined" (10 features)

Former: crystal clear. Latter: says nothing useful.

2. Avoids Overfitting

Fewer features = simpler model = less likely to memorize training noise = better generalization.

3. Speeds Up Inference

Prediction requires computing fewer features, so it's faster. For mobile deployment, going from 10 features to 2 cuts the per-prediction computation of a linear model by roughly 5x.

Real-World Applications

1. High-Dimensional Sparse Data: Text Classification

Problem: Text classification vocabularies may have 10,000 words, but each document uses only dozens.

Solution: Use LASSO or L1-regularized logistic regression to auto-select a few hundred key words (like "winner," "free" for spam), drop 9,000+ irrelevant ones.

Result: Faster, more accurate, more interpretable model.
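
A toy sketch of that idea with scikit-learn (the four documents, the labels, and C=10.0 are invented; a real corpus would have thousands of documents and words):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["free prize winner click now", "meeting agenda attached",
        "winner winner free gift card", "project status update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
clf.fit(docs, labels)

vocab = clf.named_steps["tfidfvectorizer"].get_feature_names_out()
weights = clf.named_steps["logisticregression"].coef_[0]
kept = {word: round(w, 2) for word, w in zip(vocab, weights) if w != 0}
print(kept)  # typically only a few words keep a nonzero weight
```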

2. Gene Expression Analysis

Problem: Gene chips have 20,000 gene features but only 100 samples (features vastly outnumber samples).

Solution: Use L1 regularization to pick 10-20 disease-related genes, ignore the other 19,000+.

Result: Avoid overfitting, help doctors find causal genes.
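
And a sketch of the "far more features than samples" setting, with synthetic data standing in for real expression values (the 10 informative features and alpha=1.0 are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 samples, 20,000 "gene" features, only 10 of which carry signal
X, y = make_regression(n_samples=100, n_features=20000,
                       n_informative=10, noise=1.0, random_state=0)

model = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"{len(selected)} of 20000 features kept")  # only a small subset survives
```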

3. Recommendation Systems

Problem: User features run into hundreds of dimensions (age, gender, browsing history, purchase history...), many redundant.

Solution: Use L1 regularization to auto-remove redundancy, keep only 20-30 core features.

Result: Fast recommendations, lean model.

Key Takeaways

  1. Embedded feature selection = training + feature selection in one step, no separate operations.
  2. L1 regularization produces sparse solutions:
    • Tax method: absolute value (|w|)
    • Effect: forces model to zero out useless feature coefficients
    • Geometry: diamond has sharp corners, intersection lands on axes easily
  3. L2 regularization cannot produce sparse solutions:
    • Tax method: squared (w²)
    • Effect: only shrinks coefficients, never zeros them
    • Geometry: circle is smooth, intersection rarely touches axes
  4. Value of sparsity:
    • Automatic feature selection
    • Improved interpretability
    • Avoid overfitting
    • Faster inference
  5. Typical applications: LASSO regression, L1-regularized logistic regression, Elastic Net

Want to master regularization techniques?

Dive deeper into machine learning fundamentals, from feature engineering to model optimization. Learn when to use L1, L2, or Elastic Net regularization for your specific problem.
