
Embedded Methods: LASSO and L1 Regularization

Master embedded feature selection through L1 regularization (LASSO). Learn how LASSO automatically performs feature selection during model training, understand the geometric interpretation of sparse solutions, and implement proximal gradient descent.

Module 4 of 8
Intermediate to Advanced
120-150 min

What are Embedded Methods?

Embedded methods integrate feature selection into the model training process. Unlike filter and wrapper methods that perform feature selection separately, embedded methods automatically select features during learning by incorporating feature selection into the optimization objective.

Key Advantage

Embedded methods combine the efficiency of filter methods with the performance benefits of wrapper methods. They achieve feature selection and model training simultaneously in a single optimization process, typically through regularization.

Advantages

  • Efficient (no separate selection step)
  • Learner-specific optimization
  • Automatic feature selection
  • Produces sparse models
  • Good generalization

Limitations

  • Requires regularization tuning
  • Sensitive to hyperparameters
  • May be sensitive to noise
  • Less interpretable than filters
  • Requires optimization expertise

From Linear Regression to LASSO

Understanding the evolution from basic linear regression to LASSO:

Step 1: Basic Linear Regression

Standard linear regression minimizes squared error:

\min_w \sum_{i=1}^m (y_i - w^T x_i)^2

Problem: All features are used, no feature selection. May overfit with many features.

Step 2: L2 Regularization (Ridge Regression)

Add L2 penalty to prevent overfitting:

\min_w \sum_{i=1}^m (y_i - w^T x_i)^2 + \lambda \|w\|_2^2

where \|w\|_2^2 = \sum_{j=1}^d w_j^2 is the squared L2 norm.

Result: Shrinks weights toward zero but rarely sets them exactly to zero. No feature selection.

Step 3: L1 Regularization (LASSO)

Replace L2 with L1 penalty:

\min_w \sum_{i=1}^m (y_i - w^T x_i)^2 + \lambda \|w\|_1

where \|w\|_1 = \sum_{j=1}^d |w_j| is the L1 norm.

Result: Many weights become exactly zero, automatically performing feature selection!
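To make the contrast concrete, here is a minimal sketch using scikit-learn's Lasso and Ridge on synthetic data. The dataset and alpha values are illustrative assumptions, and scikit-learn scales the squared-error term internally, so its alpha is not numerically identical to the \lambda above:

```python
# Sketch: L1 (Lasso) zeroes out most weights, L2 (Ridge) only shrinks them.
# Synthetic data and alpha values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 50 features, only 5 of them truly informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("LASSO non-zero weights:", np.count_nonzero(lasso.coef_))  # typically close to 5
print("Ridge non-zero weights:", np.count_nonzero(ridge.coef_))  # typically all 50
```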

Geometric Interpretation: Why L1 Produces Sparse Solutions

The key difference between L1 and L2 regularization lies in the geometry of their constraint regions:

L2 Regularization (Ridge)

Constraint region: \|w\|_2^2 \leq t (circle/sphere)

The optimal solution (where the loss contours first touch the constraint region) typically lies on the smooth, curved boundary, away from the coordinate axes.

Result: Weights are small but non-zero. No sparsity.

L1 Regularization (LASSO)

Constraint region: \|w\|_1 \leq t (diamond/octahedron)

The optimal solution often lies at a corner of the constraint region, where it intersects the coordinate axes.

Result: Some weights are exactly zero. Sparse solution!

Key Insight

The "pointy" corners of the L1 constraint (diamond) make it more likely that the optimal solution touches an axis, setting that weight to zero. The smooth, rounded L2 constraint (circle) rarely produces exact zeros. This geometric property is why L1 regularization naturally performs feature selection.

Proximal Gradient Descent (PGD)

LASSO optimization cannot use standard gradient descent because the L1 norm is not differentiable at zero. Proximal Gradient Descent solves this by handling the non-smooth L1 term separately.

Problem Formulation

LASSO objective:

\min_w f(w) + \lambda \|w\|_1

where f(w) = \sum_{i=1}^m (y_i - w^T x_i)^2 is smooth (differentiable) and \lambda \|w\|_1 is non-smooth.

Step 1

Gradient Step

Take a gradient step on the smooth part f(w):

z = w_k - \frac{1}{L} \nabla f(w_k)

where L is the Lipschitz constant of \nabla f, which sets the step size 1/L.

Step 2

Proximal Mapping (Soft Thresholding)

Apply the proximal operator for L1 norm (soft thresholding):

w_{k+1}^i = \begin{cases} z^i - \frac{\lambda}{L} & \text{if } z^i > \frac{\lambda}{L} \\ 0 & \text{if } |z^i| \leq \frac{\lambda}{L} \\ z^i + \frac{\lambda}{L} & \text{if } z^i < -\frac{\lambda}{L} \end{cases}

This operation "shrinks" z^i toward zero and sets it exactly to zero if it falls within the threshold \lambda/L.
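In code, the soft-thresholding operator is a one-liner. The sketch below is a direct NumPy translation of the case analysis above (the helper name soft_threshold is ours):

```python
import numpy as np

def soft_threshold(z, threshold):
    # Shrink each component toward zero; components with |z^i| <= threshold become exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Example with threshold lambda/L = 0.5
z = np.array([1.2, 0.3, -0.7, -0.1])
print(soft_threshold(z, 0.5))   # [ 0.7  0.  -0.2 -0. ]
```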

Step 3

Iterate Until Convergence

Repeat steps 1-2 until w_k converges. The final solution w^* will have many zero weights, automatically selecting the relevant features.

Why It Works

The soft thresholding step explicitly sets small weights to zero, creating sparsity. The threshold \lambda/L controls the sparsity level: a larger \lambda produces more zeros (more feature selection).
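Putting the gradient step and the soft-thresholding step together gives the full iteration (often called ISTA). The NumPy sketch below assumes the objective \min_w \|y - Xw\|^2 + \lambda \|w\|_1 from this section; the synthetic data, \lambda value, and iteration count are illustrative assumptions:

```python
# Proximal gradient descent (ISTA) sketch for min_w ||y - Xw||^2 + lam * ||w||_1.
# Data, lam, and the iteration count are illustrative assumptions.
import numpy as np

def lasso_pgd(X, y, lam, n_iters=500):
    m, d = X.shape
    # Lipschitz constant of grad f(w) = 2 X^T (Xw - y) is 2 * sigma_max(X)^2
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(d)
    for _ in range(n_iters):
        z = w - (2.0 * X.T @ (X @ w - y)) / L                 # Step 1: gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # Step 2: soft thresholding
    return w

# Synthetic problem: only the first 3 of 20 features are relevant
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_hat = lasso_pgd(X, y, lam=20.0)
print("non-zero weight indices:", np.flatnonzero(w_hat))   # typically [0 1 2]
```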

L1 vs L2 Regularization: Detailed Comparison

Understanding when to use each:

Property | L1 (LASSO) | L2 (Ridge)
Sparsity | Yes (produces zeros) | No (weights shrink but non-zero)
Feature Selection | Automatic | No
Optimization | Proximal gradient descent | Standard gradient descent
Interpretability | High (sparse model) | Lower (all features used)
Use Case | Feature selection, high-dimensional data | Overfitting prevention, multicollinearity
Geometric Shape | Diamond (pointy corners) | Circle (smooth)

When to Use Each

  • Use L1 (LASSO): When you need feature selection, have many irrelevant features, want interpretable sparse models, or work with high-dimensional data (d > m).
  • Use L2 (Ridge): When all features are potentially relevant, you need to handle multicollinearity, or want smooth weight shrinkage without sparsity.
  • Use Both (Elastic Net): Combine the L1 and L2 penalties, \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2, to get the benefits of both (see the sketch after this list).
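As a rough sketch of the Elastic Net option, scikit-learn's ElasticNet exposes the mix through alpha (overall penalty strength) and l1_ratio (fraction of L1); the data and parameter values here are illustrative assumptions:

```python
# Elastic Net sketch: a weighted mix of L1 and L2 penalties.
# Data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50% L1, 50% L2
print("non-zero weights:", np.count_nonzero(enet.coef_))
```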

Practical Example: Gene Selection

In bioinformatics, we often have thousands of gene expression features but only hundreds of samples. LASSO automatically selects relevant genes for disease classification.

Problem Setup

Dataset: 200 samples, 5000 gene expression features. Task: Classify cancer vs. normal tissue.

LASSO Solution

With \lambda = 0.1, LASSO selects 47 genes (non-zero weights) and sets the remaining 4953 gene weights to zero.

The selected genes are biologically meaningful, and the model achieves 92% accuracy, comparable to using all features but far easier to interpret.

Comparison with Ridge

Ridge regression uses all 5000 genes with small weights. While accuracy is similar (93%), the model is less interpretable and doesn't identify key genes for further biological study.
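The figures above (47 genes, 92% and 93% accuracy) belong to the worked example. A runnable sketch of the same workflow could use L1-penalized logistic regression, the classification analogue of LASSO; everything generated below is a synthetic stand-in, not real gene expression data:

```python
# Sketch of LASSO-style gene selection for classification on synthetic data.
# All data here is a synthetic stand-in; sizes mirror the example (200 samples, 5000 features).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5000, n_informative=20,
                           n_redundant=0, random_state=0)

# L1-penalized logistic regression: smaller C means stronger regularization, fewer genes kept
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("genes selected:", np.count_nonzero(clf.coef_), "out of", X.shape[1])
```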

Key Takeaways

Embedded methods integrate feature selection into model training through regularization, achieving efficiency and performance simultaneously.

L1 regularization (LASSO) produces sparse solutions by setting many weights to zero, automatically performing feature selection.

The geometric "pointy corners" of the L1 constraint make sparse solutions more likely than with the smooth L2 constraint.

Proximal Gradient Descent handles the non-smooth L1 term through soft thresholding, explicitly creating sparsity.

LASSO is ideal for high-dimensional problems (d > m) where feature selection is critical for interpretability and generalization.

The regularization parameter \lambda controls the sparsity level: larger values produce more feature selection.
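A quick way to see this is to sweep the regularization strength and count the surviving features; a minimal sketch with scikit-learn's Lasso (the data and alpha grid are illustrative assumptions):

```python
# Sketch: larger regularization strength -> fewer non-zero weights.
# Data and alpha grid are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    n_nonzero = np.count_nonzero(Lasso(alpha=alpha, max_iter=50000).fit(X, y).coef_)
    print(f"alpha={alpha:<5} non-zero weights: {n_nonzero}")
```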