Master embedded feature selection through L1 regularization (LASSO). Learn how LASSO automatically performs feature selection during model training, understand the geometric interpretation of sparse solutions, and implement proximal gradient descent.
Embedded methods integrate feature selection into the model training process. Unlike filter and wrapper methods that perform feature selection separately, embedded methods automatically select features during learning by incorporating feature selection into the optimization objective.
Embedded methods combine the efficiency of filter methods with the performance benefits of wrapper methods. They achieve feature selection and model training simultaneously in a single optimization process, typically through regularization.
Understanding the evolution from basic linear regression to LASSO:
Standard linear regression minimizes squared error:

$$\min_{w} \ \frac{1}{2n} \|y - Xw\|_2^2$$
Problem: All features are used, no feature selection. May overfit with many features.
Add L2 penalty to prevent overfitting:

$$\min_{w} \ \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

where $\|w\|_2^2 = \sum_j w_j^2$ is the L2 norm squared.
Result: Shrinks weights toward zero but rarely sets them exactly to zero. No feature selection.
Replace L2 with L1 penalty:

$$\min_{w} \ \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \|w\|_1$$

where $\|w\|_1 = \sum_j |w_j|$ is the L1 norm.
Result: Many weights become exactly zero, automatically performing feature selection!
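To see the contrast concretely, here is a minimal sketch (assuming scikit-learn and NumPy are available; scikit-learn calls the regularization strength `alpha` rather than $\lambda$) that fits Ridge and LASSO on the same synthetic data and counts the exact zeros:

```python
# Sketch: fit Ridge and LASSO on the same synthetic regression data and
# compare how many weights each model sets exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 50 features, but only 5 features actually drive the target.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))  # typically 0
print("LASSO exact zeros:", np.sum(lasso.coef_ == 0))  # typically ~45 of 50
```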
The key difference between L1 and L2 regularization lies in the geometry of their constraint regions:
L2 (Ridge) constraint region: $\|w\|_2^2 \le t$ (circle/sphere)
The optimal solution (where a contour of the loss first touches the constraint boundary) typically lies at a point away from the coordinate axes, because the smooth circular boundary has no corners.
Result: Weights are small but non-zero. No sparsity.
L1 (LASSO) constraint region: $\|w\|_1 \le t$ (diamond/octahedron)
The optimal solution often lies at a corner of the constraint region, where it intersects the coordinate axes.
Result: Some weights are exactly zero. Sparse solution!
The "pointy" corners of the L1 constraint (diamond) make it more likely that the optimal solution touches an axis, setting that weight to zero. The smooth, rounded L2 constraint (circle) rarely produces exact zeros. This geometric property is why L1 regularization naturally performs feature selection.
LASSO optimization cannot use standard gradient descent because the L1 norm is not differentiable at zero. Proximal Gradient Descent solves this by handling the non-smooth L1 term separately.
LASSO objective:

$$\min_{w} \ F(w) = f(w) + g(w), \qquad f(w) = \frac{1}{2n}\|y - Xw\|_2^2, \quad g(w) = \lambda \|w\|_1$$

where $f$ is smooth (differentiable), and $g$ is non-smooth.
Step 1: Take a gradient step on the smooth part $f$:

$$z^{(k)} = w^{(k)} - \frac{1}{L} \nabla f\!\left(w^{(k)}\right)$$

where $L$ is the Lipschitz constant of $\nabla f$ (it sets the step size $1/L$).
Step 2: Apply the proximal operator for the L1 norm (soft thresholding):

$$w_j^{(k+1)} = \operatorname{sign}\!\left(z_j^{(k)}\right)\,\max\!\left(\left|z_j^{(k)}\right| - \frac{\lambda}{L},\ 0\right)$$

This operation "shrinks" each coordinate $z_j^{(k)}$ toward zero and sets it exactly to zero if it lies within the threshold $\lambda/L$.
Repeat steps 1-2 until $w$ converges. The final solution will have many zero weights, automatically selecting the relevant features.
The soft thresholding step explicitly sets small weights to zero, creating sparsity. The threshold $\lambda/L$ controls the sparsity level: a larger $\lambda$ produces more zeros (more feature selection).
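Below is a minimal NumPy sketch of this procedure (often called ISTA); the function names `soft_threshold` and `lasso_ista` are illustrative, not from any library:

```python
# Minimal proximal gradient descent (ISTA) for LASSO, assuming only NumPy.
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink toward zero, clip small values to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 with ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of grad f
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n           # step 1: gradient of the smooth part
        z = w - grad / L
        w = soft_threshold(z, lam / L)         # step 2: proximal (soft thresholding) step
    return w

# Quick check on synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)
w_hat = lasso_ista(X, y, lam=0.1)
print("non-zero indices:", np.flatnonzero(w_hat))  # typically just the first three
```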
Understanding when to use each:
| Property | L1 (LASSO) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes (produces zeros) | No (weights shrink but non-zero) |
| Feature Selection | Automatic | No |
| Optimization | Proximal gradient descent | Standard gradient descent |
| Interpretability | High (sparse model) | Lower (all features used) |
| Use Case | Feature selection, high-dimensional data | Overfitting prevention, multicollinearity |
| Geometric Shape | Diamond (pointy corners) | Circle (smooth) |
In bioinformatics, we often have thousands of gene expression features but only hundreds of samples. LASSO automatically selects relevant genes for disease classification.
Dataset: 200 samples, 5000 gene expression features. Task: Classify cancer vs. normal tissue.
With a suitable choice of $\lambda$, LASSO selects 47 genes (non-zero weights) and sets the weights of the remaining 4953 genes to zero.
The selected genes are biologically meaningful, and the model achieves 92% accuracy, comparable to using all features while remaining far easier to interpret.
Ridge regression uses all 5000 genes with small weights. While accuracy is similar (93%), the model is less interpretable and doesn't identify key genes for further biological study.
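The workflow can be sketched on synthetic data of the same shape (200 samples, 5000 features) using scikit-learn's L1-penalized logistic regression; the dataset, the value of `C` (the inverse regularization strength), and the resulting counts are illustrative and are not meant to reproduce the gene-expression numbers above:

```python
# Sketch: L1-penalized logistic regression on synthetic high-dimensional data
# (200 samples, 5000 features), mirroring the shape of the scenario above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5000, n_informative=30,
                           n_redundant=0, random_state=0)

# Smaller C means a stronger L1 penalty (C is the inverse of the penalty strength).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print("features selected:", selected.size, "of", X.shape[1])
print("5-fold CV accuracy: %.2f" % cross_val_score(clf, X, y, cv=5).mean())
```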
Embedded methods integrate feature selection into model training through regularization, achieving efficiency and performance simultaneously.
L1 regularization (LASSO) produces sparse solutions by setting many weights to zero, automatically performing feature selection.
The geometric "pointy corners" of the L1 constraint make sparse solutions more likely than with the smooth L2 constraint.
Proximal Gradient Descent handles the non-smooth L1 term through soft thresholding, explicitly creating sparsity.
LASSO is ideal for high-dimensional problems ($p \gg n$, far more features than samples) where feature selection is critical for interpretability and generalization.
The regularization parameter $\lambda$ controls the sparsity level: larger values produce more feature selection.
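A short sketch of this last point, assuming scikit-learn (whose `alpha` plays the role of $\lambda$): sweep the regularization strength and count how many weights survive.

```python
# Sketch: larger regularization strength -> fewer surviving (non-zero) weights.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print(f"alpha={alpha:<5} non-zero weights: {np.sum(coef != 0)}")
```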