Master filter, wrapper, and embedded methods for selecting optimal feature subsets
Filter methods: statistical tests, model-independent
✓ Fast, scalable
✗ Ignores feature interactions
Wrapper methods: use model performance to evaluate feature subsets
✓ Considers interactions
✗ Computationally expensive
Embedded methods: selection built into model training
✓ Balances speed & accuracy
✗ Model-specific
Mutual information between feature X and target Y:
Measures the reduction in uncertainty about Y once X is known.
Via entropy: I(X; Y) = H(Y) − H(Y | X)
Via KL divergence: I(X; Y) = D_KL( P(X, Y) ‖ P(X) P(Y) ) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
This is the distance of the joint distribution from independence.
For discrete variables, estimate with empirical frequencies: Î(X; Y) = Σ_{x,y} (n_xy / n) log [ n · n_xy / (n_x · n_y) ]
For continuous variables, use binning or kernel density estimation to approximate the densities.
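As a concrete sketch (using scikit-learn's k-NN-based MI estimator; the dataset here is just an illustrative placeholder):

```python
# Sketch: estimating mutual information between each feature and a class target.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# mutual_info_classif uses a k-nearest-neighbor estimator for continuous
# features (one practical alternative to explicit binning/KDE).
mi = mutual_info_classif(X, y, random_state=0)

# Rank features by MI with the target (higher = more informative).
top = np.argsort(mi)[::-1][:10]
print("Top-10 features by MI:", top, mi[top].round(3))
```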
LASSO (L1): min_w ||y − Xw||² + λ ||w||_1
Constraint region: diamond (has corners on the axes)
Ridge (L2): min_w ||y − Xw||² + λ ||w||²_2
Constraint region: circle (smooth, no corners)
Why corners matter: the loss contours typically first touch the L1 diamond at a corner, where some coordinates are exactly zero; the smooth L2 ball has no corners, so Ridge shrinks weights but rarely zeroes them.
Subdifferential of ||w||_1 at w_i = 0: ∂|w_i| = [−1, 1], so the penalty term can contribute any value in [−λ, λ].
Optimality condition: if the gradient of the loss with respect to w_i lies in [−λ, λ], then w_i = 0 is optimal.
This happens for many features when λ is large → automatic feature selection!
Elastic Net combines the L1 and L2 penalties: λ_1 ||w||_1 + λ_2 ||w||²_2
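A minimal sketch of this sparsity effect (synthetic data and regularization strengths are illustrative assumptions):

```python
# Sketch: L1 regularization zeroes out coefficients, giving automatic selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=0.5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                      # L1: corners -> exact zeros
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: shrinks, rarely zeros
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

for name, model in [("LASSO", lasso), ("Ridge", ridge), ("ElasticNet", enet)]:
    nonzero = np.sum(model.coef_ != 0)
    print(f"{name}: {nonzero}/{X.shape[1]} nonzero coefficients")
```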
Chi-square statistic for the feature–class contingency table: χ² = Σ_ij (O_ij − E_ij)² / E_ij, where
• O_ij = observed frequency in cell (i, j)
• E_ij = expected frequency under independence = (row_i total × col_j total) / n
Under the null hypothesis (independence), χ² follows a chi-square distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of feature categories and classes.
Large χ² → reject independence → the feature is relevant!
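A small worked sketch with scipy (the contingency counts are made-up placeholders):

```python
# Sketch: chi-square test of independence between a categorical feature and target.
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = feature categories, columns = classes.
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# Small p (large chi2) -> reject independence -> feature is relevant.
```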
Total variance decomposes into between-group and within-group parts: SS_total = SS_between + SS_within
Between-group SS: SS_between = Σ_k n_k (x̄_k − x̄)²
Within-group SS: SS_within = Σ_k Σ_i (x_ik − x̄_k)²
The F statistic is the ratio F = [SS_between / (K − 1)] / [SS_within / (n − K)]:
• Numerator: between-group mean square
• Denominator: within-group mean square
• Under H_0 (all group means equal), F ~ F(K − 1, n − K)
Large F → group means differ → the feature discriminates the classes well!
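For example, a minimal scikit-learn sketch (dataset chosen only for illustration):

```python
# Sketch: ANOVA F-test scores for continuous features vs. a class target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("F scores:", selector.scores_.round(2))
print("Selected feature indices:", selector.get_support(indices=True))
```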
Input: Features F, target number k
While |F| > k:
1. Train model on current features F
2. Rank features by importance/weights
3. Remove least important feature
4. F ← F without removed feature
Return: F (final feature subset)
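A sketch of this loop using scikit-learn's RFE with a linear SVM (dataset and parameters are illustrative):

```python
# Sketch: recursive feature elimination with a linear SVM as the ranking model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the linear SVM trains quickly

# step=1 removes one feature per iteration, mirroring the loop above.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)

print("Selected feature indices:", rfe.get_support(indices=True))
print("Feature ranking (1 = kept):", rfe.ranking_)
```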
Starting with p features and removing one per iteration down to k features, the total cost is Σ_{j=k+1}^{p} T(n, j), i.e. roughly (p − k) model fits,
where T(n, p) is the cost of training the model on n samples with p features.
Example: SVM-RFE with p=1000, k=10 trains ~990 models!
Problem: How to choose k?
Solution: Run RFE with cross-validation at each step, plot performance vs number of features, choose k at elbow/peak.
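A possible RFECV sketch in scikit-learn (dataset and settings are placeholders):

```python
# Sketch: choosing k automatically with cross-validated RFE (RFECV).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

rfecv = RFECV(estimator=LogisticRegression(max_iter=2000),
              step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Best mean CV score:", rfecv.cv_results_["mean_test_score"].max().round(3))
```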
Pearson correlation between a feature X and the target y: r = cov(X, y) / (σ_X σ_y)
• r ∈ [−1, 1]
• |r| ≈ 1: strong linear relationship
• r ≈ 0: no linear relationship
Ideal feature: high |correlation| with the target, low correlation with already-selected features.
Problem: if X_1 and X_2 are highly correlated and both correlated with y, including both adds little value (redundancy).
Select features that maximize relevance minus redundancy: J(X_j) = I(X_j; y) − (1/|S|) Σ_{X_i ∈ S} I(X_j; X_i), where S is the set of already-selected features.
First term: relevance (MI with the target)
Second term: redundancy (average MI with the already-selected features)
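A simplified greedy sketch of this relevance-minus-redundancy selection (the MI estimators and dataset are assumptions, not a reference implementation):

```python
# Sketch: greedy selection — at each step add the feature with the best
# relevance-minus-redundancy score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_select = 10

relevance = mutual_info_classif(X, y, random_state=0)   # I(X_j; y)
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < n_select:
    best_j, best_score = None, -np.inf
    for j in remaining:
        if selected:
            # Average I(X_j; X_i) over already-selected features (redundancy).
            redundancy = np.mean([
                mutual_info_regression(X[:, [i]], X[:, j], random_state=0)[0]
                for i in selected
            ])
        else:
            redundancy = 0.0
        score = relevance[j] - redundancy
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)
    remaining.remove(best_j)

print("Greedily selected features:", selected)
```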
Run feature selection on M bootstrap samples to get feature sets S_1, …, S_M, then measure stability as the average pairwise overlap of the selected sets (e.g., Jaccard similarity |S_i ∩ S_j| / |S_i ∪ S_j| averaged over all pairs).
• Stability ∈ [0, 1]
• High stability → robust feature selection
• Low stability → unstable, sensitive to data perturbations
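One way to estimate this (a sketch assuming average pairwise Jaccard overlap as the stability measure; the selector and dataset are placeholders):

```python
# Sketch: selection stability via average pairwise Jaccard overlap of the
# feature sets chosen on bootstrap resamples.
import numpy as np
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
M, k = 20, 10

feature_sets = []
for _ in range(M):
    idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample
    sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
    feature_sets.append(set(sel.get_support(indices=True)))

# Average pairwise Jaccard similarity, a value in [0, 1].
jaccards = [len(a & b) / len(a | b) for a, b in combinations(feature_sets, 2)]
print("Stability:", round(float(np.mean(jaccards)), 3))
```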
Task: Classify cancer types using gene expression data (20,000 genes, 200 samples)
• Filter (Mutual Info): selected the top 100 genes in seconds. Accuracy: 85%
• RFE with SVM: selected 50 genes in hours. Accuracy: 92%
• LASSO: automatically selected 75 genes. Accuracy: 90%
Best Practice:
Use a filter to reduce from 20,000 to ~500 genes, then a wrapper or embedded method to refine down to 50–100 features. This two-stage approach is fast and effective (see the sketch below).
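One way to realize the two-stage idea as a single scikit-learn pipeline (stage sizes, estimators, and the dataset are placeholders):

```python
# Sketch: cheap filter stage (MI) followed by a slower wrapper refinement stage.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1: fast filter screening down to a manageable candidate set.
    ("filter", SelectKBest(mutual_info_classif, k=20)),
    # Stage 2: slower wrapper refinement on the survivors.
    ("wrapper", RFE(LogisticRegression(max_iter=2000), n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=2000)),
])
pipeline.fit(X, y)
print("Final model uses", pipeline.named_steps["wrapper"].n_features_, "features")
```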
Mean Decrease in Impurity (MDI): the total reduction in Gini impurity attributable to feature j, summed over all splits on j and averaged across trees: Importance(j) = (1/T) Σ_trees Σ_{nodes n splitting on j} p(n) · ΔGini(n), where
• p(n) = fraction of samples reaching node n
• ΔGini(n) = impurity decrease from the split at n
• Fast to compute (available during training)
• Biased toward high-cardinality features
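A minimal sketch of reading MDI importances from a fitted forest (dataset is a placeholder):

```python
# Sketch: Mean Decrease in Impurity importances from a random forest.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based (MDI) scores, normalized to sum to 1.
order = np.argsort(rf.feature_importances_)[::-1]
print("Top-5 features by MDI:", order[:5], rf.feature_importances_[order[:5]].round(3))
```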
Algorithm: compute a baseline score on held-out data; for each feature j, randomly shuffle column j, re-score the model, and record the drop in performance; repeat the shuffling several times and average.
Model-agnostic and unbiased, but computationally expensive (requires n_features × n_permutations rounds of prediction).
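A sketch with scikit-learn's implementation (dataset and repeat count are illustrative):

```python
# Sketch: model-agnostic permutation importance on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and measure the drop in test accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("Mean importance of feature 0:", result.importances_mean[0].round(4))
```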
SHapley Additive exPlanations (SHAP): a game-theoretic approach to feature importance.
• φ_j = Σ_{S ⊆ F\{j}} [ |S|! (|F| − |S| − 1)! / |F|! ] · [ f(S ∪ {j}) − f(S) ]: the average marginal contribution of feature j across all possible feature subsets
• Satisfies desirable axioms (efficiency, symmetry, dummy, additivity)
• TreeSHAP: efficient algorithm for tree ensembles
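A sketch assuming the third-party `shap` package is installed (dataset and model are placeholders):

```python
# Sketch: TreeSHAP feature attributions for a tree ensemble (assumes `shap` is installed).
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)          # efficient TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X[:100])   # per-sample, per-feature contributions

# Depending on the shap version, binary-classification output is a list of two
# arrays or a single 3-D array; take the class-1 contributions in either case.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(vals).mean(axis=0)
print("Top-5 features by mean |SHAP|:", np.argsort(importance)[::-1][:5])
```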
Boruta adds shuffled "shadow" copies of every feature, trains a random forest, and uses a binomial test to check whether a real feature consistently beats the best shadow feature across iterations.
It controls the false-positive rate and tends to select more features than other methods (all-relevant rather than minimal-optimal selection).
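A sketch assuming the third-party `boruta` package (BorutaPy) is installed and compatible with your numpy/scikit-learn versions (dataset and forest settings are placeholders):

```python
# Sketch: all-relevant selection with BorutaPy (assumes the `boruta` package is installed).
from boruta import BorutaPy
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(X, y)   # BorutaPy expects plain numpy arrays

print("Confirmed features:", list(boruta.support_.nonzero()[0]))
print("Tentative features:", list(boruta.support_weak_.nonzero()[0]))
```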
| Method | Type | Complexity | Interactions | Best Use Case |
|---|---|---|---|---|
| Mutual Information | Filter | O(nd) | No | Fast screening, non-linear |
| Chi²/ANOVA | Filter | O(nd) | No | Categorical (χ²) or continuous (ANOVA) features, class targets |
| LASSO | Embedded | O(nd²) | Limited | Linear models, automatic |
| RFE | Wrapper | O(d·T_model) | Yes | High accuracy, small d |
| RF Importance | Embedded | O(ndT log n) | Yes | Tree models, interpretability |
| Boruta | Wrapper | O(I·ndT log n) | Yes | All-relevant features |