
Feature Selection

Master filter, wrapper, and embedded methods for selecting optimal feature subsets

Three Categories of Feature Selection

Filter Methods

Statistical tests, model-independent

  • Correlation coefficient
  • Mutual information
  • Chi-square test
  • ANOVA F-test
  • Variance threshold

✓ Fast, scalable

✗ Ignores feature interactions

Wrapper Methods

Use model performance

  • Forward selection
  • Backward elimination
  • Recursive Feature Elimination
  • Genetic algorithms

✓ Considers interactions

✗ Computationally expensive

Embedded Methods

Built into model training

  • LASSO (L1 regularization)
  • Elastic Net
  • Random Forest importance
  • Gradient Boosting importance

✓ Balance speed & accuracy

✗ Model-specific

Mutual Information Mathematical Definition

Information-Theoretic Feature Relevance

Definition

Mutual information between feature X and target Y:

I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

Measures reduction in uncertainty about Y given X

Equivalent Forms

Via Entropy:

I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)

Via KL Divergence:

I(X;Y) = D_{KL}\left(p(x,y) \,\|\, p(x)p(y)\right)

Distance from independence

Properties

  • Non-negative: I(X;Y) ≥ 0, with equality iff X,Y independent
  • Symmetric: I(X;Y) = I(Y;X)
  • Bounded: I(X;Y) ≤ min[H(X), H(Y)]
  • Detects non-linear relationships (unlike correlation)

Estimation from Data

For discrete variables, use empirical frequencies:

\hat{I}(X;Y) = \sum_x \sum_y \frac{n_{xy}}{n} \log \frac{n \cdot n_{xy}}{n_x \cdot n_y}

For continuous: use binning or kernel density estimation
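A minimal sketch of this empirical estimator for two discrete variables, assuming numpy is available (the example arrays are illustrative):

import numpy as np

def empirical_mutual_information(x, y):
    """Estimate I(X;Y) in nats from two discrete arrays via empirical frequencies."""
    n = len(x)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            n_xy = np.sum((x == xv) & (y == yv))  # joint count
            n_x = np.sum(x == xv)                 # marginal counts
            n_y = np.sum(y == yv)
            if n_xy > 0:
                mi += (n_xy / n) * np.log(n * n_xy / (n_x * n_y))
    return mi

# Illustrative example: y mostly follows x, so the estimate is clearly positive
x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
print(empirical_mutual_information(x, y))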

LASSO: L1 Sparsity Proof

Why L1 Produces Exact Zeros

Constrained Optimization View

LASSO (L1):

\min \|\mathbf{y} - \mathbf{Xw}\|^2 \text{ s.t. } \|\mathbf{w}\|_1 \leq t

Constraint region: Diamond (has corners at axes)

Ridge (L2):

\min \|\mathbf{y} - \mathbf{Xw}\|^2 \text{ s.t. } \|\mathbf{w}\|_2 \leq t

Constraint region: Circle (smooth, no corners)

Geometric Intuition

Why corners matter:

  • OLS contours (ellipses) expand until touching the constraint region
  • The L1 diamond has corners on the axes → likely to touch at a corner
  • Touching at a corner means some w_i = 0 (exact sparsity)
  • The L2 circle is smooth → almost never touches exactly on an axis

Subgradient Analysis

Subdifferential of the penalty λ||w||_1 with respect to w_i at w_i = 0:

\partial_{w_i}\left(\lambda\|\mathbf{w}\|_1\right)\big|_{w_i=0} = [-\lambda, \lambda]

Optimality condition: If gradient from loss lies in [-λ, λ], then w_i = 0 is optimal.

This happens for many features when λ is large → automatic feature selection!
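In the special case of an orthonormal design, this optimality condition reduces to the soft-thresholding rule; a minimal numerical sketch (the data values and threshold are illustrative assumptions):

import numpy as np

def soft_threshold(z, lam):
    # LASSO solution per coefficient under an orthonormal design (up to constant
    # factors in lam, depending on how the squared loss is scaled): coefficients
    # whose OLS estimate has magnitude below lam are set exactly to zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ols_estimates = np.array([3.0, 0.4, -1.2, 0.05, -0.3])
print(soft_threshold(ols_estimates, lam=0.5))  # small coefficients become exactly 0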

Elastic Net

Combines L1 and L2 penalties:

\min \|\mathbf{y} - \mathbf{Xw}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2

  • L1 part: feature selection (sparsity)
  • L2 part: handles multicollinearity, grouping effect
  • Best of both worlds for high-dimensional data
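A minimal sketch comparing the sparsity of LASSO and Elastic Net with scikit-learn (assumed available); the synthetic data and penalty values are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic regression problem with only 5 truly informative features
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # l1_ratio balances L1 vs L2

print("LASSO non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Elastic Net non-zero coefficients:", np.sum(enet.coef_ != 0))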

Chi-Square Test for Independence

Statistical Test for Categorical Features

Test Statistic

\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

• O_ij = observed frequency in cell (i, j)
• E_ij = expected frequency under independence = (row total_i × column total_j) / n

Distribution

Under null hypothesis (independence), χ² follows chi-square distribution with:

df = (\text{rows}-1) \times (\text{cols}-1)

Large χ² → reject independence → feature is relevant!
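A minimal sketch of the test on an illustrative 2×2 contingency table, assuming scipy is available:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = feature categories, columns = target classes (illustrative)
observed = np.array([[30, 10],
                     [12, 28]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# Small p-value -> reject independence -> the feature is relevant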

ANOVA F-Test Derivation

Testing Mean Differences for Continuous Features

Sum of Squares Decomposition

Total sum of squares = between-group + within-group:

SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}

Between-group SS:

SS_B = \sum_k n_k(\bar{x}_k - \bar{x})^2

Within-group SS:

SS_W = \sum_k \sum_{i \in k}(x_i - \bar{x}_k)^2

F-Statistic

F = \frac{SS_B/(K-1)}{SS_W/(n-K)}

• Numerator: Between-group mean square
• Denominator: Within-group mean square
• Under H_0 (all means equal), F ~ F(K-1, n-K) distribution

Large F → group means differ → feature discriminates classes well!
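A minimal sketch using scikit-learn's ANOVA scorer (assumed available) on a standard toy dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
F, p_values = f_classif(X, y)  # one F statistic and p-value per feature
print("F statistics:", F)

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))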

Recursive Feature Elimination

Greedy Backward Selection with Model Feedback

Algorithm

Input: feature set F, target number of features k

While |F| > k:
  1. Train model on current features F
  2. Rank features by importance/weights
  3. Remove the least important feature
  4. F ← F \ {removed feature}

Return: F (final feature subset)

Feature Importance Criteria

  • Linear models (SVM, logistic): Use |w_i| (coefficient magnitude)
  • Tree models: Use Gini importance or permutation importance
  • General: Change in model performance when feature removed

Complexity Analysis

Starting with p features, removing to k features:

O((p-k) \times T(n, p))

where T(n, p) is cost of training model with n samples and p features.

Example: SVM-RFE with p=1000, k=10 trains ~990 models!

RFE-CV: Cross-Validated Version

Problem: How to choose k?

Solution: Run RFE with cross-validation at each step, plot performance vs number of features, choose k at elbow/peak.
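A minimal sketch of RFE and its cross-validated variant with scikit-learn (assumed available); the estimator and feature counts are illustrative:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5, random_state=0)
model = LogisticRegression(max_iter=1000)

rfe = RFE(estimator=model, n_features_to_select=5).fit(X, y)  # fixed target k
rfecv = RFECV(estimator=model, cv=5).fit(X, y)                # k chosen by cross-validation

print("RFE selected indices:", rfe.get_support(indices=True))
print("RFECV kept", rfecv.n_features_, "features")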

Correlation-Based Feature Selection

Handling Feature Redundancy

Pearson Correlation

r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2}\sqrt{\sum_i(y_i-\bar{y})^2}}

• r ∈ [-1, 1]
• |r| ≈ 1: strong linear relationship
• r ≈ 0: no linear relationship

Feature-Feature vs Feature-Target Correlation

Ideal feature:

  • High correlation with target: |r(X_i, y)| large
  • Low correlation with other features: |r(X_i, X_j)| small

Problem:

If X_1 and X_2 highly correlated and both correlated with y, including both adds little value (redundancy)
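A minimal sketch of this relevance/redundancy screening with numpy; the function name and threshold values are illustrative assumptions:

import numpy as np

def correlation_filter(X, y, target_thresh=0.3, redundancy_thresh=0.9):
    """Keep features with |r(X_i, y)| above target_thresh, dropping any feature
    that is nearly collinear (|r| > redundancy_thresh) with an already-kept one."""
    relevance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    selected = []
    for j in np.argsort(relevance)[::-1]:  # most relevant first
        if relevance[j] < target_thresh:
            break
        redundant = any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > redundancy_thresh
                        for k in selected)
        if not redundant:
            selected.append(int(j))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # feature 1 nearly duplicates feature 0
y = X[:, 0] + X[:, 2] + rng.normal(size=200)
print(correlation_filter(X, y))  # keeps one of the duplicated pair plus feature 2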

mRMR: Minimum Redundancy Maximum Relevance

Select features that maximize:

\max_S \left[\frac{1}{|S|}\sum_{x_i \in S} I(x_i; y) - \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i; x_j)\right]

First term: relevance (MI with target)
Second term: redundancy (MI between features)
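The mRMR objective is usually optimized greedily, adding one feature at a time. A minimal sketch using scikit-learn's mutual information estimators (assumed available), for a classification target with continuous features; names and parameters are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_greedy(X, y, k):
    """Greedy mRMR: start from the most relevant feature, then repeatedly add the
    candidate with the largest relevance-minus-redundancy score."""
    relevance = mutual_info_classif(X, y, random_state=0)  # I(x_i; y)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # Redundancy: average MI between candidate j and already-selected features
            redundancy = mutual_info_regression(X[:, selected], X[:, j], random_state=0).mean()
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

X, y = make_classification(n_samples=200, n_features=15, n_informative=4, random_state=0)
print(mrmr_greedy(X, y, k=4))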

Feature Selection Stability

Measuring Robustness of Selected Features

Stability Index

Run feature selection on M bootstrap samples, get feature sets S_1,...,S_M

\text{Stability} = \frac{2}{M(M-1)}\sum_{i<j} \frac{|S_i \cap S_j|}{|S_i \cup S_j|}

• Stability ∈ [0,1]
• High stability → robust feature selection
• Low stability → unstable, sensitive to data perturbations
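A minimal sketch of this pairwise-Jaccard stability score, given the selected subsets as Python sets (the example sets are illustrative):

from itertools import combinations

def selection_stability(feature_sets):
    """Average pairwise Jaccard similarity between selected feature subsets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Feature sets selected on three bootstrap samples (illustrative)
print(selection_stability([{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 6}]))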

Example: Gene Expression Analysis

Task: Classify cancer types using gene expression data (20,000 genes, 200 samples)

Filter (Mutual Info)

Selected top 100 genes in seconds. Accuracy: 85%

RFE with SVM

Selected 50 genes in hours. Accuracy: 92%

LASSO

Automatically selected 75 genes. Accuracy: 90%

Best Practice:

Use filter to reduce from 20,000 → 500, then wrapper/embedded to refine → 50-100 features. This two-stage approach is fast and effective.
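A minimal sketch of such a two-stage pipeline with scikit-learn (assumed available); the synthetic data, feature counts, and estimator are illustrative and not the gene-expression study above:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# High-dimensional synthetic stand-in: many features, few samples
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20, random_state=0)

pipeline = Pipeline([
    ("filter", SelectKBest(score_func=mutual_info_classif, k=500)),          # cheap first stage
    ("wrapper", RFE(LogisticRegression(max_iter=1000),
                    n_features_to_select=50, step=25)),                       # refine to 50
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("Training accuracy:", pipeline.score(X, y))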

Tree-Based Feature Importance

Gini Importance and Permutation Importance

Gini Importance (MDI)

Mean Decrease in Impurity: Total reduction in Gini impurity by feature j across all trees

\text{Importance}(j) = \sum_{t=1}^{T} \sum_{\text{node } n \text{ splits on } j} p(n)\, \Delta \text{Gini}(n)

• p(n) = fraction of samples reaching node n
• ΔGini(n) = impurity decrease from split
• Fast to compute (available during training)
• Biased toward high-cardinality features
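A minimal sketch of reading MDI importances from a trained Random Forest with scikit-learn (assumed available):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Normalized mean decrease in impurity, one value per feature
print(forest.feature_importances_)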

Permutation Importance

Algorithm:

  1. Train the model, get a baseline score S_0 on validation data
  2. For each feature j:
     • Permute the values of feature j (breaking its relationship with the target)
     • Compute the score S_j on the permuted data
     • Importance(j) = S_0 - S_j

Model-agnostic and not biased toward high-cardinality features, but computationally expensive (requires roughly n_features × n_repeats extra prediction passes over the validation data)
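A minimal sketch using scikit-learn's permutation_importance on held-out data (assumed available):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# S_0 - S_j averaged over repeats, one value per feature
print(result.importances_mean)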

SHAP Values

SHapley Additive exPlanations: Game-theoretic approach to feature importance

\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f(S \cup \{j\}) - f(S)\right]

• Average marginal contribution of feature j across all possible feature subsets
• Satisfies desirable axioms (efficiency, symmetry, dummy, additivity)
• TreeSHAP: efficient algorithm for tree ensembles
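A minimal sketch using the third-party shap package (assumed installed; the exact shape of the returned values varies by version and model type):

import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature contributions
print(np.shape(shap_values))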

Boruta Algorithm

All-Relevant Feature Selection with Shadow Features

Algorithm

  1. Create shadow features: permuted copies of all original features
  2. Train a Random Forest on the extended dataset (original + shadow)
  3. Compute importance Z_j for all features
  4. Find the maximum importance among the shadow features: Z_shadow_max
  5. For each original feature j:
     • If Z_j is significantly greater than Z_shadow_max: mark as important
     • If Z_j is significantly less than Z_shadow_max: mark as unimportant
     • Otherwise: tentative
  6. Remove confirmed unimportant features
  7. Repeat until all features are decided or the maximum number of iterations is reached

Statistical Test

Use binomial test to check if feature consistently beats shadow features across iterations

Controls the false-positive rate and tends to select more features than other methods (all-relevant rather than minimal-optimal selection)
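A minimal single-iteration sketch of the shadow-feature comparison with scikit-learn (assumed available); this omits Boruta's repeated iterations and binomial test, and the threshold rule is illustrative:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Shadow features: each original column permuted independently
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_extended = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_extended, y)
importances = forest.feature_importances_
z_shadow_max = importances[X.shape[1]:].max()  # best importance among shadow features

confirmed = np.where(importances[:X.shape[1]] > z_shadow_max)[0]
print("Features beating the best shadow feature:", confirmed)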

Method Comparison

Feature Selection Methods Summary
Method               | Type     | Complexity      | Interactions | Best Use Case
Mutual Information   | Filter   | O(nd)           | No           | Fast screening, non-linear
Chi²/ANOVA           | Filter   | O(nd)           | No           | Categorical/continuous targets
LASSO                | Embedded | O(nd²)          | Limited      | Linear models, automatic
RFE                  | Wrapper  | O(d·T_model)    | Yes          | High accuracy, small d
RF Importance        | Embedded | O(ndT log n)    | Yes          | Tree models, interpretability
Boruta               | Wrapper  | O(I·ndT log n)  | Yes          | All-relevant features

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main goal of feature selection?
  2. Filter methods evaluate features based on:
  3. Wrapper methods:
  4. In forward selection:
  5. Embedded methods:
  6. LASSO regression performs feature selection by:
  7. Mutual information measures:
  8. Recursive Feature Elimination (RFE):
  9. Which method is fastest for high-dimensional data?
  10. Feature importance from Random Forest is an example of: