
Wrapper Methods: LVW

Master wrapper methods that tailor feature selection to a specific learning algorithm. Learn the Las Vegas Wrapper (LVW) algorithm, which combines random subset search with cross-validation to find high-performing feature subsets.

Module 3 of 8
Intermediate to Advanced
100-120 min

What are Wrapper Methods?

Wrapper methods use the performance of a learning algorithm as the evaluation criterion for feature subsets. Unlike filter methods that evaluate features independently, wrapper methods train models on candidate feature subsets and select features that maximize the learner's performance.

Core Principle

Wrapper methods directly optimize for the learning algorithm's performance. The feature subset that works best with a specific learner (e.g., decision tree, SVM) may differ from the subset that works best with another learner. This "tailored" approach often yields better performance than filter methods.

Advantages

  • Optimized for specific learner
  • Considers feature interactions
  • Better generalization performance
  • Accounts for learner bias
  • Can find optimal subsets

Limitations

  • Computationally expensive
  • Requires multiple model trainings
  • Risk of overfitting
  • Learner-specific (not general)
  • Time-consuming for large datasets

Las Vegas Wrapper (LVW) Algorithm

LVW uses a random search strategy to explore feature subsets. It evaluates each candidate subset by training the learning algorithm and measuring its cross-validation error. The algorithm continues until no improvement is found for a specified number of consecutive iterations.

Input

Algorithm Inputs

  • Dataset $D$
  • Full feature set $A$
  • Learning algorithm $\mathfrak{L}$
  • Stopping parameter $T$ (max consecutive non-improvements)

Step 1

Initialize

Initialize best error and feature set:

$E = \infty$ (best error so far)
$A^* = A$ (best feature set, start with all features)
$t = 0$ (consecutive non-improvement counter)

Step 2

Random Subset Generation

While $t < T$:

  1. Randomly generate a candidate feature subset $A'$
  2. Use cross-validation to evaluate $A'$ (see the sketch below):
    $E' = \text{CrossValidation}(D, A', \mathfrak{L})$
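
As a minimal sketch of this step (assuming NumPy arrays for the data and scikit-learn's `cross_val_score` as the cross-validation routine, neither of which is prescribed by the algorithm itself), candidate generation and evaluation might look like this:

```python
import random

from sklearn.model_selection import cross_val_score


def random_subset(n_features):
    """Randomly pick a non-empty candidate subset A' of feature indices."""
    size = random.randint(1, n_features)
    return sorted(random.sample(range(n_features), size))


def cv_error(learner, X, y, subset, k=5):
    """Estimate E' = CrossValidation(D, A', L) as 1 minus the mean k-fold accuracy."""
    scores = cross_val_score(learner, X[:, subset], y, cv=k)
    return 1.0 - scores.mean()
```
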
Step 3

Update Best Subset

If $E' < E$, or if $E' = E$ and $|A'| < |A^*|$:

$E = E'$
$A^* = A'$
$t = 0$ (reset counter)

Otherwise: $t = t + 1$ (increment counter)

Output

Return Best Subset

Output the optimal feature subset $A^*$ with error $E$.
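
Putting the steps together, here is a minimal Python sketch of the LVW loop. It reuses the `random_subset` and `cv_error` helpers sketched above; the function name `lvw` and its interface are illustrative choices rather than a standard API. Following the worked example later in this module, it initializes $E$ with the cross-validation error of the full feature set (the pseudocode above equivalently starts from $E = \infty$).

```python
def lvw(X, y, learner, T=10, k=5):
    """Las Vegas Wrapper: random search over feature subsets.

    X, y    : dataset D (feature matrix and labels)
    learner : learning algorithm L (a scikit-learn-style estimator)
    T       : stop after T consecutive non-improving iterations
    k       : number of cross-validation folds
    Returns the best subset A* (list of column indices) and its CV error E.
    """
    n_features = X.shape[1]
    best_subset = list(range(n_features))                  # A* = A (all features)
    best_error = cv_error(learner, X, y, best_subset, k)   # E
    t = 0                                                  # non-improvement counter

    while t < T:
        candidate = random_subset(n_features)              # A'
        error = cv_error(learner, X, y, candidate, k)      # E'
        # Accept if strictly better, or equally good but smaller (parsimony)
        if error < best_error or (error == best_error and
                                  len(candidate) < len(best_subset)):
            best_error, best_subset, t = error, candidate, 0
        else:
            t += 1

    return best_subset, best_error
```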

Key Design Choices

  • Random Search: Avoids getting stuck in local optima, but may require many iterations to find good solutions.
  • Tie-Breaking: When errors are equal, prefer smaller feature subsets (parsimony principle).
  • Stopping Criterion: Stop after $T$ consecutive iterations without improvement. This balances exploration time with convergence.
  • Cross-Validation: Uses cross-validation error to get reliable performance estimates and avoid overfitting to the training set.

Cross-Validation Evaluation

LVW uses cross-validation to evaluate feature subsets. This provides a reliable estimate of generalization performance and reduces overfitting risk.

k-Fold Cross-Validation Process

  1. Split dataset $D$ into $k$ folds (typically $k = 5$ or $k = 10$)
  2. For each fold $i = 1, 2, \ldots, k$:
    • Use fold $i$ as validation set
    • Use remaining $k - 1$ folds as training set
    • Train learner $\mathfrak{L}$ on training set with features $A'$
    • Evaluate error $E_i$ on validation set
  3. Average errors: $E' = \frac{1}{k}\sum_{i=1}^{k} E_i$ (see the sketch after this list)
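
For readers who want the fold loop spelled out rather than delegated to a library helper, the following sketch uses scikit-learn's `KFold` splitter; treating error as 1 minus accuracy is an assumption here (any loss suited to the learner would work):

```python
import numpy as np

from sklearn.base import clone
from sklearn.model_selection import KFold


def kfold_error(learner, X, y, subset, k=5):
    """Average validation error of `learner` using only the columns in `subset`."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = []
    for train_idx, val_idx in kf.split(X):
        model = clone(learner)                           # fresh model per fold
        model.fit(X[train_idx][:, subset], y[train_idx])
        acc = model.score(X[val_idx][:, subset], y[val_idx])
        errors.append(1.0 - acc)                         # E_i = 1 - accuracy
    return float(np.mean(errors))                        # E' = (1/k) * sum(E_i)
```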

Why Cross-Validation?

  • Reliable Estimates: Cross-validation provides unbiased estimates of generalization error by testing on held-out data.
  • Reduces Overfitting: By evaluating on multiple validation sets, we avoid selecting features that only work well on a single train-test split.
  • Better Feature Selection: Features selected based on cross-validation error tend to generalize better to new data.

Practical Example: Credit Scoring

Consider a credit scoring dataset with features: Income, Age, Employment, Credit Score, Debt Ratio, Loan Amount, and Loan History. We want to select features for a decision tree classifier to predict loan default.

Initialization

Start with all 7 features. Train decision tree with 5-fold cross-validation: $E = 0.18$ (18% error rate).

Iteration 1: Random Subset {Income, Credit Score, Debt Ratio}

Train decision tree with 3 features, 5-fold CV: $E' = 0.15$

$E' < E$ → Update: $E = 0.15$, $A^* = \{\text{Income, Credit Score, Debt Ratio}\}$, $t = 0$

Iterations 2-5: No Improvement

Random subsets {Age, Employment}, {Loan Amount}, etc. all have $E' \geq 0.15$. Counter increments: $t = 1, 2, 3, 4$

Iteration 6: Better Subset {Income, Credit Score, Debt Ratio, Loan History}

Train with 4 features, 5-fold CV: $E' = 0.12$

$E' < E$ → Update: $E = 0.12$, $A^* = \{\text{Income, Credit Score, Debt Ratio, Loan History}\}$, $t = 0$

Final Result

After $T = 10$ consecutive non-improvements, the algorithm stops. Optimal subset: {Income, Credit Score, Debt Ratio, Loan History} with a 12% error rate, reducing the feature count from 7 to 4 while improving performance.
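
As a usage sketch, the `lvw` function from the algorithm section could be run on a credit-scoring-style dataset like this one; the synthetic data, feature names, and decision tree settings below are illustrative assumptions, not the dataset used in the example above:

```python
import numpy as np

from sklearn.tree import DecisionTreeClassifier

feature_names = ["Income", "Age", "Employment", "Credit Score",
                 "Debt Ratio", "Loan Amount", "Loan History"]

# Synthetic stand-in for the credit-scoring data (replace with the real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] - X[:, 3] + 0.5 * X[:, 4]
     + rng.normal(scale=0.5, size=500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
subset, error = lvw(X, y, tree, T=10, k=5)
print("Selected features:", [feature_names[i] for i in subset])
print("Cross-validation error:", round(error, 3))
```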

Wrapper vs Filter Methods

Understanding the trade-offs helps choose the right method:

| Aspect | Filter Methods | Wrapper Methods |
| --- | --- | --- |
| Speed | Very fast (linear time) | Slow (requires model training) |
| Performance | Good, general-purpose | Better, learner-specific |
| Feature Interactions | Often ignored | Automatically considered |
| Overfitting Risk | Low | Higher (mitigated by CV) |
| Scalability | Excellent | Limited by training time |
| Use Case | Initial screening, large datasets | Final optimization, small-to-medium datasets |

Key Takeaways

Wrapper methods optimize feature selection for specific learning algorithms, often achieving better performance than filter methods.

LVW uses random subset search to avoid local optima, evaluating subsets with cross-validation for reliable performance estimates.

The algorithm stops after $T$ consecutive iterations without improvement, balancing exploration and convergence.

Cross-validation provides unbiased error estimates and reduces overfitting risk in feature selection.

Wrapper methods are computationally expensive but valuable when performance is critical and dataset size is manageable.

The tie-breaking rule (prefer smaller subsets when errors are equal) implements the parsimony principle.