
Wrapper Methods: LVW

Master wrapper methods that tailor feature selection to a specific learning algorithm. Learn the Las Vegas Wrapper (LVW) algorithm, which combines random subset search with cross-validation to find high-performing feature subsets.

Module 3 of 8
Intermediate to Advanced
100-120 min

What are Wrapper Methods?

Wrapper methods use the performance of a learning algorithm as the evaluation criterion for feature subsets. Unlike filter methods that evaluate features independently, wrapper methods train models on candidate feature subsets and select features that maximize the learner's performance.

Core Principle

Wrapper methods directly optimize for the learning algorithm's performance. The feature subset that works best with a specific learner (e.g., decision tree, SVM) may differ from the subset that works best with another learner. This "tailored" approach often yields better performance than filter methods.

Advantages

  • Optimized for specific learner
  • Considers feature interactions
  • Better generalization performance
  • Accounts for learner bias
  • Can find optimal subsets

Limitations

  • Computationally expensive
  • Requires multiple model trainings
  • Risk of overfitting
  • Learner-specific (not general)
  • Time-consuming for large datasets

Las Vegas Wrapper (LVW) Algorithm

LVW uses a random search strategy to explore feature subsets. It evaluates each candidate subset by training the learning algorithm and measuring its cross-validation error. The algorithm continues until no improvement is found for a specified number of consecutive iterations.

Input

Algorithm Inputs

  • Dataset $D$
  • Full feature set $A$
  • Learning algorithm $\mathfrak{L}$
  • Stopping parameter $T$ (max consecutive non-improvements)

Step 1

Initialize

Initialize best error and feature set:

$E = \infty$ (best error so far)
$A^* = A$ (best feature set, start with all features)
$t = 0$ (consecutive non-improvement counter)

Step 2

Random Subset Generation

While $t < T$:

  1. Randomly generate a candidate feature subset $A'$
  2. Use cross-validation to evaluate $A'$ (see the sketch below):
    $E' = \text{CrossValidation}(D, A', \mathfrak{L})$
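
As a minimal sketch of this step (assuming NumPy arrays for the data and scikit-learn's `cross_val_score` as the cross-validation routine, neither of which is prescribed by the algorithm itself), candidate generation and evaluation might look like this:

```python
import random

from sklearn.model_selection import cross_val_score


def random_subset(n_features):
    """Randomly pick a non-empty candidate subset A' of feature indices."""
    size = random.randint(1, n_features)
    return sorted(random.sample(range(n_features), size))


def cv_error(learner, X, y, subset, k=5):
    """Estimate E' = CrossValidation(D, A', L) as 1 minus the mean k-fold accuracy."""
    scores = cross_val_score(learner, X[:, subset], y, cv=k)
    return 1.0 - scores.mean()
```
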
Step 3

Update Best Subset

If $E' < E$, or if $E' = E$ and $|A'| < |A^*|$:

$E = E'$
$A^* = A'$
$t = 0$ (reset counter)

Otherwise: $t = t + 1$ (increment counter)

Output

Return Best Subset

Output the optimal feature subset $A^*$ with error $E$.
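
Putting the steps together, here is a minimal Python sketch of the LVW loop. It reuses the `random_subset` and `cv_error` helpers sketched above; the function name `lvw` and its interface are illustrative choices rather than a standard API. Following the worked example later in this module, it initializes $E$ with the cross-validation error of the full feature set (the pseudocode above equivalently starts from $E = \infty$).

```python
def lvw(X, y, learner, T=10, k=5):
    """Las Vegas Wrapper: random search over feature subsets.

    X, y    : dataset D (feature matrix and labels)
    learner : learning algorithm L (a scikit-learn-style estimator)
    T       : stop after T consecutive non-improving iterations
    k       : number of cross-validation folds
    Returns the best subset A* (list of column indices) and its CV error E.
    """
    n_features = X.shape[1]
    best_subset = list(range(n_features))                  # A* = A (all features)
    best_error = cv_error(learner, X, y, best_subset, k)   # E
    t = 0                                                  # non-improvement counter

    while t < T:
        candidate = random_subset(n_features)              # A'
        error = cv_error(learner, X, y, candidate, k)      # E'
        # Accept if strictly better, or equally good but smaller (parsimony)
        if error < best_error or (error == best_error and
                                  len(candidate) < len(best_subset)):
            best_error, best_subset, t = error, candidate, 0
        else:
            t += 1

    return best_subset, best_error
```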

Key Design Choices

  • Random Search: Avoids getting stuck in local optima, but may require many iterations to find good solutions.
  • Tie-Breaking: When errors are equal, prefer smaller feature subsets (parsimony principle).
  • Stopping Criterion: Stop after $T$ consecutive iterations without improvement. This balances exploration time with convergence.
  • Cross-Validation: Uses cross-validation error to get reliable performance estimates and avoid overfitting to the training set.

Cross-Validation Evaluation

LVW uses cross-validation to evaluate feature subsets. This provides a reliable estimate of generalization performance and reduces overfitting risk.

k-Fold Cross-Validation Process

  1. Split dataset $D$ into $k$ folds (typically $k = 5$ or $k = 10$)
  2. For each fold $i = 1, 2, \ldots, k$:
    • Use fold $i$ as validation set
    • Use remaining $k - 1$ folds as training set
    • Train learner $\mathfrak{L}$ on training set with features $A'$
    • Evaluate error $E_i$ on validation set
  3. Average errors: $E' = \frac{1}{k}\sum_{i=1}^{k} E_i$ (see the sketch after this list)
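
For readers who want the fold loop spelled out rather than delegated to a library helper, the following sketch uses scikit-learn's `KFold` splitter; treating error as 1 minus accuracy is an assumption here (any loss suited to the learner would work):

```python
import numpy as np

from sklearn.base import clone
from sklearn.model_selection import KFold


def kfold_error(learner, X, y, subset, k=5):
    """Average validation error of `learner` using only the columns in `subset`."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = []
    for train_idx, val_idx in kf.split(X):
        model = clone(learner)                           # fresh model per fold
        model.fit(X[train_idx][:, subset], y[train_idx])
        acc = model.score(X[val_idx][:, subset], y[val_idx])
        errors.append(1.0 - acc)                         # E_i = 1 - accuracy
    return float(np.mean(errors))                        # E' = (1/k) * sum(E_i)
```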

Why Cross-Validation?

  • Reliable Estimates: Cross-validation provides unbiased estimates of generalization error by testing on held-out data.
  • Reduces Overfitting: By evaluating on multiple validation sets, we avoid selecting features that only work well on a single train-test split.
  • Better Feature Selection: Features selected based on cross-validation error tend to generalize better to new data.

Practical Example: Credit Scoring

Consider a credit scoring dataset with features: Income, Age, Employment, Credit Score, Debt Ratio, Loan Amount, and Loan History. We want to select features for a decision tree classifier to predict loan default.

Initialization

Start with all 7 features. Train decision tree with 5-fold cross-validation: $E = 0.18$ (18% error rate).

Iteration 1: Random Subset {Income, Credit Score, Debt Ratio}

Train decision tree with 3 features, 5-fold CV: $E' = 0.15$

$E' < E$ → Update: $E = 0.15$, $A^* = \{\text{Income, Credit Score, Debt Ratio}\}$, $t = 0$

Iterations 2-5: No Improvement

Random subsets {Age, Employment}, {Loan Amount}, etc. all have $E' \geq 0.15$. Counter increments: $t = 1, 2, 3, 4$

Iteration 6: Better Subset {Income, Credit Score, Debt Ratio, Loan History}

Train with 4 features, 5-fold CV: $E' = 0.12$

$E' < E$ → Update: $E = 0.12$, $A^* = \{\text{Income, Credit Score, Debt Ratio, Loan History}\}$, $t = 0$

Final Result

After $T = 10$ consecutive non-improvements, the algorithm stops. Optimal subset: {Income, Credit Score, Debt Ratio, Loan History} with a 12% error rate, reducing the feature count from 7 to 4 while improving performance.
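
As a usage sketch, the `lvw` function from the algorithm section could be run on a credit-scoring-style dataset like this one; the synthetic data, feature names, and decision tree settings below are illustrative assumptions, not the dataset used in the example above:

```python
import numpy as np

from sklearn.tree import DecisionTreeClassifier

feature_names = ["Income", "Age", "Employment", "Credit Score",
                 "Debt Ratio", "Loan Amount", "Loan History"]

# Synthetic stand-in for the credit-scoring data (replace with the real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] - X[:, 3] + 0.5 * X[:, 4]
     + rng.normal(scale=0.5, size=500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
subset, error = lvw(X, y, tree, T=10, k=5)
print("Selected features:", [feature_names[i] for i in subset])
print("Cross-validation error:", round(error, 3))
```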

Wrapper vs Filter Methods

Understanding the trade-offs helps choose the right method:

| Aspect | Filter Methods | Wrapper Methods |
| --- | --- | --- |
| Speed | Very fast (linear time) | Slow (requires model training) |
| Performance | Good, general-purpose | Better, learner-specific |
| Feature Interactions | Often ignored | Automatically considered |
| Overfitting Risk | Low | Higher (mitigated by CV) |
| Scalability | Excellent | Limited by training time |
| Use Case | Initial screening, large datasets | Final optimization, small-to-medium datasets |

Key Takeaways

Wrapper methods optimize feature selection for specific learning algorithms, often achieving better performance than filter methods.

LVW uses random subset search to avoid local optima, evaluating subsets with cross-validation for reliable performance estimates.

The algorithm stops after $T$ consecutive iterations without improvement, balancing exploration and convergence.

Cross-validation provides unbiased error estimates and reduces overfitting risk in feature selection.

Wrapper methods are computationally expensive but valuable when performance is critical and dataset size is manageable.

The tie-breaking rule (prefer smaller subsets when errors are equal) implements the parsimony principle.