
Subset Search and Evaluation

Master the fundamental framework of feature selection. Learn how to search for optimal feature subsets and evaluate their quality using information-theoretic measures.

Module 1 of 8
Intermediate
90-120 min

What is Feature Selection?

Feature selection is the process of selecting a subset of relevant features from the original feature set for use in model construction. The goal is to identify and remove irrelevant and redundant features that do not contribute to the learning task.

Core Challenge

With $d$ features, there are $2^d$ possible feature subsets. Exhaustive search is computationally infeasible for large $d$. Therefore, we need efficient search strategies combined with evaluation criteria.

Benefits

  • Reduces overfitting
  • Improves model interpretability
  • Speeds up training
  • Reduces storage requirements
  • Alleviates the curse of dimensionality

Challenges

  • Combinatorial explosion
  • Need for efficient search
  • Evaluation criteria selection
  • Trade-off: quality vs. efficiency
  • Feature interactions

Types of Features

Understanding feature types helps us design appropriate selection strategies:

Relevant Features

Features that are useful for the current learning task. For example, in predicting house prices, features like square footage, number of bedrooms, and location are relevant.

Example: In credit approval, income, credit score, and employment status are relevant features for predicting loan approval.

Irrelevant Features

Features that have no relationship with the learning task. These should be removed as they add noise and increase computational cost.

Example: In house price prediction, the color of the front door is typically irrelevant (unless it's a specific market preference).

Redundant Features

Features whose information can be inferred from other features. They are not misleading the way irrelevant features are, but they increase dimensionality without adding value.

Example: If a dataset contains "total square footage" together with "living area" and "basement area", the total is redundant whenever it always equals the sum of the other two.

The Core Framework: Search + Evaluation

Feature selection consists of two essential components that work together:

1. Subset Search

A strategy to generate candidate feature subsets. Since exhaustive search is infeasible, we use greedy strategies that incrementally build or reduce feature sets.

Goal: Efficiently explore the space of $2^d$ possible subsets.

2. Subset Evaluation

A criterion to assess the quality of candidate feature subsets. The evaluation measures how well the subset distinguishes between different classes or predicts the target.

Goal: Quantify how well features align with sample labels.

Key Insight

The combination of search strategy and evaluation criterion determines the feature selection method. Different combinations lead to filter methods, wrapper methods, and embedded methods, each with different trade-offs between efficiency and performance.

Subset Search Strategies

Greedy search strategies that efficiently explore the feature subset space:

Strategy 1

Forward Search

Start with an empty feature set and iteratively add the feature that most improves the evaluation criterion.

Algorithm:

  1. Initialize: $A = \emptyset$
  2. For each iteration:
  3. Evaluate all features not in $A$
  4. Add the feature $f^*$ with the best evaluation
  5. Update: $A = A \cup \{f^*\}$
  6. Stop when no improvement or desired size reached

Advantage: Focuses on adding useful features. Good when most features are irrelevant.
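
A minimal Python sketch of this greedy loop, assuming a generic evaluate(subset) scoring function supplied by the caller (for example, the information gain of a candidate subset on a validation split); the function and parameter names are illustrative, not from any particular library.

```python
def forward_search(all_features, evaluate, max_size=None):
    """Greedy forward selection: start empty, repeatedly add the most helpful feature."""
    selected = []
    best_score = float("-inf")
    while max_size is None or len(selected) < max_size:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Score each subset formed by adding one remaining feature to the current set.
        score, best_feature = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:  # no single addition improves the evaluation
            break
        selected.append(best_feature)
        best_score = score
    return selected
```

The loop stops either when no remaining feature improves the score or when the optional max_size is reached, mirroring the two stopping conditions listed above.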

Strategy 2

Backward Search

Start with the full feature set and iteratively remove the feature whose removal least degrades the evaluation criterion.

Algorithm:

  1. Initialize: $A = \{f_1, f_2, \ldots, f_d\}$
  2. For each iteration:
  3. Evaluate removing each feature from $A$
  4. Remove the feature $f^*$ with the smallest degradation
  5. Update: $A = A \setminus \{f^*\}$
  6. Stop when degradation exceeds threshold or desired size reached

Advantage: Better at handling feature interactions. Good when most features are relevant.
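
A matching sketch of backward elimination under the same assumed evaluate(subset) scorer; the tolerance parameter, standing for an acceptable amount of degradation, is a hypothetical knob for this illustration.

```python
def backward_search(all_features, evaluate, min_size=1, tolerance=0.0):
    """Greedy backward elimination: start full, drop the feature whose removal hurts least."""
    selected = list(all_features)
    current_score = evaluate(selected)
    while len(selected) > min_size:
        # Score each subset formed by removing one feature from the current set.
        score, worst_feature = max(
            (evaluate([f for f in selected if f != g]), g) for g in selected
        )
        if current_score - score > tolerance:  # every removal degrades quality too much
            break
        selected.remove(worst_feature)
        current_score = score
    return selected
```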

Strategy 3

Bidirectional Search

Simultaneously perform forward and backward search, balancing the addition of useful features and removal of redundant ones.

Algorithm:

  1. Initialize: $A$ (can start empty or full)
  2. For each iteration:
  3. Try adding best candidate feature
  4. Try removing worst candidate feature
  5. Choose operation (add/remove) with better improvement
  6. Stop when no improvement possible

Advantage: More flexible, can handle both irrelevant and redundant features effectively.
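
One way to sketch bidirectional search is to consider, at each step, both the best single addition and the best single removal against the same assumed evaluate(subset) scorer and keep whichever move improves it more; this combined loop is a simplified illustration rather than a canonical implementation.

```python
def bidirectional_search(all_features, evaluate, start=None):
    """At each step, compare the best single add with the best single remove; keep the better move."""
    selected = list(start) if start is not None else []
    best_score = evaluate(selected) if selected else float("-inf")
    while True:
        moves = []
        for f in all_features:
            if f not in selected:                                   # candidate addition
                moves.append((evaluate(selected + [f]), "add", f))
        for f in selected:                                          # candidate removal
            moves.append((evaluate([g for g in selected if g != f]), "remove", f))
        if not moves:
            break
        score, op, feature = max(moves)
        if score <= best_score:  # neither adding nor removing improves the evaluation
            break
        selected = selected + [feature] if op == "add" else [g for g in selected if g != feature]
        best_score = score
    return selected
```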

Subset Evaluation Methods

Evaluation criteria measure how well a feature subset distinguishes between different classes. The core idea is to compare the "true partition" (based on labels) with the "data partition" (based on features).

Information Entropy

The entropy of dataset $D$ measures the uncertainty about sample labels:

$$Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

where $p_k$ is the proportion of samples belonging to class $k$ in $D$, and $|\mathcal{Y}|$ is the number of classes.

Interpretation: Lower entropy means less uncertainty (more pure classes). Entropy is 0 when all samples belong to the same class, and maximum when classes are equally distributed.
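
As a concrete illustration, the entropy formula takes only a few lines of Python; the entropy helper below is a minimal sketch (the name is ours, not from a library) and is reused in the later examples of this module.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D): uncertainty of the class labels in D, measured in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

print(entropy(["a", "a", "b", "b"]))  # 1.0: two equally likely classes, maximum uncertainty
print(entropy(["a", "a", "a", "a"]))  # -0.0, i.e. zero: a single pure class, no uncertainty
```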

Information Gain

Information gain measures how much the entropy decreases after partitioning the data using feature subset $A$:

$$Gain(A) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$

where $D^v$ is the $v$-th subset after partitioning by $A$, and $V$ is the number of partitions.

Interpretation: Higher information gain means the feature subset better distinguishes between classes. A good feature subset should produce partitions with low entropy (high purity).
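
Continuing the sketch, information gain for a single discrete feature follows directly by grouping the labels by feature value; the information_gain helper below reuses the entropy function defined above and assumes categorical feature values.

```python
from collections import defaultdict

def information_gain(feature_values, labels):
    """Gain(A) for one discrete feature: Ent(D) minus the weighted entropy of each partition D^v."""
    n = len(labels)
    partitions = defaultdict(list)  # feature value v -> labels of the subset D^v
    for value, label in zip(feature_values, labels):
        partitions[value].append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder  # entropy() is the helper sketched above
```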

Core Evaluation Principle

The evaluation criterion compares how well the feature subset's data partition aligns with the true label partition. The smaller the difference, the better the feature subset. Information gain quantifies this alignment: features that create partitions matching the class structure have high information gain.

Practical Example: Credit Approval

Consider a credit approval dataset with features Income, Age, Employment Status, Credit Score, and Loan Amount. We want to select features for predicting loan approval; the sample below shows three of these features.

Sample Dataset (5 samples)

Income    Age    Credit Score    Approved
High      35     750             Yes
High      42     680             Yes
Low       28     600             No
Medium    50     720             Yes
Low       25     550             No

Step 1: Calculate Base Entropy

Dataset has 3 "Yes" and 2 "No" approvals:

$$Ent(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$$

Step 2: Evaluate Feature "Income"

Partition by Income (High/Medium/Low):

• High: 2 samples, both "Yes" → Entropy = 0
• Medium: 1 sample, "Yes" → Entropy = 0
• Low: 2 samples, both "No" → Entropy = 0

$$Gain(\text{Income}) = 0.971 - \left(\frac{2}{5} \cdot 0 + \frac{1}{5} \cdot 0 + \frac{2}{5} \cdot 0\right) = 0.971$$

Income perfectly separates classes → Maximum information gain!
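
Steps 1 and 2 can be checked with the entropy and information_gain helpers sketched earlier in this module; the lists below simply transcribe the Approved and Income columns of the sample table.

```python
approved = ["Yes", "Yes", "No", "Yes", "No"]
income = ["High", "High", "Low", "Medium", "Low"]

print(round(entropy(approved), 3))                   # 0.971 -> Ent(D)
print(round(information_gain(income, approved), 3))  # 0.971 -> Gain(Income): every partition is pure
```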

Step 3: Forward Search Result

Forward search would select "Income" first due to its highest information gain. On this tiny sample, Income alone already produces pure partitions, so the search would stop there; on a larger dataset, adding "Credit Score" might further improve the partition, while "Age" might add less value.

Key Takeaways

Feature selection requires both search strategy (how to explore subsets) and evaluation criterion (how to assess quality).

Greedy search strategies (forward, backward, bidirectional) efficiently explore the exponential search space.

Information gain measures how well features distinguish between classes by comparing data partitions with label partitions.

Forward search is good when most features are irrelevant; backward search handles feature interactions better.

The combination of search and evaluation forms the foundation for filter, wrapper, and embedded feature selection methods.

Feature selection alleviates the curse of dimensionality and improves model interpretability and generalization.