Master the fundamental framework of feature selection. Learn how to search for optimal feature subsets and evaluate their quality using information-theoretic measures.
Feature selection is the process of selecting a subset of relevant features from the original feature set for use in model construction. The goal is to identify and remove irrelevant and redundant features that do not contribute to the learning task.
With $d$ features, there are $2^d$ possible feature subsets. Exhaustive search is computationally infeasible for large $d$. Therefore, we need efficient search strategies combined with evaluation criteria.
Understanding feature types helps us design appropriate selection strategies:
Relevant features: Features that are useful for the current learning task. For example, in predicting house prices, features like square footage, number of bedrooms, and location are relevant.
Example: In credit approval, income, credit score, and employment status are relevant features for predicting loan approval.
Irrelevant features: Features that have no relationship with the learning task. They should be removed because they add noise and increase computational cost.
Example: In house price prediction, the color of the front door is typically irrelevant (unless it's a specific market preference).
Redundant features: Features whose information can be inferred from other features. While not necessarily harmful, they increase dimensionality without adding value.
Example: If a dataset contains "living area", "basement area", and "total square footage", and the total always equals living area plus basement area, then the total is redundant given the other two.
Feature selection consists of two essential components that work together:
Subset search: a strategy to generate candidate feature subsets. Since exhaustive search is infeasible, we use greedy strategies that incrementally build or reduce feature sets.
Goal: Efficiently explore the space of possible subsets.
Subset evaluation: a criterion to assess the quality of candidate feature subsets. The evaluation measures how well the subset distinguishes between different classes or predicts the target.
Goal: Quantify how well features align with sample labels.
The combination of search strategy and evaluation criterion determines the feature selection method. Different combinations lead to filter methods, wrapper methods, and embedded methods, each with different trade-offs between efficiency and performance.
Greedy search strategies that efficiently explore the feature subset space:
Forward search: Start with an empty feature set and iteratively add the feature that most improves the evaluation criterion.
Algorithm: start from the empty set; at each step, evaluate adding every unselected feature and keep the single addition that improves the criterion most; stop when no addition improves the criterion.
Advantage: Focuses on adding useful features. Good when most features are irrelevant.
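To make the procedure concrete, here is a minimal sketch of forward search in Python. It assumes a hypothetical `evaluate` callback that scores a candidate subset (higher is better), such as information gain or cross-validated accuracy; the names `forward_search` and `toy_score` are illustrative, not any particular library's API.

```python
def forward_search(features, evaluate):
    """Greedy forward search: grow the subset while the criterion keeps improving."""
    selected = []                       # start with the empty subset
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:         # stop when no addition improves the criterion
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy usage: a made-up criterion that rewards "income" and "credit_score"
# and slightly penalizes subset size.
toy_score = lambda s: len(set(s) & {"income", "credit_score"}) - 0.01 * len(s)
print(forward_search(["income", "age", "credit_score", "loan_amount"], toy_score))
# -> (['income', 'credit_score'], 1.98)
```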
Backward search: Start with the full feature set and iteratively remove the feature whose removal least degrades the evaluation criterion.
Algorithm: start from the full set; at each step, evaluate removing every currently selected feature and drop the one whose removal degrades the criterion least (or improves it most); stop when every removal makes the criterion worse.
Advantage: Better at handling feature interactions. Good when most features are relevant.
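A matching sketch of backward search, under the same assumption of a hypothetical `evaluate` callback (higher is better). The stopping rule used here (stop when every removal strictly degrades the score) is one simple choice among several.

```python
def backward_search(features, evaluate):
    """Greedy backward search: shrink the subset while the criterion does not degrade."""
    selected = list(features)           # start with the full subset
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score every one-feature reduction of the current subset.
        scored = [(evaluate([g for g in selected if g != f]), f) for f in selected]
        score, feature = max(scored)
        if score < best_score:          # stop when every removal degrades the criterion
            break
        selected.remove(feature)
        best_score = score
    return selected, best_score
```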
Bidirectional search: Simultaneously perform forward and backward search, balancing the addition of useful features and the removal of redundant ones.
Algorithm: alternate a forward step with a backward step; in each round, add the most helpful unselected feature and then remove the least helpful selected feature, keeping only changes that improve the criterion; stop when neither step improves it.
Advantage: More flexible, can handle both irrelevant and redundant features effectively.
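A compact sketch of bidirectional search that alternates one add step with one remove step, again assuming the same hypothetical `evaluate` callback. Real implementations often add extra bookkeeping (for example, not immediately removing a feature that was just added), which is omitted here.

```python
def bidirectional_search(features, evaluate):
    """Alternate forward and backward steps, accepting only strict improvements."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    improved = True
    while improved:
        improved = False
        # Forward step: add the most helpful remaining feature if it improves the score.
        if remaining:
            score, feat = max((evaluate(selected + [f]), f) for f in remaining)
            if score > best_score:
                selected.append(feat)
                remaining.remove(feat)
                best_score, improved = score, True
        # Backward step: drop the least useful selected feature if that improves the score.
        if len(selected) > 1:
            score, feat = max((evaluate([f for f in selected if f != cand]), cand)
                              for cand in selected)
            if score > best_score:
                selected.remove(feat)
                remaining.append(feat)
                best_score, improved = score, True
    return selected, best_score
```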
Evaluation criteria measure how well a feature subset distinguishes between different classes. The core idea is to compare the "true partition" (based on labels) with the "data partition" (based on features).
The entropy of dataset $D$ measures the uncertainty about sample labels:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

where $p_k$ is the proportion of samples belonging to class $k$ in $D$, and $|\mathcal{Y}|$ is the number of classes.
Interpretation: Lower entropy means less uncertainty (purer classes). Entropy is 0 when all samples belong to the same class, and reaches its maximum, $\log_2 |\mathcal{Y}|$, when the classes are equally distributed.
Information gain measures how much the entropy decreases after partitioning the data using feature subset $A$:

$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

where $D^v$ is the $v$-th subset obtained by partitioning $D$ on the values of $A$, and $V$ is the number of partitions.
Interpretation: Higher information gain means the feature subset better distinguishes between classes. A good feature subset should produce partitions with low entropy (high purity).
The evaluation criterion compares how well the feature subset's data partition aligns with the true label partition. The smaller the difference, the better the feature subset. Information gain quantifies this alignment: features that create partitions matching the class structure have high information gain.
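The two formulas above translate directly into code. Below is a minimal sketch, assuming categorical feature values and class labels given as plain Python lists; the helper names `entropy` and `information_gain` are illustrative rather than any particular library's API.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), where p_k are the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(A) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v), partitioning D by the feature's values."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(block) / n * entropy(block) for block in partitions.values())
    return entropy(labels) - remainder
```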
Consider a credit approval dataset with features Income, Age, Employment Status, Credit Score, and Loan Amount. We want to select features for predicting loan approval; the small table below shows three of these features for five applicants.
| Income | Age | Credit Score | Approved |
|---|---|---|---|
| High | 35 | 750 | Yes |
| High | 42 | 680 | Yes |
| Low | 28 | 600 | No |
| Medium | 50 | 720 | Yes |
| Low | 25 | 550 | No |
The dataset has 3 "Yes" and 2 "No" approvals, so $\mathrm{Ent}(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$ bits.
Partition by Income (High/Medium/Low):
• High: 2 samples, both "Yes" → Entropy = 0
• Medium: 1 sample, "Yes" → Entropy = 0
• Low: 2 samples, both "No" → Entropy = 0
Income perfectly separates classes → Maximum information gain!
Forward search would select "Income" first because it has the highest information gain. On this tiny table, Income alone already separates the classes perfectly, so the search would stop there; on a larger, noisier dataset, adding "Credit Score" might further improve the partition, while "Age" would likely add less value.
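The numbers in this example can be checked with a few lines of arithmetic. The snippet below recomputes them directly and is self-contained (it does not depend on the earlier sketches).

```python
from math import log2

# Whole dataset: 3 "Yes" and 2 "No" approvals.
ent_D = -(3/5) * log2(3/5) - (2/5) * log2(2/5)
print(round(ent_D, 3))        # 0.971 bits

# Partition by Income: High -> {Yes, Yes}, Medium -> {Yes}, Low -> {No, No}.
# Every block is pure, so each block's entropy is 0 and the weighted remainder is 0.
gain_income = ent_D - (2/5 * 0.0 + 1/5 * 0.0 + 2/5 * 0.0)
print(round(gain_income, 3))  # 0.971: Income attains the maximum possible gain on this table
```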
Feature selection requires both search strategy (how to explore subsets) and evaluation criterion (how to assess quality).
Greedy search strategies (forward, backward, bidirectional) efficiently explore the exponential search space.
Information gain measures how well features distinguish between classes by comparing data partitions with label partitions.
Forward search is good when most features are irrelevant; backward search handles feature interactions better.
The combination of search and evaluation forms the foundation for filter, wrapper, and embedded feature selection methods.
Feature selection alleviates the curse of dimensionality and improves model interpretability and generalization.