Master the fundamental framework of feature selection. Learn how to search for optimal feature subsets and evaluate their quality using information-theoretic measures.
Feature selection is the process of selecting a subset of relevant features from the original feature set for use in model construction. The goal is to identify and remove irrelevant and redundant features that do not contribute to the learning task.
With $d$ features, there are $2^d$ possible feature subsets. Exhaustive search is computationally infeasible for large $d$. Therefore, we need efficient search strategies combined with evaluation criteria.
Understanding feature types helps us design appropriate selection strategies:
Relevant features: Features that are useful for the current learning task. For example, in predicting house prices, features like square footage, number of bedrooms, and location are relevant.
Example: In credit approval, income, credit score, and employment status are relevant features for predicting loan approval.
Irrelevant features: Features that have no relationship with the learning task. They should be removed because they add noise and increase computational cost.
Example: In house price prediction, the color of the front door is typically irrelevant (unless it's a specific market preference).
Redundant features: Features whose information can be inferred from other features. While not necessarily harmful, they increase dimensionality without adding value.
Example: If a dataset contains "living area", "basement area", and "total square footage", and the total always equals living area plus basement area, then the total is redundant given the other two.
Feature selection consists of two essential components that work together:
Subset search: a strategy to generate candidate feature subsets. Since exhaustive search is infeasible, we use greedy strategies that incrementally build or reduce feature sets.
Goal: Efficiently explore the space of possible subsets.
Subset evaluation: a criterion to assess the quality of candidate feature subsets. The evaluation measures how well the subset distinguishes between different classes or predicts the target.
Goal: Quantify how well features align with sample labels.
The combination of search strategy and evaluation criterion determines the feature selection method. Different combinations lead to filter methods, wrapper methods, and embedded methods, each with different trade-offs between efficiency and performance.
Greedy search strategies that efficiently explore the feature subset space:
Forward search: Start with an empty feature set and iteratively add the feature that most improves the evaluation criterion.
Algorithm: start from the empty set; at each step, evaluate adding every unselected feature and keep the single addition that improves the criterion most; stop when no addition improves the criterion.
Advantage: Focuses on adding useful features. Good when most features are irrelevant.
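To make the procedure concrete, here is a minimal sketch of forward search in Python. It assumes a hypothetical `evaluate` callback that scores a candidate subset (higher is better), such as information gain or cross-validated accuracy; the names `forward_search` and `toy_score` are illustrative, not any particular library's API.

```python
def forward_search(features, evaluate):
    """Greedy forward search: grow the subset while the criterion keeps improving."""
    selected = []                       # start with the empty subset
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:         # stop when no addition improves the criterion
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy usage: a made-up criterion that rewards "income" and "credit_score"
# and slightly penalizes subset size.
toy_score = lambda s: len(set(s) & {"income", "credit_score"}) - 0.01 * len(s)
print(forward_search(["income", "age", "credit_score", "loan_amount"], toy_score))
# -> (['income', 'credit_score'], 1.98)
```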
Backward search: Start with the full feature set and iteratively remove the feature whose removal least degrades the evaluation criterion.
Algorithm: start from the full set; at each step, evaluate removing every currently selected feature and drop the one whose removal degrades the criterion least (or improves it most); stop when every removal makes the criterion worse.
Advantage: Better at handling feature interactions. Good when most features are relevant.
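A matching sketch of backward search, under the same assumption of a hypothetical `evaluate` callback (higher is better). The stopping rule used here (stop when every removal strictly degrades the score) is one simple choice among several.

```python
def backward_search(features, evaluate):
    """Greedy backward search: shrink the subset while the criterion does not degrade."""
    selected = list(features)           # start with the full subset
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score every one-feature reduction of the current subset.
        scored = [(evaluate([g for g in selected if g != f]), f) for f in selected]
        score, feature = max(scored)
        if score < best_score:          # stop when every removal degrades the criterion
            break
        selected.remove(feature)
        best_score = score
    return selected, best_score
```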
Bidirectional search: Simultaneously perform forward and backward search, balancing the addition of useful features and the removal of redundant ones.
Algorithm: alternate a forward step with a backward step; in each round, add the most helpful unselected feature and then remove the least helpful selected feature, keeping only changes that improve the criterion; stop when neither step improves it.
Advantage: More flexible, can handle both irrelevant and redundant features effectively.
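A compact sketch of bidirectional search that alternates one add step with one remove step, again assuming the same hypothetical `evaluate` callback. Real implementations often add extra bookkeeping (for example, not immediately removing a feature that was just added), which is omitted here.

```python
def bidirectional_search(features, evaluate):
    """Alternate forward and backward steps, accepting only strict improvements."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    improved = True
    while improved:
        improved = False
        # Forward step: add the most helpful remaining feature if it improves the score.
        if remaining:
            score, feat = max((evaluate(selected + [f]), f) for f in remaining)
            if score > best_score:
                selected.append(feat)
                remaining.remove(feat)
                best_score, improved = score, True
        # Backward step: drop the least useful selected feature if that improves the score.
        if len(selected) > 1:
            score, feat = max((evaluate([f for f in selected if f != cand]), cand)
                              for cand in selected)
            if score > best_score:
                selected.remove(feat)
                remaining.append(feat)
                best_score, improved = score, True
    return selected, best_score
```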
Evaluation criteria measure how well a feature subset distinguishes between different classes. The core idea is to compare the "true partition" (based on labels) with the "data partition" (based on features).
The entropy of dataset $D$ measures the uncertainty about sample labels:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

where $p_k$ is the proportion of samples belonging to class $k$ in $D$, and $|\mathcal{Y}|$ is the number of classes.
Interpretation: Lower entropy means less uncertainty (purer classes). Entropy is 0 when all samples belong to the same class, and reaches its maximum, $\log_2 |\mathcal{Y}|$, when the classes are equally distributed.
Information gain measures how much the entropy decreases after partitioning the data using feature subset $A$:

$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

where $D^v$ is the $v$-th subset obtained by partitioning $D$ on the values of $A$, and $V$ is the number of partitions.
Interpretation: Higher information gain means the feature subset better distinguishes between classes. A good feature subset should produce partitions with low entropy (high purity).
The evaluation criterion compares how well the feature subset's data partition aligns with the true label partition. The smaller the difference, the better the feature subset. Information gain quantifies this alignment: features that create partitions matching the class structure have high information gain.
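The two formulas above translate directly into code. Below is a minimal sketch, assuming categorical feature values and class labels given as plain Python lists; the helper names `entropy` and `information_gain` are illustrative rather than any particular library's API.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), where p_k are the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(A) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v), partitioning D by the feature's values."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(block) / n * entropy(block) for block in partitions.values())
    return entropy(labels) - remainder
```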
Consider a credit approval dataset with features Income, Age, Employment Status, Credit Score, and Loan Amount. We want to select features for predicting loan approval; the small table below shows three of these features for five applicants.
| Income | Age | Credit Score | Approved |
|---|---|---|---|
| High | 35 | 750 | Yes |
| High | 42 | 680 | Yes |
| Low | 28 | 600 | No |
| Medium | 50 | 720 | Yes |
| Low | 25 | 550 | No |
The dataset has 3 "Yes" and 2 "No" approvals, so $\mathrm{Ent}(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$ bits.
Partition by Income (High/Medium/Low):
• High: 2 samples, both "Yes" → Entropy = 0
• Medium: 1 sample, "Yes" → Entropy = 0
• Low: 2 samples, both "No" → Entropy = 0
Income perfectly separates classes → Maximum information gain!
Forward search would select "Income" first because it has the highest information gain. On this tiny table, Income alone already separates the classes perfectly, so the search would stop there; on a larger, noisier dataset, adding "Credit Score" might further improve the partition, while "Age" would likely add less value.
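The numbers in this example can be checked with a few lines of arithmetic. The snippet below recomputes them directly and is self-contained (it does not depend on the earlier sketches).

```python
from math import log2

# Whole dataset: 3 "Yes" and 2 "No" approvals.
ent_D = -(3/5) * log2(3/5) - (2/5) * log2(2/5)
print(round(ent_D, 3))        # 0.971 bits

# Partition by Income: High -> {Yes, Yes}, Medium -> {Yes}, Low -> {No, No}.
# Every block is pure, so each block's entropy is 0 and the weighted remainder is 0.
gain_income = ent_D - (2/5 * 0.0 + 1/5 * 0.0 + 2/5 * 0.0)
print(round(gain_income, 3))  # 0.971: Income attains the maximum possible gain on this table
```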
Feature selection requires both search strategy (how to explore subsets) and evaluation criterion (how to assess quality).
Greedy search strategies (forward, backward, bidirectional) efficiently explore the exponential search space.
Information gain measures how well features distinguish between classes by comparing data partitions with label partitions.
Forward search is good when most features are irrelevant; backward search handles feature interactions better.
The combination of search and evaluation forms the foundation for filter, wrapper, and embedded feature selection methods.
Feature selection alleviates the curse of dimensionality and improves model interpretability and generalization.