
Filter Methods: Relief and ReliefF

Master filter-based feature selection that operates independently of learning algorithms. Learn Relief for binary classification and ReliefF for multi-class problems using near-hit and near-miss concepts.

Module 2 of 8
Intermediate
100-120 min

What are Filter Methods?

Filter methods perform feature selection as a preprocessing step, independent of any learning algorithm. They evaluate features based on intrinsic properties of the data, such as correlation with the target variable or information content, without training a model.
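As a minimal sketch of this idea (the function name and the correlation-based score are illustrative choices, not part of any specific method): score each feature with a model-free statistic, then keep the top k.

```python
import numpy as np

def filter_select(X, y, k):
    """Generic filter selection: score each feature without training a model
    (here, absolute Pearson correlation with a numeric target) and keep the top k."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring features
```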

Key Characteristics

  • Fast: No model training required, only data statistics
  • General: Selected features work with any learning algorithm
  • Scalable: Linear time complexity in number of samples and features
  • Limitation: May not select features optimal for a specific learning algorithm

Advantages

  • Very fast computation
  • No overfitting risk
  • Works with any learner
  • Good for initial screening
  • Low computational cost

Limitations

  • Ignores feature interactions
  • Not tailored to specific learner
  • May miss optimal features
  • Assumes feature independence
  • Less accurate than wrapper methods

Relief Algorithm (Binary Classification)

Relief is a filter method designed for binary classification. It evaluates features by how well they distinguish between samples from different classes.

Key Concepts

Near-Hit (nh)

For a sample x_i, the near-hit x_{i,nh} is the nearest sample from the same class.

Intuition: A good feature should have similar values for samples in the same class.

Near-Miss (nm)

For a sample x_i, the near-miss x_{i,nm} is the nearest sample from a different class.

Intuition: A good feature should have different values for samples from different classes.
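A minimal sketch of how both neighbors could be found, assuming Euclidean distance on normalized features (the helper name is illustrative):

```python
import numpy as np

def near_hit_and_miss(X, y, i):
    """Return indices of the near-hit (nearest same-class sample) and the
    near-miss (nearest different-class sample) of sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every sample
    dists[i] = np.inf                          # never match the sample with itself
    same  = np.where(y == y[i])[0]
    other = np.where(y != y[i])[0]
    nh = same[np.argmin(dists[same])]          # near-hit
    nm = other[np.argmin(dists[other])]        # near-miss
    return nh, nm
```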

Difference Function

The difference function diff(x_a^j, x_b^j) measures how different two samples a and b are on feature j:

For discrete/categorical features:

diff(x_a^j, x_b^j) = \begin{cases} 0 & \text{if } x_a^j = x_b^j \\ 1 & \text{if } x_a^j \neq x_b^j \end{cases}

For continuous features (normalized to [0,1]):

diff(x_a^j, x_b^j) = |x_a^j - x_b^j|
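Both cases translate directly into code (a sketch; the flag name is illustrative):

```python
def diff(x_a_j, x_b_j, discrete=False):
    """Relief's per-feature difference: 0/1 match for discrete features,
    absolute difference for continuous features normalized to [0, 1]."""
    if discrete:
        return 0.0 if x_a_j == x_b_j else 1.0
    return abs(x_a_j - x_b_j)
```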

Relevance Statistics

For each feature j, Relief computes a relevance statistic \delta^j:

\delta^j = \sum_i \left[ -diff(x_i^j, x_{i,nh}^j)^2 + diff(x_i^j, x_{i,nm}^j)^2 \right]

where the sum is over all samples i in the dataset.

Interpretation: Higher \delta^j means feature j is more relevant. A feature is good if:

  • Same-class samples are similar (small near-hit difference)
  • Different-class samples are different (large near-miss difference)

Relief Algorithm Steps

The complete Relief algorithm for binary classification:

Step 1

Initialize

Initialize relevance statistics: \delta^j = 0 for all features j = 1, 2, \ldots, d.

Step 2

Sample Iteration

For t = 1, 2, \ldots, T iterations (typically T = m, where m is the number of samples):

  1. Randomly select a sample x_i
  2. Find the near-hit x_{i,nh} (nearest same-class sample)
  3. Find the near-miss x_{i,nm} (nearest different-class sample)
  4. For each feature j, update:
    \delta^j \leftarrow \delta^j - diff(x_i^j, x_{i,nh}^j)^2 + diff(x_i^j, x_{i,nm}^j)^2

Step 3

Feature Ranking

Rank features by \delta^j in descending order. Select the top k features, or all features with \delta^j above a threshold.
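Put together, the three steps fit in a short function. A sketch assuming continuous features normalized to [0, 1] and Euclidean nearest neighbors (so diff is a plain per-feature difference):

```python
import numpy as np

def relief(X, y, T=None, seed=None):
    """Relief for binary classification. Returns one relevance statistic
    delta[j] per feature; rank features by delta in descending order."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    T = m if T is None else T
    delta = np.zeros(d)                                # Step 1: initialize
    for _ in range(T):                                 # Step 2: sample iteration
        i = rng.integers(m)                            # randomly selected sample x_i
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                              # exclude the sample itself
        same, other = np.where(y == y[i])[0], np.where(y != y[i])[0]
        nh = same[np.argmin(dists[same])]              # near-hit
        nm = other[np.argmin(dists[other])]            # near-miss
        delta += -(X[i] - X[nh]) ** 2 + (X[i] - X[nm]) ** 2   # per-feature update
    return delta                                       # Step 3: rank / threshold

# Example: keep the top-k features
# top_k = np.argsort(relief(X, y))[::-1][:k]
```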

ReliefF: Extension to Multi-Class

ReliefF extends Relief to handle multi-class classification problems. Instead of a single near-miss, it considers near-misses from each different class.

Key Modification

For a sample x_i with class k, ReliefF finds:

  • Near-hit: x_{i,nh} (nearest same-class sample)
  • Near-misses: for each class l \neq k, find x_{i,l,nm} (nearest sample from class l)

ReliefF Relevance Statistics

The relevance statistic for feature j:

\delta^j = \sum_i \left[ -diff(x_i^j, x_{i,nh}^j)^2 + \sum_{l \neq k} p_l \cdot diff(x_i^j, x_{i,l,nm}^j)^2 \right]

where p_l is the proportion of samples belonging to class l in the dataset, and k is the class of sample x_i.

Interpretation: The weighted sum ensures that classes with more samples contribute more to the relevance statistic. This handles class imbalance naturally.
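A sketch of the per-sample ReliefF update under the same assumptions as before (continuous features in [0, 1], Euclidean neighbors), with p_l taken as the empirical class proportion:

```python
import numpy as np

def relieff_update(X, y, i, delta):
    """One ReliefF iteration for sample i: subtract the squared near-hit
    differences, add the class-proportion-weighted squared near-miss
    differences from every other class."""
    m = len(y)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                                      # exclude the sample itself
    same = np.where(y == y[i])[0]
    nh = same[np.argmin(dists[same])]                      # near-hit
    delta -= (X[i] - X[nh]) ** 2
    for l in np.unique(y):
        if l == y[i]:
            continue
        cand = np.where(y == l)[0]
        nm_l = cand[np.argmin(dists[cand])]                # near-miss from class l
        p_l = len(cand) / m                                # empirical class proportion
        delta += p_l * (X[i] - X[nm_l]) ** 2
    return delta
```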

Practical Example: Medical Diagnosis

Consider a medical diagnosis dataset with features: Age, Blood Pressure, Cholesterol, and Glucose. We want to predict disease presence (Yes/No).

Sample Dataset (5 samples, normalized features)

Sample    Age    BP     Cholesterol    Glucose    Disease
1         0.6    0.8    0.7            0.9        Yes
2         0.5    0.7    0.6            0.8        Yes
3         0.3    0.2    0.3            0.2        No
4         0.4    0.3    0.4            0.3        No
5         0.7    0.9    0.8            0.95       Yes

Step 1: Process Sample 1 (Disease=Yes)

• Near-hit: Sample 5 (same class, closest to Sample 1)
• Near-miss: Sample 4 (different class, closest to Sample 1)

For the Glucose feature: diff(0.9, 0.95)^2 = 0.0025 (near-hit), diff(0.9, 0.3)^2 = 0.36 (near-miss)
Update: \delta^{Glucose} += -0.0025 + 0.36 = 0.3575

Step 2: After Processing All Samples

Glucose accumulates the highest relevance statistic because it best distinguishes between disease and no-disease cases. Blood Pressure also scores highly, Cholesterol moderately, while Age is the least discriminative.

Step 3: Feature Selection

Select the top 2-3 features based on their \delta^j values. This reduces dimensionality while preserving the most discriminative information.
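For reference, this worked example can be reproduced with a short script (a sketch using a deterministic pass over every sample, i.e. T = m, and Euclidean nearest neighbors); Glucose comes out with the largest \delta^j, followed by Blood Pressure:

```python
import numpy as np

# Columns: Age, BP, Cholesterol, Glucose (normalized); labels: 1 = Yes, 0 = No
X = np.array([[0.6, 0.8, 0.7, 0.90],
              [0.5, 0.7, 0.6, 0.80],
              [0.3, 0.2, 0.3, 0.20],
              [0.4, 0.3, 0.4, 0.30],
              [0.7, 0.9, 0.8, 0.95]])
y = np.array([1, 1, 0, 0, 1])
names = ["Age", "BP", "Cholesterol", "Glucose"]

delta = np.zeros(X.shape[1])
for i in range(len(X)):                          # one pass over every sample (T = m)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                            # exclude the sample itself
    same, other = np.where(y == y[i])[0], np.where(y != y[i])[0]
    nh = same[np.argmin(dists[same])]            # near-hit
    nm = other[np.argmin(dists[other])]          # near-miss
    delta += -(X[i] - X[nh]) ** 2 + (X[i] - X[nm]) ** 2

for j in np.argsort(delta)[::-1]:                # descending relevance
    print(f"{names[j]:12s} delta = {delta[j]:.3f}")
```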

Computational Efficiency

Relief and ReliefF are highly efficient filter methods:

Time Complexity

O(T \cdot m \cdot d)

where:

  • T: number of iterations (typically T = m)
  • m: number of samples
  • d: number of features

Key Advantage: Linear in all dimensions! This makes Relief/ReliefF extremely fast for large-scale datasets, much faster than wrapper methods that require training models for each candidate subset.

Key Takeaways

Filter methods evaluate features independently of learning algorithms, making them fast and general-purpose.

Relief uses near-hit and near-miss concepts to measure feature relevance: good features have similar values for same-class samples and different values for different-class samples.

ReliefF extends Relief to multi-class problems by considering near-misses from each different class, weighted by class proportions.

The relevance statistic \delta^j quantifies feature importance: higher values indicate more discriminative features.

Relief/ReliefF have linear time complexity O(T \cdot m \cdot d), making them highly efficient for large-scale feature selection.

Filter methods are ideal for initial feature screening before applying more expensive wrapper or embedded methods.