
Filter Methods: Relief and ReliefF

Master filter-based feature selection that operates independently of learning algorithms. Learn Relief for binary classification and ReliefF for multi-class problems using near-hit and near-miss concepts.

Module 2 of 8
Intermediate
100-120 min

What are Filter Methods?

Filter methods perform feature selection as a preprocessing step, independent of any learning algorithm. They evaluate features based on intrinsic properties of the data, such as correlation with the target variable or information content, without training a model.
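As a minimal sketch of this idea (the function name and the correlation-based score are illustrative choices, not part of any specific method): score each feature with a model-free statistic, then keep the top k.

```python
import numpy as np

def filter_select(X, y, k):
    """Generic filter selection: score each feature without training a model
    (here, absolute Pearson correlation with a numeric target) and keep the top k."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring features
```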

Key Characteristics

  • Fast: No model training required, only data statistics
  • General: Selected features work with any learning algorithm
  • Scalable: Linear time complexity in number of samples and features
  • Limitation: May not select features optimal for a specific learning algorithm

Advantages

  • Very fast computation
  • No overfitting risk
  • Works with any learner
  • Good for initial screening
  • Low computational cost

Limitations

  • Ignores feature interactions
  • Not tailored to specific learner
  • May miss optimal features
  • Assumes feature independence
  • Less accurate than wrapper methods

Relief Algorithm (Binary Classification)

Relief is a filter method designed for binary classification. It evaluates features by how well they distinguish between samples from different classes.

Key Concepts

Near-Hit (nh)

For a sample x_i, the near-hit x_{i,nh} is the nearest sample from the same class.

Intuition: A good feature should have similar values for samples in the same class.

Near-Miss (nm)

For a sample x_i, the near-miss x_{i,nm} is the nearest sample from a different class.

Intuition: A good feature should have different values for samples from different classes.
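A minimal sketch of how both neighbors could be found, assuming Euclidean distance on normalized features (the helper name is illustrative):

```python
import numpy as np

def near_hit_and_miss(X, y, i):
    """Return indices of the near-hit (nearest same-class sample) and the
    near-miss (nearest different-class sample) of sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every sample
    dists[i] = np.inf                          # never match the sample with itself
    same  = np.where(y == y[i])[0]
    other = np.where(y != y[i])[0]
    nh = same[np.argmin(dists[same])]          # near-hit
    nm = other[np.argmin(dists[other])]        # near-miss
    return nh, nm
```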

Difference Function

The difference function diff(x_a^j, x_b^j) measures how different two samples a and b are on feature j:

For discrete/categorical features:

diff(x_a^j, x_b^j) = \begin{cases} 0 & \text{if } x_a^j = x_b^j \\ 1 & \text{if } x_a^j \neq x_b^j \end{cases}

For continuous features (normalized to [0,1]):

diff(x_a^j, x_b^j) = |x_a^j - x_b^j|
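Both cases translate directly into code (a sketch; the flag name is illustrative):

```python
def diff(x_a_j, x_b_j, discrete=False):
    """Relief's per-feature difference: 0/1 match for discrete features,
    absolute difference for continuous features normalized to [0, 1]."""
    if discrete:
        return 0.0 if x_a_j == x_b_j else 1.0
    return abs(x_a_j - x_b_j)
```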

Relevance Statistics

For each feature j, Relief computes a relevance statistic \delta^j:

\delta^j = \sum_i \left[ -diff(x_i^j, x_{i,nh}^j)^2 + diff(x_i^j, x_{i,nm}^j)^2 \right]

where the sum is over all samples i in the dataset.

Interpretation: Higher \delta^j means feature j is more relevant. A feature is good if:

  • Same-class samples are similar (small near-hit difference)
  • Different-class samples are different (large near-miss difference)

Relief Algorithm Steps

The complete Relief algorithm for binary classification:

Step 1

Initialize

Initialize relevance statistics: \delta^j = 0 for all features j = 1, 2, \ldots, d.

Step 2

Sample Iteration

For t = 1, 2, \ldots, T iterations (typically T = m, where m is the number of samples):

  1. Randomly select a sample x_i
  2. Find the near-hit x_{i,nh} (nearest same-class sample)
  3. Find the near-miss x_{i,nm} (nearest different-class sample)
  4. For each feature j, update:
    \delta^j \leftarrow \delta^j - diff(x_i^j, x_{i,nh}^j)^2 + diff(x_i^j, x_{i,nm}^j)^2

Step 3

Feature Ranking

Rank features by \delta^j in descending order. Select the top k features, or all features with \delta^j above a threshold.
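Put together, the three steps fit in a short function. A sketch assuming continuous features normalized to [0, 1] and Euclidean nearest neighbors (so diff is a plain per-feature difference):

```python
import numpy as np

def relief(X, y, T=None, seed=None):
    """Relief for binary classification. Returns one relevance statistic
    delta[j] per feature; rank features by delta in descending order."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    T = m if T is None else T
    delta = np.zeros(d)                                # Step 1: initialize
    for _ in range(T):                                 # Step 2: sample iteration
        i = rng.integers(m)                            # randomly selected sample x_i
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                              # exclude the sample itself
        same, other = np.where(y == y[i])[0], np.where(y != y[i])[0]
        nh = same[np.argmin(dists[same])]              # near-hit
        nm = other[np.argmin(dists[other])]            # near-miss
        delta += -(X[i] - X[nh]) ** 2 + (X[i] - X[nm]) ** 2   # per-feature update
    return delta                                       # Step 3: rank / threshold

# Example: keep the top-k features
# top_k = np.argsort(relief(X, y))[::-1][:k]
```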

ReliefF: Extension to Multi-Class

ReliefF extends Relief to handle multi-class classification problems. Instead of a single near-miss, it considers near-misses from each different class.

Key Modification

For a sample x_i with class k, ReliefF finds:

  • Near-hit: x_{i,nh} (nearest same-class sample)
  • Near-misses: for each class l \neq k, find x_{i,l,nm} (nearest sample from class l)

ReliefF Relevance Statistics

The relevance statistic for feature j:

\delta^j = \sum_i \left[ -diff(x_i^j, x_{i,nh}^j)^2 + \sum_{l \neq k} p_l \cdot diff(x_i^j, x_{i,l,nm}^j)^2 \right]

where p_l is the proportion of samples belonging to class l in the dataset, and k is the class of sample x_i.

Interpretation: The weighted sum ensures that classes with more samples contribute more to the relevance statistic. This handles class imbalance naturally.
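A sketch of the per-sample ReliefF update under the same assumptions as before (continuous features in [0, 1], Euclidean neighbors), with p_l taken as the empirical class proportion:

```python
import numpy as np

def relieff_update(X, y, i, delta):
    """One ReliefF iteration for sample i: subtract the squared near-hit
    differences, add the class-proportion-weighted squared near-miss
    differences from every other class."""
    m = len(y)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                                      # exclude the sample itself
    same = np.where(y == y[i])[0]
    nh = same[np.argmin(dists[same])]                      # near-hit
    delta -= (X[i] - X[nh]) ** 2
    for l in np.unique(y):
        if l == y[i]:
            continue
        cand = np.where(y == l)[0]
        nm_l = cand[np.argmin(dists[cand])]                # near-miss from class l
        p_l = len(cand) / m                                # empirical class proportion
        delta += p_l * (X[i] - X[nm_l]) ** 2
    return delta
```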

Practical Example: Medical Diagnosis

Consider a medical diagnosis dataset with features: Age, Blood Pressure, Cholesterol, and Glucose. We want to predict disease presence (Yes/No).

Sample Dataset (5 samples, normalized features)

Sample    Age    BP     Cholesterol    Glucose    Disease
1         0.6    0.8    0.7            0.9        Yes
2         0.5    0.7    0.6            0.8        Yes
3         0.3    0.2    0.3            0.2        No
4         0.4    0.3    0.4            0.3        No
5         0.7    0.9    0.8            0.95       Yes

Step 1: Process Sample 1 (Disease=Yes)

• Near-hit: Sample 5 (same class, closest to Sample 1)
• Near-miss: Sample 4 (different class, closest to Sample 1)

For the Glucose feature: diff(0.9, 0.95)^2 = 0.0025 (near-hit), diff(0.9, 0.3)^2 = 0.36 (near-miss)
Update: \delta^{Glucose} += -0.0025 + 0.36 = 0.3575

Step 2: After Processing All Samples

Glucose accumulates the highest relevance statistic because it best distinguishes between disease and no-disease cases. Blood Pressure also scores highly, Cholesterol moderately, while Age is the least discriminative.

Step 3: Feature Selection

Select the top 2-3 features based on their \delta^j values. This reduces dimensionality while preserving the most discriminative information.
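For reference, this worked example can be reproduced with a short script (a sketch using a deterministic pass over every sample, i.e. T = m, and Euclidean nearest neighbors); Glucose comes out with the largest \delta^j, followed by Blood Pressure:

```python
import numpy as np

# Columns: Age, BP, Cholesterol, Glucose (normalized); labels: 1 = Yes, 0 = No
X = np.array([[0.6, 0.8, 0.7, 0.90],
              [0.5, 0.7, 0.6, 0.80],
              [0.3, 0.2, 0.3, 0.20],
              [0.4, 0.3, 0.4, 0.30],
              [0.7, 0.9, 0.8, 0.95]])
y = np.array([1, 1, 0, 0, 1])
names = ["Age", "BP", "Cholesterol", "Glucose"]

delta = np.zeros(X.shape[1])
for i in range(len(X)):                          # one pass over every sample (T = m)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                            # exclude the sample itself
    same, other = np.where(y == y[i])[0], np.where(y != y[i])[0]
    nh = same[np.argmin(dists[same])]            # near-hit
    nm = other[np.argmin(dists[other])]          # near-miss
    delta += -(X[i] - X[nh]) ** 2 + (X[i] - X[nm]) ** 2

for j in np.argsort(delta)[::-1]:                # descending relevance
    print(f"{names[j]:12s} delta = {delta[j]:.3f}")
```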

Computational Efficiency

Relief and ReliefF are highly efficient filter methods:

Time Complexity

O(T \cdot m \cdot d)

where:

  • T: number of iterations (typically T = m)
  • m: number of samples
  • d: number of features

Key Advantage: Linear in all dimensions! This makes Relief/ReliefF extremely fast for large-scale datasets, much faster than wrapper methods that require training models for each candidate subset.

Key Takeaways

Filter methods evaluate features independently of learning algorithms, making them fast and general-purpose.

Relief uses near-hit and near-miss concepts to measure feature relevance: good features have similar values for same-class samples and different values for different-class samples.

ReliefF extends Relief to multi-class problems by considering near-misses from each different class, weighted by class proportions.

The relevance statistic \delta^j quantifies feature importance: higher values indicate more discriminative features.

Relief/ReliefF have linear time complexity O(T \cdot m \cdot d), making them highly efficient for large-scale feature selection.

Filter methods are ideal for initial feature screening before applying more expensive wrapper or embedded methods.