Learn to handle severely imbalanced datasets with undersampling, oversampling, SMOTE, and threshold moving techniques for fraud detection and rare disease screening
Class imbalance occurs when the number of samples in different classes varies dramatically. In extreme cases, the minority class might represent only 0.01% to 1% of the dataset. Standard classifiers trained on imbalanced data tend to be biased toward the majority class, achieving high overall accuracy while failing to detect minority class samples.
A "dumb" classifier that always predicts the majority class achieves:
• Fraud (0.5% positive): 99.5% accuracy
• Disease (0.1% positive): 99.9% accuracy
High accuracy is meaningless! The classifier detects zero fraud or disease cases.
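To make the point concrete, here is a minimal sketch of that baseline using scikit-learn's DummyClassifier on a synthetic fraud-like dataset (the 0.5% positive rate matches the illustrative figure above; the features are random placeholders):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 4))              # placeholder features
y = (rng.random(n) < 0.005).astype(int)  # ~0.5% "fraud" labels

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, pred):.3f}")  # ~0.995
print(f"recall:   {recall_score(y, pred):.3f}")    # 0.0 - detects no fraud at all
```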
Reducing majority class samples to balance the dataset
Random undersampling: randomly remove samples from the majority class until the desired balance is reached (typically a 1:1 ratio).
Tomek links: remove majority class samples that form mutual nearest-neighbor pairs with minority class samples. This cleans the decision boundary by removing ambiguous cases.
Use when: You want to clean boundary regions while keeping most data.
Edited Nearest Neighbours (ENN): remove majority class samples whose k nearest neighbors don't agree on the class label. This filters out noisy or mislabeled samples.
Use when: Dataset has noise and you want cleaner boundaries.
NearMiss: selectively keep the majority class samples closest to minority class samples. The variants (NearMiss-1, -2, -3) use different distance criteria.
Use when: You want to focus on decision boundary regions.
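All four undersampling strategies are available in the imbalanced-learn package. A minimal sketch, assuming imbalanced-learn is installed and using a synthetic 1:99 dataset as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    RandomUnderSampler, TomekLinks, EditedNearestNeighbours, NearMiss,
)

# Synthetic imbalanced dataset (~1% minority) for illustration
X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)

samplers = {
    "random undersampling":     RandomUnderSampler(random_state=42),   # 1:1 by default
    "tomek links":              TomekLinks(),                          # drop ambiguous boundary pairs
    "edited nearest neighbours": EditedNearestNeighbours(n_neighbors=3),  # drop samples whose neighbors disagree
    "nearmiss-1":               NearMiss(version=1),                   # keep majority samples nearest the minority
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, np.bincount(y_res))  # class counts after resampling
```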
Generating synthetic minority class samples
Random oversampling: randomly duplicate minority class samples until balance is achieved. Simple, but prone to overfitting because it creates exact copies.
Problem: Model memorizes minority class samples instead of learning general patterns. Exact duplicates don't add new information, just increase the weight of existing samples.
SMOTE generates synthetic samples along the line segments connecting a minority class sample to one of its k nearest minority class neighbors. This creates realistic new samples instead of exact duplicates.
Original dataset: 10,000 transactions (100 fraud, 9,900 legitimate)

Fraud sample A:
• Amount: $2,450
• Time: 03:15 AM
• Distance from home: 5,000 mi
• Merchant type: Electronics

Fraud sample B (one of A's nearest fraud neighbors):
• Amount: $3,200
• Time: 02:30 AM
• Distance from home: 8,000 mi
• Merchant type: Jewelry

Synthetic sample (interpolation factor 0.6 between A and B):
• Amount: $2,450 + 0.6 × ($3,200 - $2,450) = $2,900
• Time: 03:15 + 0.6 × (02:30 - 03:15) = 02:48 AM
• Distance: 5,000 + 0.6 × (8,000 - 5,000) = 6,800 mi
• Merchant: Electronics (categorical, so the nearest sample's value is used)
SMOTE generates 9,800 synthetic fraud samples → a balanced dataset of 9,900 fraud vs 9,900 legitimate transactions
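In code, the same idea is a few lines with imbalanced-learn's SMOTE. A sketch on synthetic data mirroring the 100-vs-9,900 split above, with RandomOverSampler shown alongside for contrast with exact duplication:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)
print("before:", np.bincount(y))               # roughly [9900  100]

# Exact duplication of minority samples (prone to overfitting)
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X, y)

# Interpolated synthetic samples along minority nearest-neighbor segments
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after SMOTE:", np.bincount(y_smote))    # balanced class counts
```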
Borderline-SMOTE: only oversample minority samples near the decision boundary, where misclassification risk is highest. More focused and efficient than standard SMOTE.
ADASYN (Adaptive Synthetic Sampling): generates more synthetic samples for the minority samples that are harder to learn, i.e., those surrounded by majority class neighbors.
SMOTE-ENN: combines SMOTE oversampling with ENN undersampling. First generates synthetic samples, then cleans noisy regions. Best of both worlds.
SMOTE-Tomek: SMOTE followed by Tomek link removal. Oversamples the minority class, then removes boundary ambiguities for cleaner decision regions.
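These variants map directly onto imbalanced-learn classes. A sketch, again on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)

resamplers = [
    BorderlineSMOTE(random_state=42),  # oversample only near the decision boundary
    ADASYN(random_state=42),           # more synthetics for harder minority samples
    SMOTEENN(random_state=42),         # SMOTE, then ENN cleaning
    SMOTETomek(random_state=42),       # SMOTE, then Tomek link removal
]

for resampler in resamplers:
    X_res, y_res = resampler.fit_resample(X, y)
    print(type(resampler).__name__, np.bincount(y_res))
```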
Adjusting decision thresholds and error costs
Instead of resampling the data, adjust the classification threshold. For logistic regression, instead of classifying as positive when P(y=1|x) ≥ 0.5, lower the threshold to favor the minority class.
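In code, threshold moving is just a comparison against a different cutoff; no retraining is needed. A sketch, where `model` and `X_test` stand in for your own fitted probabilistic classifier and data:

```python
import numpy as np

proba = model.predict_proba(X_test)[:, 1]    # P(y=1 | x)
pred_default = (proba >= 0.5).astype(int)    # standard cutoff
pred_lowered = (proba >= 0.15).astype(int)   # lower cutoff favors the minority (positive) class
```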
Method 1: ROC Curve Analysis
Plot the true positive rate against the false positive rate for different thresholds. Choose the threshold at the point closest to the top-left corner (or based on business requirements).
Method 2: Precision-Recall Trade-off
Plot precision vs recall curves. Select threshold based on whether precision or recall is more important for your application.
Method 3: F-beta Score
Optimize the F_β score on a validation set, where β controls the trade-off between precision and recall. Use β > 1 to favor recall, β < 1 to favor precision.
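The sketch below illustrates Methods 2 and 3 together: compute the precision-recall curve on a validation split, score every candidate threshold with F_β, and keep the best one. The synthetic data and logistic regression model are placeholders for your own pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Illustrative imbalanced data and model; swap in your own validation split
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba_val = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)

beta = 2.0  # beta > 1 favors recall (e.g. disease screening)
# F-beta at each candidate threshold (drop the final point, which has no threshold)
fbeta = ((1 + beta**2) * precision[:-1] * recall[:-1]
         / np.clip(beta**2 * precision[:-1] + recall[:-1], 1e-12, None))
best = int(np.argmax(fbeta))
print(f"chosen threshold: {thresholds[best]:.3f}  (F{beta:.0f} = {fbeta[best]:.3f})")
```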
Rare disease screening: 0.5% prevalence, cost of missing case ($100,000) >> cost of false alarm ($500)
With the default threshold (0.5):
• Sensitivity (Recall): 65% - misses 35% of cases!
• Specificity: 99.2% - few false alarms
• Problem: Too many missed diagnoses (false negatives)

With a lowered threshold (0.15):
• Sensitivity (Recall): 92% - catches most cases!
• Specificity: 95% - more false alarms but acceptable
• Benefit: Saves lives by detecting more true positives
By lowering the threshold from 0.5 to 0.15, recall increases from 65% to 92%, detecting 27 additional true cases per 100 sick patients at the cost of more follow-up tests for healthy patients.
Incorporate misclassification costs directly into the learning objective. Instead of minimizing error rate, minimize expected cost:
Expected Cost = C_FN × P(FN) + C_FP × P(FP)
where C_FN is cost of false negative, C_FP is cost of false positive
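A simple way to act on this is to scan candidate thresholds on a validation set and pick the one with the lowest total cost. The sketch below reuses the illustrative costs from the screening example ($100,000 per missed case, $500 per false alarm) and assumes `model`, `X_val`, and `y_val` exist as in the previous sketch:

```python
import numpy as np

C_FN, C_FP = 100_000, 500          # cost of a missed case vs. a false alarm
proba_val = model.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
costs = []
for t in thresholds:
    pred = (proba_val >= t).astype(int)
    fn = np.sum((y_val == 1) & (pred == 0))   # missed cases at this threshold
    fp = np.sum((y_val == 0) & (pred == 1))   # false alarms at this threshold
    costs.append(C_FN * fn + C_FP * fp)

best_t = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold: {best_t:.2f}")
```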
| Approach | Best For | Limitation |
|---|---|---|
| Threshold Moving | Already-trained model, quick adjustment needed | Requires probabilistic outputs |
| Class Weights | Built-in support in algorithm, no data modification | Not all algorithms support weights |
| Cost-Sensitive | Clear business costs known, high-stakes decisions | Requires cost estimation, complex implementation |
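For reference, class weights require no data modification at all; most scikit-learn classifiers accept a `class_weight` argument (a sketch; the 20x weight is an arbitrary illustration, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely proportional to their frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Or set explicit weights, e.g. make minority-class errors 20x more costly during training
rf = RandomForestClassifier(class_weight={0: 1, 1: 20}, random_state=42)

# ...then fit and evaluate as usual: clf.fit(X_train, y_train), etc.
```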
Choosing appropriate metrics beyond accuracy
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Focus on the minority class (positive). Recall is critical for disease screening, precision for spam detection.
F1 = 2 × (Prec × Rec) / (Prec + Rec)
F_β = (1+β²) × (Prec × Rec) / (β²×Prec + Rec)
Harmonic mean balances precision and recall. Use β=2 to weight recall twice as much as precision.
Area Under ROC Curve
Measures classifier's ability to rank positive samples higher than negative ones. Good overall metric. AUC = 0.5 is random, 1.0 is perfect.
Area Under Precision-Recall Curve
Often better than ROC-AUC for severe imbalance. More sensitive to minority class performance. Baseline is fraction of positives.
(TPR + TNR) / 2
Average of recall for each class. Gives equal weight to performance on each class regardless of size. Good for multi-class imbalance.
Agreement adjusted for chance
Accounts for the possibility of correct classification by chance. κ = 0 is random, 1 is perfect. More informative than accuracy for imbalanced data.
• Medical diagnosis: Recall (don't miss sick patients)
• Spam detection: Precision (don't block legitimate emails)
• Fraud detection: F2 score (favor recall but maintain some precision)
• General comparison: PR-AUC or F1 score
• Multiple models: ROC-AUC for threshold-independent comparison
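All of these metrics are one call each in scikit-learn. A sketch, assuming `y_test` holds the true labels and `pred` / `proba` hold the predicted labels and P(y=1|x) from your model:

```python
from sklearn.metrics import (
    precision_score, recall_score, f1_score, fbeta_score,
    roc_auc_score, average_precision_score,       # average precision approximates PR-AUC
    balanced_accuracy_score, cohen_kappa_score,
)

print("precision:        ", precision_score(y_test, pred))
print("recall:           ", recall_score(y_test, pred))
print("F1:               ", f1_score(y_test, pred))
print("F2 (recall-heavy):", fbeta_score(y_test, pred, beta=2))
print("ROC-AUC:          ", roc_auc_score(y_test, proba))
print("PR-AUC:           ", average_precision_score(y_test, proba))
print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print("Cohen's kappa:    ", cohen_kappa_score(y_test, pred))
```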
A step-by-step approach to handling class imbalance
| Scenario | Recommended Approach | Why |
|---|---|---|
| Mild imbalance (1:10) | Class weights or threshold moving | Simple, effective, no data modification |
| Moderate imbalance (1:100) | SMOTE + class weights | Combines synthetic data with algorithm adjustment |
| Severe imbalance (1:1000+) | SMOTE-ENN + ensemble + cost-sensitive | Needs multiple techniques combined |
| Very large dataset | Random undersampling or class weights | SMOTE is slow at this scale; with abundant data, undersampling sacrifices little useful information |
| Small minority class | Borderline-SMOTE or ADASYN | Focuses on difficult boundary cases |
| Critical application | Cost-sensitive learning | Directly incorporates business costs |
When should you apply SMOTE: before or after the train/test split? Always after splitting, and only to the training data. If you apply SMOTE before splitting, synthetic samples in the test set may be very similar to samples in the training set (they are interpolations!), leading to overly optimistic performance estimates. This is a form of data leakage. The correct workflow: split → SMOTE on the training set only → train → evaluate on the original, imbalanced test set.
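A sketch of that leak-free workflow: split first, then let imbalanced-learn's Pipeline apply SMOTE only when fitting, so both training and cross-validation leave the test data untouched (`X` and `y` stand in for your full imbalanced dataset):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),           # resamples only during fit
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                       # SMOTE sees the training fold only

# Evaluate on the original, imbalanced test set
proba = pipe.predict_proba(X_test)[:, 1]
print("PR-AUC:", average_precision_score(y_test, proba))
```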
How imbalanced is too imbalanced? No strict definition, but rough guidelines: 1:10 is mild, 1:100 is moderate, 1:1000+ is severe. However, severity also depends on the absolute minority class size. Having 100,000 minority samples in a 10 million sample dataset (1:100 ratio) is less problematic than having 10 minority samples in a 1,000 sample dataset (also 1:100). With too few minority samples, even the best techniques struggle; focus on collecting more data.
Can deep learning handle class imbalance? Yes, but with modifications. Deep learning typically needs even more minority class examples. Approaches: (1) use focal loss instead of cross-entropy (designed for imbalance), (2) apply class weights in the loss function, (3) oversample the minority class in each batch, (4) use pre-trained models and fine-tune with class balancing, or (5) try contrastive or metric learning approaches. For extreme imbalance (<100 minority samples), traditional ML with SMOTE often outperforms deep learning.
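As one example of approach (1), the standard binary focal loss can be written in a few lines of NumPy; the `alpha` and `gamma` values below are common defaults, not tuned recommendations:

```python
import numpy as np

def binary_focal_loss(y_true, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Cross-entropy down-weighted for easy examples, so training
    concentrates on the rare, hard (usually minority) cases."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)               # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)   # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# An easy majority example contributes far less than a misclassified minority one
print(binary_focal_loss(np.array([0, 1]), np.array([0.05, 0.30])))
```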
How do you know whether your imbalance handling is working? Compare metrics before and after the intervention. Look for: (1) increased recall on the minority class without a catastrophic precision drop, (2) higher F1 or PR-AUC scores, (3) a confusion matrix showing more true positives and fewer false negatives, (4) ROC-AUC improvement. If precision drops below 10% while chasing high recall, you may be over-correcting. Use cross-validation to ensure improvements generalize. The test set should remain imbalanced to reflect real-world deployment.