Learn to handle severely imbalanced datasets with undersampling, oversampling, SMOTE, and threshold moving techniques for fraud detection and rare disease screening
Class imbalance occurs when the number of samples in different classes varies dramatically. In extreme cases, the minority class might represent only 0.01% to 1% of the dataset. Standard classifiers trained on imbalanced data tend to be biased toward the majority class, achieving high overall accuracy while failing to detect minority class samples.
A "dumb" classifier that always predicts the majority class achieves:
• Fraud (0.5% positive): 99.5% accuracy
• Disease (0.1% positive): 99.9% accuracy
High accuracy is meaningless! The classifier detects zero fraud or disease cases.
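To make the point concrete, here is a minimal sketch of that baseline using scikit-learn's DummyClassifier on a synthetic fraud-like dataset (the 0.5% positive rate matches the illustrative figure above; the features are random placeholders):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 4))              # placeholder features
y = (rng.random(n) < 0.005).astype(int)  # ~0.5% "fraud" labels

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, pred):.3f}")  # ~0.995
print(f"recall:   {recall_score(y, pred):.3f}")    # 0.0 - detects no fraud at all
```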
Reducing majority class samples to balance the dataset
Random undersampling: randomly remove samples from the majority class until the desired balance is reached (typically a 1:1 ratio).
Tomek links: remove majority class samples that form mutual nearest-neighbor pairs with minority class samples. This cleans the decision boundary by removing ambiguous cases.
Use when: You want to clean boundary regions while keeping most data.
Edited Nearest Neighbours (ENN): remove majority class samples whose k nearest neighbors don't agree on the class label. This filters out noisy or mislabeled samples.
Use when: Dataset has noise and you want cleaner boundaries.
NearMiss: selectively keep the majority class samples closest to minority class samples. The variants (NearMiss-1, -2, -3) use different distance criteria.
Use when: You want to focus on decision boundary regions.
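All four undersampling strategies are available in the imbalanced-learn package. A minimal sketch, assuming imbalanced-learn is installed and using a synthetic 1:99 dataset as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    RandomUnderSampler, TomekLinks, EditedNearestNeighbours, NearMiss,
)

# Synthetic imbalanced dataset (~1% minority) for illustration
X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)

samplers = {
    "random undersampling":     RandomUnderSampler(random_state=42),   # 1:1 by default
    "tomek links":              TomekLinks(),                          # drop ambiguous boundary pairs
    "edited nearest neighbours": EditedNearestNeighbours(n_neighbors=3),  # drop samples whose neighbors disagree
    "nearmiss-1":               NearMiss(version=1),                   # keep majority samples nearest the minority
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, np.bincount(y_res))  # class counts after resampling
```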
Generating synthetic minority class samples
Random oversampling: randomly duplicate minority class samples until balance is achieved. Simple, but prone to overfitting because it creates exact copies.
Problem: Model memorizes minority class samples instead of learning general patterns. Exact duplicates don't add new information, just increase the weight of existing samples.
SMOTE generates synthetic samples along the line segments connecting a minority class sample to one of its k nearest minority class neighbors. This creates realistic new samples instead of exact duplicates.
Original dataset: 10,000 transactions (100 fraud, 9,900 legitimate)

Fraud sample A:
• Amount: $2,450
• Time: 03:15 AM
• Distance from home: 5,000 mi
• Merchant type: Electronics

Fraud sample B (one of A's nearest fraud neighbors):
• Amount: $3,200
• Time: 02:30 AM
• Distance from home: 8,000 mi
• Merchant type: Jewelry

Synthetic sample (interpolation factor 0.6 between A and B):
• Amount: $2,450 + 0.6 × ($3,200 - $2,450) = $2,900
• Time: 03:15 + 0.6 × (02:30 - 03:15) = 02:48 AM
• Distance: 5,000 + 0.6 × (8,000 - 5,000) = 6,800 mi
• Merchant: Electronics (categorical, so the nearest sample's value is used)
SMOTE generates 9,800 synthetic fraud samples → a balanced dataset of 9,900 fraud vs 9,900 legitimate transactions
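In code, the same idea is a few lines with imbalanced-learn's SMOTE. A sketch on synthetic data mirroring the 100-vs-9,900 split above, with RandomOverSampler shown alongside for contrast with exact duplication:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)
print("before:", np.bincount(y))               # roughly [9900  100]

# Exact duplication of minority samples (prone to overfitting)
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X, y)

# Interpolated synthetic samples along minority nearest-neighbor segments
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after SMOTE:", np.bincount(y_smote))    # balanced class counts
```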
Borderline-SMOTE: only oversample minority samples near the decision boundary, where misclassification risk is highest. More focused and efficient than standard SMOTE.
ADASYN (Adaptive Synthetic Sampling): generates more synthetic samples for the minority samples that are harder to learn, i.e., those surrounded by majority class neighbors.
SMOTE-ENN: combines SMOTE oversampling with ENN undersampling. First generates synthetic samples, then cleans noisy regions. Best of both worlds.
SMOTE-Tomek: SMOTE followed by Tomek link removal. Oversamples the minority class, then removes boundary ambiguities for cleaner decision regions.
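These variants map directly onto imbalanced-learn classes. A sketch, again on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)

resamplers = [
    BorderlineSMOTE(random_state=42),  # oversample only near the decision boundary
    ADASYN(random_state=42),           # more synthetics for harder minority samples
    SMOTEENN(random_state=42),         # SMOTE, then ENN cleaning
    SMOTETomek(random_state=42),       # SMOTE, then Tomek link removal
]

for resampler in resamplers:
    X_res, y_res = resampler.fit_resample(X, y)
    print(type(resampler).__name__, np.bincount(y_res))
```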
Adjusting decision thresholds and error costs
Instead of resampling the data, adjust the classification threshold. For logistic regression, instead of classifying as positive when P(y=1|x) ≥ 0.5, lower the threshold to favor the minority class.
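In code, threshold moving is just a comparison against a different cutoff; no retraining is needed. A sketch, where `model` and `X_test` stand in for your own fitted probabilistic classifier and data:

```python
import numpy as np

proba = model.predict_proba(X_test)[:, 1]    # P(y=1 | x)
pred_default = (proba >= 0.5).astype(int)    # standard cutoff
pred_lowered = (proba >= 0.15).astype(int)   # lower cutoff favors the minority (positive) class
```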
Method 1: ROC Curve Analysis
Plot the true positive rate against the false positive rate for different thresholds. Choose the threshold at the point closest to the top-left corner (or based on business requirements).
Method 2: Precision-Recall Trade-off
Plot precision vs recall curves. Select threshold based on whether precision or recall is more important for your application.
Method 3: F-beta Score
Optimize the F_β score on a validation set, where β controls the trade-off between precision and recall. Use β > 1 to favor recall, β < 1 to favor precision.
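The sketch below illustrates Methods 2 and 3 together: compute the precision-recall curve on a validation split, score every candidate threshold with F_β, and keep the best one. The synthetic data and logistic regression model are placeholders for your own pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Illustrative imbalanced data and model; swap in your own validation split
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba_val = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)

beta = 2.0  # beta > 1 favors recall (e.g. disease screening)
# F-beta at each candidate threshold (drop the final point, which has no threshold)
fbeta = ((1 + beta**2) * precision[:-1] * recall[:-1]
         / np.clip(beta**2 * precision[:-1] + recall[:-1], 1e-12, None))
best = int(np.argmax(fbeta))
print(f"chosen threshold: {thresholds[best]:.3f}  (F{beta:.0f} = {fbeta[best]:.3f})")
```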
Rare disease screening: 0.5% prevalence, cost of missing case ($100,000) >> cost of false alarm ($500)
With the default threshold (0.5):
• Sensitivity (Recall): 65% - misses 35% of cases!
• Specificity: 99.2% - few false alarms
• Problem: Too many missed diagnoses (false negatives)

With a lowered threshold (0.15):
• Sensitivity (Recall): 92% - catches most cases!
• Specificity: 95% - more false alarms but acceptable
• Benefit: Saves lives by detecting more true positives
By lowering the threshold from 0.5 to 0.15, recall increases from 65% to 92%, detecting 27 additional true cases per 100 sick patients at the cost of more follow-up tests for healthy patients.
Incorporate misclassification costs directly into the learning objective. Instead of minimizing error rate, minimize expected cost:
Expected Cost = C_FN × P(FN) + C_FP × P(FP)
where C_FN is cost of false negative, C_FP is cost of false positive
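A simple way to act on this is to scan candidate thresholds on a validation set and pick the one with the lowest total cost. The sketch below reuses the illustrative costs from the screening example ($100,000 per missed case, $500 per false alarm) and assumes `model`, `X_val`, and `y_val` exist as in the previous sketch:

```python
import numpy as np

C_FN, C_FP = 100_000, 500          # cost of a missed case vs. a false alarm
proba_val = model.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
costs = []
for t in thresholds:
    pred = (proba_val >= t).astype(int)
    fn = np.sum((y_val == 1) & (pred == 0))   # missed cases at this threshold
    fp = np.sum((y_val == 0) & (pred == 1))   # false alarms at this threshold
    costs.append(C_FN * fn + C_FP * fp)

best_t = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold: {best_t:.2f}")
```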
| Approach | Best For | Limitation |
|---|---|---|
| Threshold Moving | Already-trained model, quick adjustment needed | Requires probabilistic outputs |
| Class Weights | Built-in support in algorithm, no data modification | Not all algorithms support weights |
| Cost-Sensitive | Clear business costs known, high-stakes decisions | Requires cost estimation, complex implementation |
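For reference, class weights require no data modification at all; most scikit-learn classifiers accept a `class_weight` argument (a sketch; the 20x weight is an arbitrary illustration, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely proportional to their frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Or set explicit weights, e.g. make minority-class errors 20x more costly during training
rf = RandomForestClassifier(class_weight={0: 1, 1: 20}, random_state=42)

# ...then fit and evaluate as usual: clf.fit(X_train, y_train), etc.
```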
Choosing appropriate metrics beyond accuracy
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Focus on the minority class (positive). Recall is critical for disease screening, precision for spam detection.
F1 = 2 × (Prec × Rec) / (Prec + Rec)
F_β = (1+β²) × (Prec × Rec) / (β²×Prec + Rec)
Harmonic mean balances precision and recall. Use β=2 to weight recall twice as much as precision.
Area Under ROC Curve
Measures classifier's ability to rank positive samples higher than negative ones. Good overall metric. AUC = 0.5 is random, 1.0 is perfect.
Area Under Precision-Recall Curve
Often better than ROC-AUC for severe imbalance. More sensitive to minority class performance. Baseline is fraction of positives.
(TPR + TNR) / 2
Average of recall for each class. Gives equal weight to performance on each class regardless of size. Good for multi-class imbalance.
Agreement adjusted for chance
Accounts for the possibility of correct classification by chance. κ = 0 is random, 1 is perfect. More informative than accuracy for imbalanced data.
• Medical diagnosis: Recall (don't miss sick patients)
• Spam detection: Precision (don't block legitimate emails)
• Fraud detection: F2 score (favor recall but maintain some precision)
• General comparison: PR-AUC or F1 score
• Multiple models: ROC-AUC for threshold-independent comparison
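All of these metrics are one call each in scikit-learn. A sketch, assuming `y_test` holds the true labels and `pred` / `proba` hold the predicted labels and P(y=1|x) from your model:

```python
from sklearn.metrics import (
    precision_score, recall_score, f1_score, fbeta_score,
    roc_auc_score, average_precision_score,       # average precision approximates PR-AUC
    balanced_accuracy_score, cohen_kappa_score,
)

print("precision:        ", precision_score(y_test, pred))
print("recall:           ", recall_score(y_test, pred))
print("F1:               ", f1_score(y_test, pred))
print("F2 (recall-heavy):", fbeta_score(y_test, pred, beta=2))
print("ROC-AUC:          ", roc_auc_score(y_test, proba))
print("PR-AUC:           ", average_precision_score(y_test, proba))
print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print("Cohen's kappa:    ", cohen_kappa_score(y_test, pred))
```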
A step-by-step approach to handling class imbalance
| Scenario | Recommended Approach | Why |
|---|---|---|
| Mild imbalance (1:10) | Class weights or threshold moving | Simple, effective, no data modification |
| Moderate imbalance (1:100) | SMOTE + class weights | Combines synthetic data with algorithm adjustment |
| Severe imbalance (1:1000+) | SMOTE-ENN + ensemble + cost-sensitive | Needs multiple techniques combined |
| Very large dataset | Random undersampling or class weights | SMOTE is slow at this scale; with abundant data, undersampling sacrifices little useful information |
| Small minority class | Borderline-SMOTE or ADASYN | Focuses on difficult boundary cases |
| Critical application | Cost-sensitive learning | Directly incorporates business costs |
When should you apply SMOTE: before or after the train/test split? Always after splitting, and only to the training data. If you apply SMOTE before splitting, synthetic samples in the test set may be very similar to samples in the training set (they are interpolations!), leading to overly optimistic performance estimates. This is a form of data leakage. The correct workflow: split → SMOTE on the training set only → train → evaluate on the original, imbalanced test set.
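A sketch of that leak-free workflow: split first, then let imbalanced-learn's Pipeline apply SMOTE only when fitting, so both training and cross-validation leave the test data untouched (`X` and `y` stand in for your full imbalanced dataset):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),           # resamples only during fit
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                       # SMOTE sees the training fold only

# Evaluate on the original, imbalanced test set
proba = pipe.predict_proba(X_test)[:, 1]
print("PR-AUC:", average_precision_score(y_test, proba))
```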
How imbalanced is too imbalanced? No strict definition, but rough guidelines: 1:10 is mild, 1:100 is moderate, 1:1000+ is severe. However, severity also depends on the absolute minority class size. Having 100,000 minority samples in a 10 million sample dataset (1:100 ratio) is less problematic than having 10 minority samples in a 1,000 sample dataset (also 1:100). With too few minority samples, even the best techniques struggle; focus on collecting more data.
Can deep learning handle class imbalance? Yes, but with modifications. Deep learning typically needs even more minority class examples. Approaches: (1) use focal loss instead of cross-entropy (designed for imbalance), (2) apply class weights in the loss function, (3) oversample the minority class in each batch, (4) use pre-trained models and fine-tune with class balancing, or (5) try contrastive or metric learning approaches. For extreme imbalance (<100 minority samples), traditional ML with SMOTE often outperforms deep learning.
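As one example of approach (1), the standard binary focal loss can be written in a few lines of NumPy; the `alpha` and `gamma` values below are common defaults, not tuned recommendations:

```python
import numpy as np

def binary_focal_loss(y_true, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Cross-entropy down-weighted for easy examples, so training
    concentrates on the rare, hard (usually minority) cases."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)               # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)   # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# An easy majority example contributes far less than a misclassified minority one
print(binary_focal_loss(np.array([0, 1]), np.array([0.05, 0.30])))
```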
How do you know whether your imbalance handling is working? Compare metrics before and after the intervention. Look for: (1) increased recall on the minority class without a catastrophic precision drop, (2) higher F1 or PR-AUC scores, (3) a confusion matrix showing more true positives and fewer false negatives, (4) ROC-AUC improvement. If precision drops below 10% while chasing high recall, you may be over-correcting. Use cross-validation to ensure improvements generalize. The test set should remain imbalanced to reflect real-world deployment.