Understand why diversity is crucial for ensemble success and learn how to measure and enhance diversity using various metrics and techniques
Diversity is the fundamental requirement for effective ensemble learning. If all individual learners make the same mistakes, the ensemble cannot improve over a single learner. Diversity ensures that learners complement each other, allowing the ensemble to correct individual errors.
The ensemble's generalization error can be decomposed as:
E = Ē - Ā
Where:
- E is the ensemble's generalization error
- Ē is the average generalization error of the individual learners
- Ā is the average ambiguity (diversity) of the learners
Interpretation: To minimize ensemble error E, we need both low individual error Ē (accurate learners) and high diversity Ā (different learners). This is the accuracy-diversity tradeoff.
The error-diversity decomposition provides a theoretical framework for understanding ensemble performance:
E = Ē - Ā
Where:
- Ē = (1/T) Σₜ E[(hₜ(x) - y)²] is the average squared error of the T individual learners
- Ā = (1/T) Σₜ E[(hₜ(x) - H(x))²] is the average ambiguity: how far individual predictions spread around the ensemble prediction H(x)
Key Insights:
- Ā is an average of squared terms, so Ā ≥ 0: an averaging ensemble is never worse than the average of its members
- The more the learners disagree with the ensemble output (larger Ā), the larger the reduction in ensemble error
- Diversity cannot be maximized in isolation: pushing Ā up typically pushes the individual errors Ē up as well
The error-diversity decomposition holds exactly for regression with squared error, but it does not carry over directly to classification. For classification, we instead use pairwise diversity metrics (discussed below) to measure diversity indirectly.
The core principle remains: ensemble error = average individual error - diversity benefit
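A quick numerical check makes the decomposition concrete. Below is a minimal sketch (NumPy assumed; the averaging ensemble and synthetic data are purely illustrative) verifying that E = Ē - Ā holds for regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression targets and predictions from T = 5 learners.
m, T = 1000, 5
y = rng.normal(size=m)                          # true targets
preds = y + rng.normal(scale=0.5, size=(T, m))  # each learner = truth + noise

H = preds.mean(axis=0)                          # simple-averaging ensemble

E_bar = np.mean((preds - y) ** 2)               # average individual error Ē
A_bar = np.mean((preds - H) ** 2)               # average ambiguity Ā
E = np.mean((H - y) ** 2)                       # ensemble error E

print(f"E = {E:.4f},  Ē - Ā = {E_bar - A_bar:.4f}")  # the two agree
```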
For binary classification, we measure diversity between pairs of learners using various metrics:
The disagreement measure is the simplest diversity metric: the proportion of samples on which two learners disagree.
disᵢⱼ = (b + c) / m
Where (for learners hᵢ and hⱼ on m samples):
- a = number of samples where both predict +1
- b = number of samples where hᵢ predicts +1 and hⱼ predicts -1
- c = number of samples where hᵢ predicts -1 and hⱼ predicts +1
- d = number of samples where both predict -1
- m = a + b + c + d
Example:
For 200 samples: a=120, b=30, c=30, d=20
dis = (30 + 30) / 200 = 0.15 (15% disagreement)
Range: [0, 1]. Higher values indicate more diversity.
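A small helper makes this directly usable. The sketch below (NumPy assumed) derives the contingency counts from two ±1 prediction vectors and reproduces the disagreement value from the example:

```python
import numpy as np

def contingency(h_i, h_j):
    """Counts (a, b, c, d) for two vectors of ±1 predictions."""
    a = int(np.sum((h_i == 1) & (h_j == 1)))
    b = int(np.sum((h_i == 1) & (h_j == -1)))
    c = int(np.sum((h_i == -1) & (h_j == 1)))
    d = int(np.sum((h_i == -1) & (h_j == -1)))
    return a, b, c, d

def disagreement(a, b, c, d):
    """Fraction of samples on which the two learners differ."""
    return (b + c) / (a + b + c + d)

print(disagreement(120, 30, 30, 20))  # 0.15, matching the example above
```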
The correlation coefficient ρ measures the correlation between two learners' predictions:
ρᵢⱼ = (ad - bc) / √[(a+b)(a+c)(c+d)(b+d)]
Where a, b, c, d are defined as above
Example:
For a=120, b=30, c=30, d=20:
ρ = (120×20 - 30×30) / √[(150)(150)(50)(50)] = 1500 / 7500 = 0.20
Range: [-1, 1]. Lower values indicate more diversity: ρ = 1 means perfect agreement, ρ = 0 means the predictions are uncorrelated, and ρ = -1 means perfect disagreement.
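The helper below implements the formula directly from the contingency counts and reproduces the corrected value from the example:

```python
from math import sqrt

def correlation(a, b, c, d):
    """Pairwise correlation ρ between two learners' ±1 predictions."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (c + d) * (b + d))

print(round(correlation(120, 30, 30, 20), 2))  # 0.2
```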
The Q-statistic is another correlation-based measure, normalized differently:
Qᵢⱼ = (ad - bc) / (ad + bc)
It has a simpler form than the correlation coefficient: Q always has the same sign as ρ, and |Q| ≥ |ρ|.
Properties:
- Range: [-1, 1]
- Q = 0 for statistically independent learners
- Q > 0 when the learners tend to make the same predictions; Q < 0 when they tend to disagree (more diversity)
- For the example above: Q = (2400 - 900) / (2400 + 900) ≈ 0.45
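As a sketch, using the same contingency counts as before:

```python
def q_statistic(a, b, c, d):
    """Q-statistic: 0 for independent learners; lower = more diverse."""
    return (a * d - b * c) / (a * d + b * c)

print(round(q_statistic(120, 30, 30, 20), 2))  # ≈ 0.45
```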
Cohen's kappa (κ) measures agreement beyond chance and is commonly used in inter-rater reliability:
κ = (P₀ - Pₑ) / (1 - Pₑ)
Where:
- P₀ = (a + d) / m is the observed agreement between the two learners
- Pₑ = [(a+b)(a+c) + (c+d)(b+d)] / m² is the agreement expected by chance
Interpretation:
- κ = 1: complete agreement; κ = 0: agreement no better than chance; κ < 0: less agreement than chance (strong diversity)
- Lower values indicate more diversity
- For the example above: P₀ = 140/200 = 0.7, Pₑ = (150×150 + 50×50)/200² = 0.625, so κ = (0.7 - 0.625)/(1 - 0.625) = 0.20
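The same counts as before give the worked value:

```python
def kappa(a, b, c, d):
    """Cohen's kappa: agreement between two learners beyond chance."""
    m = a + b + c + d
    p0 = (a + d) / m                                     # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2  # chance agreement
    return (p0 - pe) / (1 - pe)

print(round(kappa(120, 30, 30, 20), 2))  # 0.2
```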
Various techniques can be used to increase diversity in ensembles:
Train learners on different subsets or weighted versions of the data:
Bootstrap Sampling (Bagging)
Each learner trains on a different bootstrap sample. Effective for unstable learners like decision trees.
Sequential Sampling (Boosting)
Adjust sample weights iteratively, focusing each new learner on previously misclassified samples.
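Both data-manipulation strategies are available off the shelf. A minimal sketch (assuming scikit-learn is installed; the synthetic dataset is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bootstrap sampling: each tree sees a different bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            random_state=0).fit(X, y)

# Sequential reweighting: each new learner focuses on earlier mistakes.
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
```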
Train learners on different subsets of features:
Random Subspace Method
Each learner uses a randomly selected subset of features. For example, if you have 10 features, each learner might use only 5 randomly chosen features.
Example: Random Forest uses this at each node (attribute randomness), but you can also use it at the dataset level.
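At the dataset level, the random subspace method can be sketched with scikit-learn's BaggingClassifier (assumed available) by disabling sample bootstrapping and sampling features instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

subspace = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=False,    # every learner sees all samples...
    max_features=0.5,   # ...but only 5 of the 10 features
    random_state=0,
).fit(X, y)
```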
Modify how learners represent or output predictions:
Output Flipping
Randomly flip the class labels of some training samples before training each learner. Useful when learners trained on the same data would otherwise be nearly identical.
Error-Correcting Output Codes (ECOC)
Encode class labels using binary codes, train binary classifiers for each bit, then decode predictions. Increases diversity through different coding schemes.
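ECOC is available in scikit-learn as OutputCodeClassifier (assumed installed); a minimal sketch on an illustrative 4-class problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           random_state=0)

# Each class gets a binary code word; one binary classifier per code bit.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0,  # code length = 2 × n_classes bits
                            random_state=0).fit(X, y)
```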
Train learners with different hyperparameters or algorithm configurations:
Hyperparameter Variation
Vary settings such as learning rate, tree depth, or regularization strength across otherwise-identical learners:
Example: Train 5 neural networks with learning rates [0.001, 0.01, 0.1, 0.5, 1.0] to create diverse learners.
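A sketch of that example (scikit-learn assumed; the soft-voting combiner and synthetic data are illustrative choices, not prescribed by the method):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Five MLPs that differ only in their initial learning rate.
learners = [
    (f"mlp_lr_{lr}", MLPClassifier(learning_rate_init=lr, max_iter=500,
                                   random_state=0))
    for lr in [0.001, 0.01, 0.1, 0.5, 1.0]
]
ensemble = VotingClassifier(learners, voting="soft").fit(X, y)
```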
A bank analyzes diversity in its 5-model credit approval ensemble:
| ID | Income | Age | Employment | Debt | Credit Score | Approved |
|---|---|---|---|---|---|---|
| 1 | $45,000 | 28 | Full-time | $12,000 | 680 | Yes |
| 2 | $32,000 | 35 | Part-time | $18,000 | 620 | No |
| 3 | $75,000 | 42 | Full-time | $25,000 | 750 | Yes |
| 4 | $28,000 | 24 | Unemployed | $15,000 | 580 | No |
| 5 | $95,000 | 38 | Full-time | $35,000 | 720 | Yes |
| 6 | $41,000 | 31 | Full-time | $22,000 | 650 | Yes |
| 7 | $22,000 | 26 | Part-time | $19,000 | 590 | No |
| 8 | $68,000 | 45 | Full-time | $28,000 | 710 | Yes |
Contingency table for models 1 and 2 (out of 200 samples): a=120, b=30, c=30, d=20
Disagreement: dis = (30+30)/200 = 0.15 (moderate diversity)
Correlation: ρ = 0.20 (low correlation, good diversity)
Interpretation: Models 1 and 2 have good diversity (ρ < 0.3). Their combination should perform better than either alone.
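The same check extends to all ten model pairs. A minimal scan sketch (NumPy assumed; the bank's actual predictions are not given, so random ±1 vectors stand in for the five models' decisions):

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.choice([-1, 1], size=(5, 200))  # 5 models × 200 applications

def rho(h_i, h_j):
    """Pairwise correlation from the ±1 prediction contingency counts."""
    a = np.sum((h_i == 1) & (h_j == 1))
    b = np.sum((h_i == 1) & (h_j == -1))
    c = np.sum((h_i == -1) & (h_j == 1))
    d = np.sum((h_i == -1) & (h_j == -1))
    return (a * d - b * c) / np.sqrt((a + b) * (a + c) * (c + d) * (b + d))

# Flag highly correlated (redundant) pairs across the ensemble.
for i in range(5):
    for j in range(i + 1, 5):
        print(f"models {i + 1} and {j + 1}: rho = {rho(preds[i], preds[j]):.2f}")
```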
Q: What diversity values should I aim for?
A: For disagreement, aim for dis > 0.1 (at least 10% disagreement). For correlation, aim for ρ < 0.5 (preferably < 0.3). However, the optimal value depends on your specific problem and learners.
Q: Can an ensemble be too diverse?
A: Yes. If diversity is very high, it usually means the individual learners are too weak (high error rates). Remember: you need both accuracy AND diversity. Very high diversity with poor individual accuracy won't help.
Q: How do I measure diversity for regression ensembles?
A: For regression, use the diversity term Ā from the error-diversity decomposition: Ā = (1/T) Σₜ E[(hₜ(x) - H(x))²]. Higher Ā means more diversity. You can also use the correlation of predictions between learners.
Q: Which diversity enhancement method should I choose?
A: It depends on your base learners. For decision trees, bootstrap sampling (Bagging) and attribute randomness (Random Forest) work excellently. For neural networks, hyperparameter variation is often effective. Combining multiple methods usually works best.