Learn how to combine individual learners effectively: from simple voting and averaging to advanced learning-based methods like Stacking
Once you have T individual learners h₁, h₂, ..., h_T, the next critical step is deciding how to combine their predictions. The combination strategy significantly affects ensemble performance. There are three main approaches:
- Averaging (for regression): combine predictions with a simple or weighted average.
- Voting (for classification): combine predictions with a majority or weighted vote.
- Learning-based combination: train a meta-learner (Stacking) to combine the predictions.
For regression tasks, we combine continuous predictions using averaging:
The simplest method: take the arithmetic mean of all predictions.
H(x) = (1/T) Σₜ hₜ(x)
All learners contribute equally to the final prediction
Example:
If 5 trees predict house prices: [$280k, $290k, $285k, $275k, $290k], then H(x) = ($280k + $290k + $285k + $275k + $290k) / 5 = $284k
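As a minimal sketch, the simple average can be computed with NumPy, reusing the five predictions from the example above:

```python
import numpy as np

# Predictions from 5 individual regressors for one house, in $1000s
predictions = np.array([280, 290, 285, 275, 290])

# Simple averaging: every learner contributes equally
H_x = predictions.mean()
print(H_x)  # 284.0
```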
Give more weight to better-performing learners:
H(x) = Σₜ wₜ hₜ(x)
Where wₜ ≥ 0 and Σₜ wₜ = 1
Weights are typically based on individual learner performance (e.g., validation error)
Example:
If 3 models predict with weights [0.4, 0.3, 0.3] and predictions [$280k, $290k, $275k]:
H(x) = 0.4 × $280k + 0.3 × $290k + 0.3 × $275k = $281.5k
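The weighted version is a one-line dot product; here is a short sketch using the weights and predictions from the example above:

```python
import numpy as np

# Predictions from 3 regressors (in $1000s) and their weights
predictions = np.array([280, 290, 275])
weights = np.array([0.4, 0.3, 0.3])  # non-negative, summing to 1

# Weighted averaging: better-performing learners get larger weights
H_x = np.dot(weights, predictions)
print(H_x)  # 281.5
```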
Note: Weighted averaging is not always better than simple averaging. If weight estimates are noisy, simple averaging can be more robust.
Use simple averaging when the individual learners perform similarly, or when performance estimates are noisy or unreliable (e.g., little validation data), since equal weights are more robust.
Use weighted averaging when you have confident, accurate performance estimates and there are clear differences in quality between the learners.
For classification tasks, we combine discrete predictions using voting:
Output the class that receives the most votes (relative majority, or plurality, voting). This is the most common voting method.
H(x) = argmax_c Σₜ [hₜ(x) = c]
Where [hₜ(x) = c] is 1 if hₜ predicts class c, 0 otherwise
Example:
5 models predict: [Positive, Positive, Negative, Positive, Negative]
Votes: Positive = 3, Negative = 2 → H(x) = Positive
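A minimal sketch of relative majority voting, using the five votes above:

```python
from collections import Counter

# Class predictions from 5 classifiers
votes = ["Positive", "Positive", "Negative", "Positive", "Negative"]

# Relative majority (plurality) voting: output the most common class
H_x, count = Counter(votes).most_common(1)[0]
print(H_x, count)  # Positive 3
```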
Output a class only if it receives more than half of the votes (absolute majority); otherwise, reject (make no prediction).
H(x) = c if Σₜ [hₜ(x) = c] > T/2, else REJECT
More conservative: only predicts when there's strong consensus
Example:
With 5 models: [Positive, Positive, Negative, Positive, Negative]
Positive = 3 votes (60% > 50%) → H(x) = Positive
If the votes were [Positive, Positive, Neutral, Neutral, Negative]:
Positive = 2, Neutral = 2, Negative = 1 (no class > 50%) → H(x) = REJECT
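A sketch of absolute majority voting with a reject option; the two vote lists mirror the cases above:

```python
from collections import Counter

def absolute_majority_vote(votes):
    """Return the class with more than half of the votes, otherwise 'REJECT'."""
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else "REJECT"

print(absolute_majority_vote(
    ["Positive", "Positive", "Negative", "Positive", "Negative"]))  # Positive (3/5 > 50%)
print(absolute_majority_vote(
    ["Positive", "Positive", "Neutral", "Neutral", "Negative"]))    # REJECT (no class > 50%)
```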
Each learner's vote is weighted by its performance. Better learners have more influence.
H(x) = argmax_c Σₜ wₜ [hₜ(x) = c]
Where wₜ is the weight of learner hₜ (typically based on accuracy or error rate)
Example:
3 models with weights [0.4, 0.3, 0.3] predict [Positive, Positive, Negative]:
Weighted votes: Positive = 0.4 + 0.3 = 0.7, Negative = 0.3
H(x) = Positive (0.7 > 0.3)
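A short sketch of weighted voting with the three weighted classifiers above:

```python
from collections import defaultdict

# Class predictions and weights of 3 classifiers
votes = ["Positive", "Positive", "Negative"]
weights = [0.4, 0.3, 0.3]

# Weighted voting: sum the weight behind each class, then pick the heaviest class
score = defaultdict(float)
for vote, weight in zip(votes, weights):
    score[vote] += weight

H_x = max(score, key=score.get)
print(H_x)  # Positive (weighted votes: Positive ≈ 0.7, Negative = 0.3)
```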
Stacking (also called Stacked Generalization) uses a meta-learner to learn how best to combine the predictions of the first-level learners. The first-level predictions, typically generated with cross-validation, become the input features on which the meta-learner is trained.
A medical diagnosis system uses 3 specialized models with weighted voting:
| ID | Age | BMI | Glucose | Blood Pressure | Symptoms | Diagnosis |
|---|---|---|---|---|---|---|
| 1 | 45 | 28.5 | 95 | 130 | Mild | Negative |
| 2 | 62 | 32.1 | 145 | 155 | Moderate | Positive |
| 3 | 38 | 24.8 | 88 | 120 | None | Negative |
| 4 | 55 | 29.7 | 132 | 142 | Moderate | Positive |
| 5 | 41 | 26.3 | 102 | 128 | Mild | Negative |
Model 1 (Lab-based): Positive (weight: 0.3)
Model 2 (Symptom-based): Positive (weight: 0.4)
Model 3 (Risk-factor): Positive (weight: 0.3)
Weighted vote result: Positive (unanimous)
All models agree, so the ensemble has very high confidence
A financial system uses Stacking to combine predictions from 3 different models:
| ID | Volume | Volatility | Earnings | Price |
|---|---|---|---|---|
| 1 | 1,500,000 | 0.15 | $2.5 | $125.5 |
| 2 | 2,300,000 | 0.22 | $1.8 | $98.3 |
| 3 | 1,800,000 | 0.18 | $3.2 | $145.8 |
| 4 | 2,100,000 | 0.25 | $1.5 | $87.2 |
| 5 | 1,650,000 | 0.19 | $2.8 | $132.4 |
Linear Regression prediction: $124.20
Random Forest prediction: $126.80
Neural Network prediction: $125.10
Meta-learner input: [124.20, 126.80, 125.10]
Meta-learner output (final prediction): $125.37
The meta-learner learns how to weight the three predictions (weights such as 0.2, 0.5, 0.3 are illustrative; the actual combination, possibly including an intercept, is learned from validation data rather than hand-tuned)
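Below is a minimal sketch of such a two-level setup using scikit-learn's StackingRegressor; the synthetic data, model choices, and hyperparameters are illustrative assumptions, not the exact financial system described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the stock table above:
# columns = [volume (millions of shares), volatility, earnings per share]
rng = np.random.default_rng(0)
X = rng.uniform([1.5, 0.10, 1.0], [2.5, 0.30, 3.5], size=(300, 3))
y = 5 * X[:, 0] - 100 * X[:, 1] + 40 * X[:, 2] + rng.normal(0, 2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level learners plus a linear meta-learner. StackingRegressor feeds
# cross-validated predictions of the base models to the meta-learner (cv=5).
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.predict(X_test[:1]))  # final combined price prediction for one stock
```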
Q: Is weighted averaging always better than simple averaging?
A: No. If weight estimates are unreliable or noisy, simple averaging can be more robust. Weighted averaging helps when you have confident, accurate performance estimates and clear differences between learners.
Q: When should I use absolute majority voting instead of relative majority voting?
A: Use absolute majority when you need high confidence and can afford to reject uncertain cases (e.g., medical diagnosis). Use relative majority when you always need a prediction (e.g., recommendation systems).
Q: How do I train the meta-learner in Stacking without data leakage?
A: Use cross-validation to generate the first-level predictions. For each fold, train the first-level learners on the training folds and predict on the held-out fold. This ensures the meta-learner only sees "unseen" predictions, preventing data leakage (see the sketch at the end of this section).
Q: Can different combination strategies be mixed?
A: Yes. For example, you can use Stacking where the meta-learner itself applies weighted voting, or use different strategies for different subsets of learners. Such hybrid setups, sometimes loosely referred to as "blending", can further improve performance.
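As a sketch of the cross-validation procedure mentioned above, scikit-learn's cross_val_predict can generate the out-of-fold first-level predictions that the meta-learner is trained on; the data and model choices here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

# Toy training data (replace with your own features and targets)
X_train, y_train = make_regression(n_samples=300, n_features=3, noise=5.0, random_state=0)

base_models = [LinearRegression(), RandomForestRegressor(n_estimators=100, random_state=0)]

# Each column holds out-of-fold predictions from one base model, so the
# meta-learner only sees predictions made on data the base model did not train on
meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5) for model in base_models
])

# Fit the meta-learner on these "unseen" first-level predictions
meta_learner = Ridge().fit(meta_features, y_train)
```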