Master combining multiple models for superior predictions using bagging, boosting, and stacking
Ensemble methods combine multiple models to achieve better performance than any single model:
• Diversity: Different models make different errors
• Aggregation: Combining predictions cancels out individual errors
• Performance: The ensemble often beats the single best model
For regression with B independent models, each with error variance σ²:
Var((1/B) Σ_(b=1)^B f_b(x)) = σ²/B
Averaging reduces variance! With correlated models (pairwise correlation ρ):
Var((1/B) Σ_(b=1)^B f_b(x)) = ρσ² + (1-ρ)σ²/B
Lower correlation ρ → greater variance reduction → better ensemble
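These formulas are easy to check numerically. The sketch below is a minimal simulation (illustrative values of B, σ², and ρ, not from the text): it draws correlated model errors and compares the empirical variance of their average with ρσ² + (1-ρ)σ²/B.

```python
# Minimal simulation (illustrative B, sigma^2, rho): draw correlated model
# errors and compare the empirical variance of their average with theory.
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho, n_trials = 25, 1.0, 0.3, 100_000

# Covariance: sigma^2 on the diagonal, rho*sigma^2 off the diagonal.
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))
errors = rng.multivariate_normal(np.zeros(B), cov, size=n_trials)

print("empirical variance of the average:", errors.mean(axis=1).var())
print("theory rho*s2 + (1-rho)*s2/B:     ", rho * sigma2 + (1 - rho) * sigma2 / B)
```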
Random Forest extends bagging by adding randomness at each tree split:
• Decorrelates trees: Without feature randomness, trees in bagging are still correlated (same strong features dominate)
• Increases diversity: Forces each tree to consider different features
• Ensemble benefit: Lower correlation → better variance reduction
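To see the effect in practice, here is a rough comparison on synthetic data (an assumed setup using scikit-learn's stock BaggingClassifier and RandomForestClassifier; the numbers will vary by dataset):

```python
# Rough comparison (assumed synthetic data): plain bagging vs. a random forest
# that samples sqrt(p) features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# Bagged trees: every split may reuse the same strong features.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=0)
# Random forest: only a random subset of features is considered per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)

print("bagging:      ", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```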
Boosting trains models sequentially, each focusing on correcting previous errors:
F(x) = Σ_(t=1)^T α_t h_t(x), where the h_t are weak learners and the α_t are weights (higher for better models)
AdaBoost (Adaptive Boosting) - adjusts sample weights so each new learner focuses on the samples the previous ones misclassified:
Algorithm (in brief; full steps later): reweight the samples after each round, fit the next weak learner to the reweighted data, and combine the learners with weights α_t.
Gradient Boosting - fits each new model to the negative gradient of the loss (the pseudo-residuals):
Algorithm (in brief; full steps later): start from the best constant F_0, then repeatedly fit h_t to the pseudo-residuals and update F_t = F_(t-1) + ν h_t.
Bagging: trains models independently (in parallel) on bootstrap samples and averages them with equal weights; primarily reduces variance.
Boosting: trains models sequentially, each correcting the previous errors, and combines them in a weighted sum; primarily reduces bias.
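A quick way to see the contrast is to feed the same weak learner to both schemes. The sketch below is illustrative only (synthetic data, untuned scikit-learn defaults):

```python
# Same weak learner, two combination schemes (illustrative, untuned).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)

bagging = BaggingClassifier(stump, n_estimators=200, random_state=0)    # parallel
boosting = AdaBoostClassifier(stump, n_estimators=200, random_state=0)  # sequential

for name, model in [("bagging ", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```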
XGBoost (Extreme Gradient Boosting) - an optimized gradient boosting implementation:
Key Features: a regularized objective, second-order (Newton) optimization of the loss, and a fast, parallelized implementation.
🏆 Dominates ML competitions (Kaggle)
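For reference, a minimal sketch of the XGBoost scikit-learn interface (the hyperparameter values here are illustrative, not tuned):

```python
# Minimal XGBoost usage via its scikit-learn API (parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4,
                      subsample=0.8, colsample_bytree=0.8, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```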
Stacking - learns how to combine diverse base models:
Algorithm: (1) train several diverse base models; (2) generate out-of-fold predictions from each base model for every training sample; (3) train a meta-model on those predictions.
Meta-model learns optimal weighting (can be linear regression, logistic regression, or another ML model)
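scikit-learn's StackingClassifier implements this out-of-fold scheme; the sketch below uses an assumed pair of base models and a logistic-regression meta-model:

```python
# Stacking sketch: base models -> out-of-fold predictions -> logistic meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(), cv=5)
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```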
• LightGBM: Microsoft's fast gradient boosting framework; uses histogram-based split finding and GOSS (Gradient-based One-Side Sampling).
• CatBoost: Yandex's gradient boosting library; handles categorical features natively and reduces overfitting.
• Voting: the simplest combiner - hard voting (majority class) or soft voting (average of predicted probabilities).
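Of these, voting is the easiest to try; an illustrative sketch with three assumed base models:

```python
# Hard vs. soft voting over three assumed base models (illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("nb", GaussianNB()),
]

hard = VotingClassifier(estimators, voting="hard")   # majority class wins
soft = VotingClassifier(estimators, voting="soft")   # average predicted probabilities
for name, model in [("hard voting", hard), ("soft voting", soft)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```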
10,000 customers, 20 features (income, age, debt ratio, payment history...), binary target (default: yes/no)
| Model | Accuracy | AUC | Training Time |
|---|---|---|---|
| Single Decision Tree | 82% | 0.78 | 1s |
| Bagging (100 trees) | 87% | 0.89 | 15s |
| Random Forest | 89% | 0.92 | 18s |
| AdaBoost | 88% | 0.91 | 25s |
| Gradient Boosting | 90% | 0.93 | 40s |
| XGBoost | 91% | 0.94 | 12s |
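The exact numbers depend on the data and hyperparameters, so treat the table as indicative. A comparable experiment can be run with a sketch like the one below (synthetic data standing in for the credit dataset, untuned models; accuracies and timings will not match the table):

```python
# Hypothetical re-run of the comparison on synthetic data; numbers will differ
# from the table above.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Single Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging (100 trees)": BaggingClassifier(n_estimators=100, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: acc={acc:.3f}  auc={auc:.3f}  time={time.time() - start:.1f}s")
```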
Train B models f_1, ..., f_B on bootstrap samples, then average:
f_bag(x) = (1/B) Σ_(b=1)^B f_b(x)
Theorem: If the models are independent, each with variance σ²:
Var(f_bag(x)) = σ²/B
Variance is reduced by a factor of B!
With correlation ρ between any two models:
Var(f_bag(x)) = ρσ² + (1-ρ)σ²/B
Key Insight: as B → ∞ the variance approaches ρσ², so the correlation between models - not the number of models - is what ultimately limits the gain.
Probability a given sample appears in a bootstrap sample:
P(sample i chosen) = 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632
About 63% of the original samples appear in each bootstrap sample; the remaining ~37% are out-of-bag (OOB) and can be used for validation.
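A short simulation confirms the 63%/37% split (illustrative n = 10,000):

```python
# Quick numerical check of the ~63% bootstrap inclusion rate.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap = rng.integers(0, n, size=n)          # draw n indices with replacement
in_bag = np.unique(bootstrap).size / n

print(f"fraction in bag:   {in_bag:.3f}")       # ~ 0.632
print(f"1 - (1 - 1/n)^n =  {1 - (1 - 1/n) ** n:.3f}")
print(f"1 - 1/e         =  {1 - np.exp(-1):.3f}")
```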
Initialize: w_i^(1) = 1/n for all samples
For t = 1 to T:
1. Fit a weak learner h_t to the training data using weights w_i^(t)
2. Compute the weighted error: ε_t = Σ_i w_i^(t) 1[h_t(x_i) ≠ y_i]
3. Compute the learner weight: α_t = ½ ln((1-ε_t)/ε_t)
4. Update the sample weights: w_i^(t+1) ∝ w_i^(t) exp(-α_t y_i h_t(x_i)), then renormalize to sum to 1
Final: H(x) = sign(Σ_(t=1)^T α_t h_t(x))
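The steps above translate almost line-for-line into code. This is a from-scratch sketch (decision stumps as weak learners, labels recoded to ±1; an illustration, not a production implementation):

```python
# From-scratch AdaBoost sketch following the steps above (decision stumps as
# weak learners; labels recoded to {-1, +1}). Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
y = 2 * y - 1                                   # {0, 1} -> {-1, +1}
n, T = len(y), 50

w = np.full(n, 1.0 / n)                         # initialize: w_i^(1) = 1/n
alphas, stumps = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)            # 1. fit weak learner to weighted data
    pred = stump.predict(X)
    eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # 2. weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)       # 3. learner weight alpha_t
    w *= np.exp(-alpha * y * pred)              # 4. up-weight mistakes,
    w /= w.sum()                                #    renormalize to sum to 1
    alphas.append(alpha)
    stumps.append(stump)

# Final classifier: H(x) = sign(sum_t alpha_t h_t(x))
H = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (H == y).mean())
```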
AdaBoost minimizes exponential loss:
L = Σ_i exp(-y_i F(x_i)), where F(x) = Σ_t α_t h_t(x)
Greedy Optimization (stage t):
Given fixed α_1,...,α_(t-1), h_1,...,h_(t-1), choose α_t and h_t to minimize:
Σ_i exp(-y_i [F_(t-1)(x_i) + α_t h_t(x_i)]) = Σ_i w_i^(t) exp(-α_t y_i h_t(x_i))
where w_i^(t) = exp(-y_i Σ_(s<t) α_s h_s(x_i)) are previous round weights
Split the sum into correct and incorrect predictions:
e^(-α_t) Σ_(i: h_t(x_i)=y_i) w_i^(t) + e^(α_t) Σ_(i: h_t(x_i)≠y_i) w_i^(t)
Take the derivative with respect to α_t and set it to zero:
α_t = ½ ln((1-ε_t)/ε_t), where ε_t is the weighted error rate of h_t
Theorem: The training error of the final classifier satisfies:
(1/n) Σ_i 1[H(x_i) ≠ y_i] ≤ Π_(t=1)^T 2√(ε_t(1-ε_t))
If each weak learner has ε_t < 0.5, this product → 0 exponentially fast!
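For a concrete sense of the rate, plugging a constant edge into the bound (an arbitrary ε_t = 0.4 here) shows the decay:

```python
# Illustrative evaluation of the bound with a constant eps_t = 0.4 (arbitrary).
import numpy as np

eps = 0.4
for T in (50, 200, 500):
    bound = (2 * np.sqrt(eps * (1 - eps))) ** T
    print(f"T = {T:3d}: training error <= {bound:.2e}")
```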
Goal: Minimize the loss L over functions F(x):
F* = argmin_F Σ_i L(y_i, F(x_i))
Build F iteratively: F_t = F_(t-1) + ν h_t, where h_t is chosen to reduce loss
At iteration t, compute the negative gradient (pseudo-residuals):
r_it = -[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F = F_(t-1)
Examples: for squared loss L = ½(y - F)², the pseudo-residuals are the ordinary residuals r_it = y_i - F_(t-1)(x_i); for absolute loss L = |y - F|, they are sign(y_i - F_(t-1)(x_i)).
Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)
For t = 1 to T:
1. Compute pseudo-residuals r_it for all i
2. Fit base learner h_t to [(x_i, r_it)]
3. Find optimal step size:
ν_t = argmin_ν Σ L(y_i, F_(t-1)(x_i) + ν h_t(x_i))
4. Update: F_t(x) = F_(t-1)(x) + ν_t h_t(x)
Return: F_T(x)
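A from-scratch sketch of this loop for squared loss (regression stumps as base learners; a fixed shrinkage ν is used instead of the line search in step 3, and the data and settings are illustrative):

```python
# From-scratch gradient boosting sketch for squared loss (regression stumps,
# fixed shrinkage nu instead of a per-step line search). Illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
T, nu = 200, 0.1

F = np.full(len(y), y.mean())                   # F_0: best constant under squared loss
trees = []
for t in range(T):
    residuals = y - F                           # pseudo-residuals = negative gradient
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                      # fit h_t to the pseudo-residuals
    F += nu * tree.predict(X)                   # F_t = F_(t-1) + nu * h_t
    trees.append(tree)

print("training MSE:", np.mean((y - F) ** 2))

def predict(X_new):
    """F_T(x) = F_0 + nu * sum_t h_t(x)."""
    return y.mean() + nu * sum(tree.predict(X_new) for tree in trees)
```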
Regularization: Scale updates by a learning rate η ∈ (0,1]:
F_t(x) = F_(t-1)(x) + η ν_t h_t(x)
Smaller η → more iterations needed but better generalization. Typical: η = 0.01-0.1
Second-order Taylor approximation of the loss around F_(t-1):
L(y_i, F_(t-1)(x_i) + h_t(x_i)) ≈ L(y_i, F_(t-1)(x_i)) + g_i h_t(x_i) + ½ h_i h_t(x_i)²
where g_i = ∂L/∂F and h_i = ∂²L/∂F² are evaluated at F_(t-1). XGBoost uses this Newton-style boosting step for faster convergence.
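One consequence of the quadratic approximation (from the XGBoost objective, with an L2 penalty λ on leaf weights) is a closed-form optimal constant update w* = -Σg_i / (Σh_i + λ). The small numpy sketch below works this out for squared loss; the data and λ are arbitrary assumptions.

```python
# Worked Newton (second-order) step for squared loss L = 0.5*(y - F)^2:
# g_i = F(x_i) - y_i, h_i = 1, and the optimal constant update for a group of
# samples is w* = -sum(g) / (sum(h) + lam), with lam an XGBoost-style L2 penalty.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)                 # arbitrary targets
F = np.zeros_like(y)                     # current predictions F_(t-1)
lam = 1.0                                # assumed regularization strength

g = F - y                                # first derivatives of the loss
h = np.ones_like(y)                      # second derivatives (constant here)
w_star = -g.sum() / (h.sum() + lam)

print("Newton update w*:         ", w_star)
print("regularized mean residual:", (y - F).sum() / (len(y) + lam))  # identical
```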
If one feature is very strong, all trees will use it at the top split → high correlation ρ
Recall: Var(ensemble) ≈ σ²/B + ρσ². High ρ limits variance reduction!
At each split, randomly sample m candidate features (out of p total). Typical choices: m = √p for classification, m = p/3 for regression.
Effect on Correlation:
If one feature dominates, probability it's available at any split: m/p
Trees forced to consider different features → lower correlation between trees
Expected correlation between trees (rough heuristic): ρ_RF ≈ (m/p) · ρ_bagging
For m = √p and p = 100: m/p = 1/10, so the correlation is reduced by a factor of p/m = √100 = 10!
• Small m (high randomness): trees are less correlated, but each individual tree is weaker (higher bias)
• Large m (less randomness): individual trees are stronger, but more correlated, so averaging helps less
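In practice the best m is found empirically; an illustrative sweep over max_features (assumed synthetic data with p = 100 features):

```python
# Illustrative sweep over max_features (p = 100, so m = 10 corresponds to sqrt(p)).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           random_state=0)
for m in (2, 10, 30, 100):                  # very random -> essentially plain bagging
    forest = RandomForestClassifier(n_estimators=200, max_features=m,
                                    random_state=0)
    print(f"m = {m:3d}: accuracy = {cross_val_score(forest, X, y, cv=5).mean():.3f}")
```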