
Ensemble Learning

Master combining multiple models for superior predictions using bagging, boosting, and stacking

Why Ensembles Work

The Wisdom of Crowds

Ensemble methods combine multiple models to achieve better performance than any single model:

  • Diversity: different models make different errors
  • Aggregation: combining predictions cancels out individual errors
  • Performance: the ensemble often beats the single best model

Mathematical Intuition

For regression with $n$ independent models, each with error variance $\sigma^2$:

$$\text{Ensemble variance} = \frac{\sigma^2}{n}$$

Averaging reduces variance! With correlated models (pairwise correlation $\rho$):

$$\text{Ensemble variance} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$$

Lower correlation $\rho$ → greater variance reduction → better ensemble
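A quick numerical check of this formula (a minimal simulation, not from the original text: equicorrelated Gaussian "model errors" are built from a shared component, then averaged):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, rho, n, trials = 1.0, 0.3, 25, 200_000

# e_j = sqrt(rho)*z + sqrt(1-rho)*eps_j has Var = sigma^2 and Corr(e_j, e_k) = rho
z = rng.normal(0.0, sigma, size=(trials, 1))          # shared component
eps = rng.normal(0.0, sigma, size=(trials, n))        # independent components
errors = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps    # correlated model errors

empirical = errors.mean(axis=1).var()                 # variance of the ensemble average
theoretical = rho * sigma**2 + (1 - rho) / n * sigma**2
print(f"empirical: {empirical:.4f}   theoretical: {theoretical:.4f}")
```

The two numbers agree closely, and shrinking `rho` toward zero pushes the ensemble variance toward $\sigma^2/n$.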

Bagging (Bootstrap Aggregating)

Parallel Training with Bootstrap Sampling

Algorithm

  1. For $b = 1$ to $B$ (number of models):
     • Generate bootstrap sample $D_b$ by sampling $n$ points with replacement
     • Train base model $h_b$ on $D_b$
  2. Aggregate predictions:
     • Classification: majority vote
     • Regression: average
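A minimal sketch of the algorithm above, assuming scikit-learn decision trees as the unstable base learner and a synthetic dataset as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

B, n = 100, len(X)
models = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample: n draws with replacement
    tree = DecisionTreeClassifier(random_state=0)    # high-variance base learner
    tree.fit(X[idx], y[idx])
    models.append(tree)

votes = np.stack([m.predict(X) for m in models])     # shape (B, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote (binary labels)
print("training accuracy:", (majority == y).mean())
```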

Advantages

  • Reduces variance (especially for high-variance models like deep trees)
  • Parallelizable training
  • Simple and effective
  • Out-of-bag error estimation

Best Practices

  • Use unstable base learners (decision trees)
  • More trees → better (no overfitting risk)
  • Typical: 50-500 trees
  • Each bootstrap: ~63% unique samples

Random Forest

Bagging + Random Feature Selection

Random Forest extends bagging by adding randomness at each tree split:

Modified Tree Growing:

  • At each node, randomly select $m$ features (a subset of the $p$ total)
  • Choose the best split only from these $m$ features
  • Typical: $m = \sqrt{p}$ (classification) or $m = p/3$ (regression)
  • Repeat for each node independently

Why It Works

Decorrelates trees: Without feature randomness, trees in bagging are still correlated (same strong features dominate)

Increases diversity: Forces each tree to consider different features

Ensemble benefit: Lower correlation → better variance reduction

Key Properties

  • Very few hyperparameters to tune
  • Robust to outliers and noise
  • Handles mixed feature types well
  • Provides feature importance estimates
  • OOB error for free validation
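A typical scikit-learn setup illustrating these properties (the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # more trees: lower variance; adding trees does not overfit
    max_features="sqrt",   # m = sqrt(p): the per-split feature randomness
    oob_score=True,        # out-of-bag samples give a free validation estimate
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
print("top 5 features by importance:", rf.feature_importances_.argsort()[::-1][:5])
```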

Boosting

Sequential Learning from Mistakes

Boosting trains models sequentially, each focusing on correcting previous errors:

$$F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

where the $h_t$ are weak learners and the $\alpha_t$ are their weights (higher for more accurate models)

AdaBoost

Adaptive Boosting - adjusts sample weights:

Algorithm:

  1. Initialize weights: $w_i^{(1)} = 1/n$
  2. For $t = 1$ to $T$:
     • Train $h_t$ on the weighted data
     • Compute the weighted error $\epsilon_t$
     • Set the model weight $\alpha_t$ based on the error rate
     • Update sample weights (increase for misclassified samples)
  3. Final: $H(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
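A short usage sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump (synthetic data; the settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Default base learner is a depth-1 decision tree (a "stump"), i.e. a weak learner.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean().round(3))
```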
Gradient Boosting

Fit new model to negative gradient of loss:

Algorithm:

  1. Initialize: $F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)$ (a constant prediction)
  2. For $t = 1$ to $T$:
     • Compute residuals/pseudo-residuals: $r_{it} = -\partial L(y_i, F(x_i))/\partial F(x_i)$ evaluated at $F = F_{t-1}$
     • Fit $h_t$ to predict the $r_{it}$
     • Update: $F_t(x) = F_{t-1}(x) + \nu_t h_t(x)$
  3. Output: $F_T(x)$
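A brief scikit-learn sketch showing the sequential error reduction, using staged_predict to track test error as boosting rounds accumulate (synthetic regression data, illustrative settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0)
gbr.fit(X_tr, y_tr)

# Test MSE after each boosting round: later models keep correcting earlier residuals.
mse = [np.mean((y_te - pred) ** 2) for pred in gbr.staged_predict(X_te)]
print("test MSE at T=10, 100, 300:", round(mse[9]), round(mse[99]), round(mse[-1]))
```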

Key Differences: Bagging vs Boosting

Bagging:

  • Parallel training
  • Equal weights
  • Reduces variance
  • Less prone to overfitting

Boosting:

  • Sequential training
  • Weighted combination
  • Reduces bias
  • Can overfit if too many rounds

Modern Ensemble Methods

XGBoost

Extreme Gradient Boosting - optimized implementation:

Key Features:

  • Regularization: $L_1$/$L_2$ penalties on leaf weights
  • Parallelization: Column block for parallel tree construction
  • Tree Pruning: Max depth, then prune back
  • Missing Values: Learn best direction automatically
  • Built-in CV: Cross-validation during training

🏆 Dominates ML competitions (Kaggle)
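A sketch of typical usage via XGBoost's scikit-learn interface, assuming the xgboost package is installed (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,           # tree growth limited by depth
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
)
model.fit(X_tr, y_tr)
print("test accuracy:", (model.predict(X_te) == y_te).mean())
```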

Stacking (Stacked Generalization)

Learn how to combine diverse base models:

Algorithm:

  1. Level 0: Train diverse base models (SVM, RF, NN, etc.) using CV
  2. Collect predictions: For each sample, get predictions from all base models
  3. Level 1: Train meta-model using base predictions as features
  4. Test: Base models predict → Meta-model combines

Meta-model learns optimal weighting (can be linear regression, logistic regression, or another ML model)
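A minimal stacking sketch with scikit-learn's StackingClassifier; the choice of base models and meta-model here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Level 0: diverse base models; level 1: logistic regression on their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # cross-validated predictions become meta-features, which avoids leakage
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean().round(3))
```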

Other Notable Methods

LightGBM

Microsoft's fast gradient boosting. Uses histogram-based algorithms and GOSS sampling.

CatBoost

Yandex's boosting. Handles categorical features natively, reduces overfitting.

Voting

Simple: hard voting (majority) or soft voting (average probabilities).
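A small soft-voting sketch with scikit-learn's VotingClassifier (the model mix is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities; "hard" would use majority vote on labels
)
vote.fit(X, y)
print("training accuracy:", vote.score(X, y))
```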

Example: Credit Default Prediction

Comparing Ensemble Methods

Dataset

10,000 customers, 20 features (income, age, debt ratio, payment history...), binary target (default: yes/no)

Model Comparison

| Model                | Accuracy | AUC  | Training Time |
|----------------------|----------|------|---------------|
| Single Decision Tree | 82%      | 0.78 | 1s            |
| Bagging (100 trees)  | 87%      | 0.89 | 15s           |
| Random Forest        | 89%      | 0.92 | 18s           |
| AdaBoost             | 88%      | 0.91 | 25s           |
| Gradient Boosting    | 90%      | 0.93 | 40s           |
| XGBoost              | 91%      | 0.94 | 12s           |

Observations

  • All ensembles beat the single tree significantly
  • Random Forest: best balance of speed and accuracy
  • XGBoost: highest performance, optimized speed
  • Boosting methods generally outperform bagging (but with slower training)

Bagging: Variance Reduction Proof

Mathematical Justification for Bootstrap Aggregating

Setup

Train B models on bootstrap samples, then average:

$$\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} h_b(x)$$

Variance Reduction (Independent Case)

Theorem: If models are independent with variance σ²:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} h_b\right) = \frac{1}{B^2}\sum_{b=1}^{B} \text{Var}(h_b) = \frac{B\sigma^2}{B^2} = \frac{\sigma^2}{B}$$

Variance reduced by factor of B!

Correlated Predictors (Realistic Case)

With correlation ρ between any two models:

$$\text{Var}(\bar{h}) = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 \approx \frac{\sigma^2}{B} + \rho\sigma^2$$

Key Insight:

  • First term ($\sigma^2/B$): vanishes as $B \to \infty$
  • Second term ($\rho\sigma^2$): irreducible, depends on correlation
  • Lower correlation = better variance reduction!

Bootstrap Statistics

Probability a sample appears in bootstrap:

$$P(\text{sample } i \text{ selected}) = 1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \approx 0.632$$

About 63% of original samples in each bootstrap, 37% out-of-bag (OOB) for validation.
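A quick empirical check of the 63.2% figure (simple simulation; the sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 100
unique_frac = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(trials)]
print("empirical fraction of distinct samples:", round(float(np.mean(unique_frac)), 4))  # ~0.632
print("theoretical 1 - (1 - 1/n)^n           :", round(1 - (1 - 1 / n) ** n, 4))         # -> 1 - 1/e
```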

AdaBoost: Weight Update Derivation

Exponential Loss Minimization

AdaBoost Algorithm

Initialize: w_i^(1) = 1/n for all samples

For t = 1 to T:

  1. Train weak learner h_t on weighted data
  2. Compute weighted error: $\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}$
  3. Compute model weight: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  4. Update weights: $w_i^{(t+1)} = w_i^{(t)} \exp\left(-\alpha_t y_i h_t(x_i)\right)$
  5. Normalize weights to sum to 1

Final: $H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
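A compact from-scratch sketch that mirrors these update rules, using depth-1 scikit-learn trees as weak learners on synthetic data (labels mapped to {-1, +1}); it is meant to match the formulas, not to be an optimized implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=1000, n_features=10, random_state=0)
y = 2 * y01 - 1                      # labels in {-1, +1}

n, T = len(X), 50
w = np.full(n, 1 / n)                # uniform initial weights
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)                 # train weak learner on weighted data
    pred = stump.predict(X)
    eps = max(w[pred != y].sum(), 1e-12)             # weighted error (guard against zero)
    alpha = 0.5 * np.log((1 - eps) / eps)            # model weight
    w *= np.exp(-alpha * y * pred)                   # up-weight misclassified samples
    w /= w.sum()                                     # normalize
    stumps.append(stump)
    alphas.append(alpha)

F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```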

Why These Formulas? Exponential Loss

AdaBoost minimizes exponential loss:

$$L = \sum_{i=1}^{n} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)$$

Greedy Optimization (stage t):

Given fixed α_1,...,α_(t-1), h_1,...,h_(t-1), choose α_t and h_t to minimize:

$$\sum_i w_i^{(t)} \exp\left(-\alpha_t y_i h_t(x_i)\right)$$

where w_i^(t) = exp(-y_i Σ_(s<t) α_s h_s(x_i)) are previous round weights

Derivation of α_t Formula

Split sum into correct and incorrect predictions:

$$L_t = \sum_{y_i = h_t(x_i)} w_i e^{-\alpha_t} + \sum_{y_i \neq h_t(x_i)} w_i e^{\alpha_t} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$$

Take derivative and set to zero:

$$\frac{dL_t}{d\alpha_t} = -(1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 0 \quad\Rightarrow\quad \frac{\epsilon_t}{1-\epsilon_t} = e^{-2\alpha_t} \quad\Rightarrow\quad \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$$

Training Error Bound

Theorem: Training error of final classifier:

$$\frac{1}{n}\sum_i \mathbb{1}[H(x_i) \neq y_i] \leq \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

If each weak learner has ε_t < 0.5, this product → 0 exponentially fast!
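A quick numeric illustration of how fast the bound shrinks, assuming every weak learner has error 0.3 (a value chosen only for illustration):

```python
import math

per_round = 2 * math.sqrt(0.3 * 0.7)          # ~0.917 per round when eps_t = 0.3
for T in (10, 50, 100):
    print(f"T={T:3d}   bound on training error: {per_round**T:.2e}")
```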

Gradient Boosting Mathematical Framework

Functional Gradient Descent

Framework

Goal: Minimize loss L over functions F(x):

$$\min_F \sum_{i=1}^{n} L(y_i, F(x_i))$$

Build F iteratively: F_t = F_(t-1) + ν h_t, where h_t is chosen to reduce loss

Pseudo-Residual (Negative Gradient)

At iteration t, compute negative gradient (pseudo-residuals):

$$r_{it} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{t-1}}$$

Examples:

  • Squared loss: r_it = y_i - F_(t-1)(x_i) (actual residuals!)
  • Absolute loss: r_it = sign(y_i - F_(t-1)(x_i))
  • Huber loss: Clipped residuals

Gradient Boosting Algorithm

Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)

For t = 1 to T:

1. Compute pseudo-residuals r_it for all i

2. Fit base learner h_t to [(x_i, r_it)]

3. Find optimal step size:

ν_t = argmin_ν Σ L(y_i, F_(t-1)(x_i) + ν h_t(x_i))

4. Update: F_t(x) = F_(t-1)(x) + ν_t h_t(x)

Return: F_T(x)
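A from-scratch sketch of this loop for squared loss, where the pseudo-residuals are simply y − F(x); a fixed shrinkage factor stands in for the per-round line search (synthetic data, illustrative settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

T, eta = 200, 0.1                      # rounds and shrinkage (see next section)
F = np.full(len(y), y.mean())          # F_0: constant minimizer of squared loss
trees = []

for t in range(T):
    residuals = y - F                          # negative gradient of (1/2)(y - F)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                     # fit h_t to the pseudo-residuals
    F = F + eta * tree.predict(X)              # F_t = F_{t-1} + eta * h_t
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - F) ** 2)), 2))
```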

Shrinkage (Learning Rate)

Regularization: Scale updates by learning rate η ∈ (0,1]:

$$F_t(x) = F_{t-1}(x) + \eta \cdot \nu_t h_t(x)$$

Smaller η → more iterations needed but better generalization. Typical: η = 0.01-0.1

XGBoost Innovations

Second-order Taylor approximation:

$$L(y_i, F(x_i) + h(x_i)) \approx L(y_i, F(x_i)) + g_i\, h(x_i) + \frac{1}{2} h_i\, h(x_i)^2$$

where g_i = ∂L/∂F, h_i = ∂²L/∂F². Uses Newton boosting for faster convergence.
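Following the standard derivation (a sketch, with $\lambda$ denoting the $L_2$ penalty on leaf weights, $I_j$ the samples falling in leaf $j$, and $G_j$, $H_j$ the summed gradients and Hessians), plugging a tree with fixed structure into this approximation gives a closed-form optimal weight per leaf:

$$\text{obj}_j(w_j) = G_j w_j + \tfrac{1}{2}(H_j + \lambda)\, w_j^2, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}_j^* = -\frac{1}{2}\,\frac{G_j^2}{H_j + \lambda}$$

This closed form is what lets XGBoost score candidate splits directly from summed gradients and Hessians.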

Random Forest: Decorrelation Proof

Feature Randomization Reduces Correlation

Problem with Standard Bagging

If one feature is very strong, all trees will use it at top split → high correlation ρ

Recall: Var(ensemble) ≈ σ²/B + ρσ². High ρ limits variance reduction!

Random Forest Solution

At each split, randomly sample m features (out of p total):

  • Typical: m = √p for classification
  • Typical: m = p/3 for regression

Effect on Correlation:

If one feature dominates, probability it's available at any split: m/p

Trees forced to consider different features → lower correlation between trees

Theoretical Analysis

Expected correlation for Random Forest:

$$\mathbb{E}[\rho_{\text{RF}}] \approx \frac{m}{p} \cdot \rho_{\text{bagging}}$$

For $m = \sqrt{p}$ and $p = 100$: correlation is reduced by a factor of $p/m = \sqrt{100} = 10$!

Bias-Variance Tradeoff in m

Small m (high randomness)

  • Lower correlation ✓
  • Lower variance ✓
  • Higher bias (weaker trees) ✗

Large m (less randomness)

  • Higher correlation ✗
  • Higher variance ✗
  • Lower bias (stronger trees) ✓
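One way to see this tradeoff empirically is to sweep m (max_features) and cross-validate; a rough sketch with synthetic data and an arbitrary grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=25, n_informative=6, random_state=0)

for m in (2, 5, 12, 25):   # from heavy feature randomness up to m = p (plain bagging)
    rf = RandomForestClassifier(n_estimators=200, max_features=m, n_jobs=-1, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"m={m:2d}   CV accuracy={score:.3f}")
```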

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main idea behind ensemble learning?
  2. In bagging (Bootstrap Aggregating), each base model is trained on:
  3. Random Forest is an extension of bagging that adds:
  4. What does boosting do differently from bagging?
  5. In AdaBoost, after iteration $t$, how are sample weights updated for a correctly classified sample $i$?
  6. Gradient Boosting builds the ensemble by:
  7. What is 'Out-of-Bag' (OOB) error in Random Forest?
  8. Which ensemble method typically has the highest variance reduction?
  9. What makes XGBoost faster and more effective than basic Gradient Boosting?
  10. In stacking, how is the meta-model (level-1 model) trained?