
Ensemble Learning

Master combining multiple models for superior predictions using bagging, boosting, and stacking

Why Ensembles Work

The Wisdom of Crowds

Ensemble methods combine multiple models to achieve better performance than any single model:

  • Diversity: different models make different errors
  • Aggregation: combining predictions cancels out individual errors
  • Performance: the ensemble often beats the single best model

Mathematical Intuition

For regression with $n$ independent models, each with error variance $\sigma^2$:

$$\text{Ensemble variance} = \frac{\sigma^2}{n}$$

Averaging reduces variance! With correlated models (pairwise correlation $\rho$):

$$\text{Ensemble variance} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$$

Lower correlation $\rho$ → greater variance reduction → better ensemble
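A quick numerical check of this formula (a minimal simulation, not from the original text: equicorrelated Gaussian "model errors" are built from a shared component, then averaged):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, rho, n, trials = 1.0, 0.3, 25, 200_000

# e_j = sqrt(rho)*z + sqrt(1-rho)*eps_j has Var = sigma^2 and Corr(e_j, e_k) = rho
z = rng.normal(0.0, sigma, size=(trials, 1))          # shared component
eps = rng.normal(0.0, sigma, size=(trials, n))        # independent components
errors = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps    # correlated model errors

empirical = errors.mean(axis=1).var()                 # variance of the ensemble average
theoretical = rho * sigma**2 + (1 - rho) / n * sigma**2
print(f"empirical: {empirical:.4f}   theoretical: {theoretical:.4f}")
```

The two numbers agree closely, and shrinking `rho` toward zero pushes the ensemble variance toward $\sigma^2/n$.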

Bagging (Bootstrap Aggregating)

Parallel Training with Bootstrap Sampling

Algorithm

  1. For $b = 1$ to $B$ (number of models):
     • Generate bootstrap sample $D_b$ by sampling $n$ points with replacement
     • Train base model $h_b$ on $D_b$
  2. Aggregate predictions:
     • Classification: majority vote
     • Regression: average
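A minimal sketch of the algorithm above, assuming scikit-learn decision trees as the unstable base learner and a synthetic dataset as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

B, n = 100, len(X)
models = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample: n draws with replacement
    tree = DecisionTreeClassifier(random_state=0)    # high-variance base learner
    tree.fit(X[idx], y[idx])
    models.append(tree)

votes = np.stack([m.predict(X) for m in models])     # shape (B, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote (binary labels)
print("training accuracy:", (majority == y).mean())
```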

Advantages

  • Reduces variance (especially for high-variance models like deep trees)
  • Parallelizable training
  • Simple and effective
  • Out-of-bag error estimation

Best Practices

  • Use unstable base learners (decision trees)
  • More trees → better (no overfitting risk)
  • Typical: 50-500 trees
  • Each bootstrap: ~63% unique samples

Random Forest

Bagging + Random Feature Selection

Random Forest extends bagging by adding randomness at each tree split:

Modified Tree Growing:

  • At each node, randomly select $m$ features (a subset of the $p$ total)
  • Choose the best split only from these $m$ features
  • Typical: $m = \sqrt{p}$ (classification) or $m = p/3$ (regression)
  • Repeat for each node independently

Why It Works

Decorrelates trees: Without feature randomness, trees in bagging are still correlated (same strong features dominate)

Increases diversity: Forces each tree to consider different features

Ensemble benefit: Lower correlation → better variance reduction

Key Properties

  • Very few hyperparameters to tune
  • Robust to outliers and noise
  • Handles mixed feature types well
  • Provides feature importance estimates
  • OOB error for free validation
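A typical scikit-learn setup illustrating these properties (the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # more trees: lower variance; adding trees does not overfit
    max_features="sqrt",   # m = sqrt(p): the per-split feature randomness
    oob_score=True,        # out-of-bag samples give a free validation estimate
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
print("top 5 features by importance:", rf.feature_importances_.argsort()[::-1][:5])
```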

Boosting

Sequential Learning from Mistakes

Boosting trains models sequentially, each focusing on correcting previous errors:

$$F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

where the $h_t$ are weak learners and the $\alpha_t$ are their weights (higher for more accurate models)

AdaBoost

Adaptive Boosting - adjusts sample weights:

Algorithm:

  1. Initialize weights: $w_i^{(1)} = 1/n$
  2. For $t = 1$ to $T$:
     • Train $h_t$ on the weighted data
     • Compute the weighted error $\epsilon_t$
     • Set the model weight $\alpha_t$ based on the error rate
     • Update sample weights (increase for misclassified samples)
  3. Final: $H(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
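A short usage sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump (synthetic data; the settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Default base learner is a depth-1 decision tree (a "stump"), i.e. a weak learner.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean().round(3))
```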
Gradient Boosting

Fit new model to negative gradient of loss:

Algorithm:

  1. Initialize: $F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)$ (a constant prediction)
  2. For $t = 1$ to $T$:
     • Compute residuals/pseudo-residuals: $r_{it} = -\partial L(y_i, F(x_i))/\partial F(x_i)$ evaluated at $F = F_{t-1}$
     • Fit $h_t$ to predict the $r_{it}$
     • Update: $F_t(x) = F_{t-1}(x) + \nu_t h_t(x)$
  3. Output: $F_T(x)$
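A brief scikit-learn sketch showing the sequential error reduction, using staged_predict to track test error as boosting rounds accumulate (synthetic regression data, illustrative settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0)
gbr.fit(X_tr, y_tr)

# Test MSE after each boosting round: later models keep correcting earlier residuals.
mse = [np.mean((y_te - pred) ** 2) for pred in gbr.staged_predict(X_te)]
print("test MSE at T=10, 100, 300:", round(mse[9]), round(mse[99]), round(mse[-1]))
```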

Key Differences: Bagging vs Boosting

Bagging:

  • Parallel training
  • Equal weights
  • Reduces variance
  • Less prone to overfitting

Boosting:

  • Sequential training
  • Weighted combination
  • Reduces bias
  • Can overfit if too many rounds

Modern Ensemble Methods

XGBoost

Extreme Gradient Boosting - optimized implementation:

Key Features:

  • Regularization: $L_1$/$L_2$ penalties on leaf weights
  • Parallelization: Column block for parallel tree construction
  • Tree Pruning: Max depth, then prune back
  • Missing Values: Learn best direction automatically
  • Built-in CV: Cross-validation during training

🏆 Dominates ML competitions (Kaggle)
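A sketch of typical usage via XGBoost's scikit-learn interface, assuming the xgboost package is installed (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,           # tree growth limited by depth
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
)
model.fit(X_tr, y_tr)
print("test accuracy:", (model.predict(X_te) == y_te).mean())
```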

Stacking (Stacked Generalization)

Learn how to combine diverse base models:

Algorithm:

  1. Level 0: Train diverse base models (SVM, RF, NN, etc.) using CV
  2. Collect predictions: For each sample, get predictions from all base models
  3. Level 1: Train meta-model using base predictions as features
  4. Test: Base models predict → Meta-model combines

Meta-model learns optimal weighting (can be linear regression, logistic regression, or another ML model)
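A minimal stacking sketch with scikit-learn's StackingClassifier; the choice of base models and meta-model here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Level 0: diverse base models; level 1: logistic regression on their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # cross-validated predictions become meta-features, which avoids leakage
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean().round(3))
```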

Other Notable Methods

LightGBM

Microsoft's fast gradient boosting. Uses histogram-based algorithms and GOSS sampling.

CatBoost

Yandex's boosting. Handles categorical features natively, reduces overfitting.

Voting

Simple: hard voting (majority) or soft voting (average probabilities).
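A small soft-voting sketch with scikit-learn's VotingClassifier (the model mix is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities; "hard" would use majority vote on labels
)
vote.fit(X, y)
print("training accuracy:", vote.score(X, y))
```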

Example: Credit Default Prediction

Comparing Ensemble Methods

Dataset

10,000 customers, 20 features (income, age, debt ratio, payment history...), binary target (default: yes/no)

Model Comparison

| Model                | Accuracy | AUC  | Training Time |
|----------------------|----------|------|---------------|
| Single Decision Tree | 82%      | 0.78 | 1s            |
| Bagging (100 trees)  | 87%      | 0.89 | 15s           |
| Random Forest        | 89%      | 0.92 | 18s           |
| AdaBoost             | 88%      | 0.91 | 25s           |
| Gradient Boosting    | 90%      | 0.93 | 40s           |
| XGBoost              | 91%      | 0.94 | 12s           |

Observations

  • All ensembles beat the single tree significantly
  • Random Forest: best balance of speed and accuracy
  • XGBoost: highest performance, optimized speed
  • Boosting methods generally outperform bagging (but with slower training)

Bagging: Variance Reduction Proof

Mathematical Justification for Bootstrap Aggregating

Setup

Train B models on bootstrap samples, then average:

$$\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} h_b(x)$$

Variance Reduction (Independent Case)

Theorem: If models are independent with variance σ²:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} h_b\right) = \frac{1}{B^2}\sum_{b=1}^{B} \text{Var}(h_b) = \frac{B\sigma^2}{B^2} = \frac{\sigma^2}{B}$$

Variance reduced by factor of B!

Correlated Predictors (Realistic Case)

With correlation ρ between any two models:

$$\text{Var}(\bar{h}) = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 \approx \frac{\sigma^2}{B} + \rho\sigma^2$$

Key Insight:

  • First term ($\sigma^2/B$): vanishes as $B \to \infty$
  • Second term ($\rho\sigma^2$): irreducible, depends on correlation
  • Lower correlation = better variance reduction!

Bootstrap Statistics

Probability a sample appears in bootstrap:

$$P(\text{sample } i \text{ selected}) = 1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \approx 0.632$$

About 63% of original samples in each bootstrap, 37% out-of-bag (OOB) for validation.
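A quick empirical check of the 63.2% figure (simple simulation; the sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 100
unique_frac = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(trials)]
print("empirical fraction of distinct samples:", round(float(np.mean(unique_frac)), 4))  # ~0.632
print("theoretical 1 - (1 - 1/n)^n           :", round(1 - (1 - 1 / n) ** n, 4))         # -> 1 - 1/e
```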

AdaBoost: Weight Update Derivation

Exponential Loss Minimization

AdaBoost Algorithm

Initialize: w_i^(1) = 1/n for all samples

For t = 1 to T:

  1. Train weak learner h_t on weighted data
  2. Compute weighted error: $\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}$
  3. Compute model weight: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  4. Update weights: $w_i^{(t+1)} = w_i^{(t)} \exp\left(-\alpha_t y_i h_t(x_i)\right)$
  5. Normalize weights to sum to 1

Final: $H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
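A compact from-scratch sketch that mirrors these update rules, using depth-1 scikit-learn trees as weak learners on synthetic data (labels mapped to {-1, +1}); it is meant to match the formulas, not to be an optimized implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=1000, n_features=10, random_state=0)
y = 2 * y01 - 1                      # labels in {-1, +1}

n, T = len(X), 50
w = np.full(n, 1 / n)                # uniform initial weights
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)                 # train weak learner on weighted data
    pred = stump.predict(X)
    eps = max(w[pred != y].sum(), 1e-12)             # weighted error (guard against zero)
    alpha = 0.5 * np.log((1 - eps) / eps)            # model weight
    w *= np.exp(-alpha * y * pred)                   # up-weight misclassified samples
    w /= w.sum()                                     # normalize
    stumps.append(stump)
    alphas.append(alpha)

F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```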

Why These Formulas? Exponential Loss

AdaBoost minimizes exponential loss:

$$L = \sum_{i=1}^{n} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)$$

Greedy Optimization (stage t):

Given fixed α_1,...,α_(t-1), h_1,...,h_(t-1), choose α_t and h_t to minimize:

$$\sum_i w_i^{(t)} \exp\left(-\alpha_t y_i h_t(x_i)\right)$$

where w_i^(t) = exp(-y_i Σ_(s<t) α_s h_s(x_i)) are previous round weights

Derivation of α_t Formula

Split sum into correct and incorrect predictions:

$$L_t = \sum_{y_i = h_t(x_i)} w_i e^{-\alpha_t} + \sum_{y_i \neq h_t(x_i)} w_i e^{\alpha_t} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$$

Take derivative and set to zero:

$$\frac{dL_t}{d\alpha_t} = -(1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 0 \quad\Rightarrow\quad \frac{\epsilon_t}{1-\epsilon_t} = e^{-2\alpha_t} \quad\Rightarrow\quad \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$$

Training Error Bound

Theorem: Training error of final classifier:

$$\frac{1}{n}\sum_i \mathbb{1}[H(x_i) \neq y_i] \leq \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

If each weak learner has ε_t < 0.5, this product → 0 exponentially fast!
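A quick numeric illustration of how fast the bound shrinks, assuming every weak learner has error 0.3 (a value chosen only for illustration):

```python
import math

per_round = 2 * math.sqrt(0.3 * 0.7)          # ~0.917 per round when eps_t = 0.3
for T in (10, 50, 100):
    print(f"T={T:3d}   bound on training error: {per_round**T:.2e}")
```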

Gradient Boosting Mathematical Framework

Functional Gradient Descent

Framework

Goal: Minimize loss L over functions F(x):

$$\min_F \sum_{i=1}^{n} L(y_i, F(x_i))$$

Build F iteratively: F_t = F_(t-1) + ν h_t, where h_t is chosen to reduce loss

Pseudo-Residual (Negative Gradient)

At iteration t, compute negative gradient (pseudo-residuals):

$$r_{it} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{t-1}}$$

Examples:

  • Squared loss: r_it = y_i - F_(t-1)(x_i) (actual residuals!)
  • Absolute loss: r_it = sign(y_i - F_(t-1)(x_i))
  • Huber loss: Clipped residuals

Gradient Boosting Algorithm

Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)

For t = 1 to T:

1. Compute pseudo-residuals r_it for all i

2. Fit base learner h_t to [(x_i, r_it)]

3. Find optimal step size:

ν_t = argmin_ν Σ L(y_i, F_(t-1)(x_i) + ν h_t(x_i))

4. Update: F_t(x) = F_(t-1)(x) + ν_t h_t(x)

Return: F_T(x)
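A from-scratch sketch of this loop for squared loss, where the pseudo-residuals are simply y − F(x); a fixed shrinkage factor stands in for the per-round line search (synthetic data, illustrative settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

T, eta = 200, 0.1                      # rounds and shrinkage (see next section)
F = np.full(len(y), y.mean())          # F_0: constant minimizer of squared loss
trees = []

for t in range(T):
    residuals = y - F                          # negative gradient of (1/2)(y - F)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                     # fit h_t to the pseudo-residuals
    F = F + eta * tree.predict(X)              # F_t = F_{t-1} + eta * h_t
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - F) ** 2)), 2))
```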

Shrinkage (Learning Rate)

Regularization: Scale updates by learning rate η ∈ (0,1]:

$$F_t(x) = F_{t-1}(x) + \eta \cdot \nu_t h_t(x)$$

Smaller η → more iterations needed but better generalization. Typical: η = 0.01-0.1

XGBoost Innovations

Second-order Taylor approximation:

$$L(y_i, F(x_i) + h(x_i)) \approx L(y_i, F(x_i)) + g_i\, h(x_i) + \frac{1}{2} h_i\, h(x_i)^2$$

where g_i = ∂L/∂F, h_i = ∂²L/∂F². Uses Newton boosting for faster convergence.
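Following the standard derivation (a sketch, with $\lambda$ denoting the $L_2$ penalty on leaf weights, $I_j$ the samples falling in leaf $j$, and $G_j$, $H_j$ the summed gradients and Hessians), plugging a tree with fixed structure into this approximation gives a closed-form optimal weight per leaf:

$$\text{obj}_j(w_j) = G_j w_j + \tfrac{1}{2}(H_j + \lambda)\, w_j^2, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}_j^* = -\frac{1}{2}\,\frac{G_j^2}{H_j + \lambda}$$

This closed form is what lets XGBoost score candidate splits directly from summed gradients and Hessians.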

Random Forest: Decorrelation Proof

Feature Randomization Reduces Correlation

Problem with Standard Bagging

If one feature is very strong, all trees will use it at top split → high correlation ρ

Recall: Var(ensemble) ≈ σ²/B + ρσ². High ρ limits variance reduction!

Random Forest Solution

At each split, randomly sample m features (out of p total):

  • Typical: m = √p for classification
  • Typical: m = p/3 for regression

Effect on Correlation:

If one feature dominates, probability it's available at any split: m/p

Trees forced to consider different features → lower correlation between trees

Theoretical Analysis

Expected correlation for Random Forest:

$$\mathbb{E}[\rho_{\text{RF}}] \approx \frac{m}{p} \cdot \rho_{\text{bagging}}$$

For $m = \sqrt{p}$ and $p = 100$: correlation is reduced by a factor of $p/m = \sqrt{100} = 10$!

Bias-Variance Tradeoff in m

Small m (high randomness)

  • Lower correlation ✓
  • Lower variance ✓
  • Higher bias (weaker trees) ✗

Large m (less randomness)

  • Higher correlation ✗
  • Higher variance ✗
  • Lower bias (stronger trees) ✓
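One way to see this tradeoff empirically is to sweep m (max_features) and cross-validate; a rough sketch with synthetic data and an arbitrary grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=25, n_informative=6, random_state=0)

for m in (2, 5, 12, 25):   # from heavy feature randomness up to m = p (plain bagging)
    rf = RandomForestClassifier(n_estimators=200, max_features=m, n_jobs=-1, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"m={m:2d}   CV accuracy={score:.3f}")
```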

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main idea behind ensemble learning?
  2. In bagging (Bootstrap Aggregating), each base model is trained on:
  3. Random Forest is an extension of bagging that adds:
  4. What does boosting do differently from bagging?
  5. In AdaBoost, after iteration $t$, how are sample weights updated for a correctly classified sample $i$?
  6. Gradient Boosting builds the ensemble by:
  7. What is 'Out-of-Bag' (OOB) error in Random Forest?
  8. Which ensemble method typically has the highest variance reduction?
  9. What makes XGBoost faster and more effective than basic Gradient Boosting?
  10. In stacking, how is the meta-model (level-1 model) trained?