
Bagging & Random Forest

Master parallel ensemble learning: how Bagging reduces variance through bootstrap sampling and how Random Forest adds double randomness for even better performance

What is Bagging?

Parallel Ensemble

Bagging (Bootstrap Aggregating) is a parallel ensemble learning method where individual learners are generated independently and in parallel using different bootstrap samples of the training data. Unlike Boosting, there's no dependency between learners.

Core Bagging Principle

Bagging works by:

  1. Creating T different bootstrap samples from the training data
  2. Training T independent base learners, one on each bootstrap sample
  3. Combining predictions using voting (classification) or averaging (regression)

Key Advantage

Bagging primarily reduces variance by averaging out the variability across different training samples. This is especially effective for unstable learners like decision trees.

Parallelization

Since learners are independent, Bagging is highly parallelizable. All T learners can be trained simultaneously, making it computationally efficient.
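
For instance, in scikit-learn (an assumption here, not something this page prescribes) parallel training is a single flag:

```python
# Minimal sketch: training a Bagging ensemble in parallel with scikit-learn.
# The dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# n_jobs=-1 spreads the 100 independent bootstrap-trained learners
# (decision trees by default) across all available CPU cores.
bag = BaggingClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
```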

Bootstrap Sampling

Bootstrap sampling is the core mechanism that creates diversity in Bagging. Here's how it works:

Bootstrap Sampling Process

  1. Start with training set D containing m samples: D = {x₁, x₂, ..., xₘ}
  2. Randomly sample with replacement m times from D to create bootstrap sample Dₜ
  3. Result: each bootstrap sample contains approximately 63.2% of the original samples, with some samples appearing multiple times and others not appearing at all

Key Properties

  • Same size: Each bootstrap sample has m samples (same as original)
  • With replacement: Same sample can appear multiple times
  • Unique samples: ~63.2% of original samples appear in each bootstrap
  • OOB samples: ~36.8% are "out-of-bag" (not in bootstrap sample)

Why 63.2%?

The probability that a specific sample is not selected in m draws:

P(not selected) = (1 - 1/m)^m

As m → ∞: (1 - 1/m)^m → 1/e ≈ 0.368

So ~36.8% are OOB, meaning ~63.2% are in the bootstrap sample
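
A quick simulation (a sketch assuming NumPy, not part of the original derivation) confirms the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000                            # training-set size
idx = rng.integers(0, m, size=m)       # one bootstrap sample: m draws with replacement

in_bag = len(np.unique(idx)) / m       # fraction of original samples that appear
print(f"in-bag fraction:     {in_bag:.3f}")      # ≈ 0.632
print(f"out-of-bag fraction: {1 - in_bag:.3f}")  # ≈ 0.368 ≈ 1/e
```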

Example: Bootstrap Sampling (8 samples, 3 trees)

Tree 1
OOB: Sample 2

Bootstrap samples: [1, 1, 3, 4, 5, 6, 7, 8]

Sample 2 is not in this bootstrap sample, so it's OOB for Tree 1

Tree 2
OOB: Sample 1

Bootstrap samples: [2, 2, 3, 4, 5, 6, 7, 8]

Sample 1 is not in this bootstrap sample, so it's OOB for Tree 2

Tree 3
OOB: Sample 7

Bootstrap samples: [1, 2, 3, 3, 4, 5, 6, 8]

Sample 7 is not in this bootstrap sample, so it's OOB for Tree 3

Bagging Algorithm

The complete Bagging algorithm is straightforward:

Bagging Algorithm Steps

  1. For t = 1 to T:
    • a. Create bootstrap sample Dₜ by sampling m samples with replacement from D
    • b. Train base learner hₜ on Dₜ
  2. For classification: H(x) = argmax_c Σₜ [hₜ(x) = c] (majority voting)
  3. For regression: H(x) = (1/T) Σₜ hₜ(x) (simple averaging)
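
A minimal sketch of these steps for regression, assuming NumPy arrays and scikit-learn decision trees as base learners (the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, T=100, random_state=0):
    """Step 1: train T trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    m = len(X)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)              # bootstrap sample D_t
        learners.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Step 3 (regression): H(x) = (1/T) * sum_t h_t(x)."""
    return np.mean([h.predict(X) for h in learners], axis=0)
```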

Advantages

  • Low time complexity (same as single learner)
  • Highly parallelizable
  • Reduces variance effectively
  • Supports OOB estimation (no separate validation set needed)

Best For

  • Unstable learners (decision trees, neural networks)
  • High-variance models
  • Large datasets
  • When parallel computing is available

Out-of-Bag (OOB) Estimation

One of Bagging's key advantages is that it provides a built-in validation mechanism without needing a separate validation set:

OOB Estimation Process

  1. For each sample xᵢ, identify which trees did not use xᵢ in training (OOB trees)
  2. Use only OOB trees to predict xᵢ (since xᵢ wasn't in their training data, this is like a test prediction)
  3. Aggregate OOB predictions across all samples to estimate ensemble performance
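
A compact sketch of these three steps for classification, assuming NumPy arrays X, y and scikit-learn decision trees as base learners (the function name is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_oob_error(X, y, T=100, random_state=0):
    rng = np.random.default_rng(random_state)
    m, classes = len(X), np.unique(y)
    votes = np.zeros((m, len(classes)))               # OOB votes per sample and class
    for _ in range(T):
        idx = rng.integers(0, m, size=m)              # bootstrap sample for this tree
        oob = np.setdiff1d(np.arange(m), idx)         # samples this tree never saw
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i, np.searchsorted(classes, pred)] += 1
    covered = votes.sum(axis=1) > 0                   # samples with at least one OOB tree
    oob_pred = classes[votes[covered].argmax(axis=1)]
    return np.mean(oob_pred != y[covered])            # OOB error estimate
```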

Advantages of OOB Estimation

  • No separate validation set needed: Use all data for training, still get unbiased performance estimate
  • Computationally efficient: No need to hold out data or do cross-validation
  • Unbiased estimate: OOB samples are truly unseen by their corresponding trees
  • Feature importance: Can compute feature importance using OOB predictions

Random Forest

Random Forest is an extension of Bagging that adds an additional layer of randomness for even better performance:

Double Randomness in Random Forest

1. Sample Randomness (from Bagging)

Each tree is trained on a different bootstrap sample of the training data.

Example: Tree 1 uses samples [1,1,3,4,5,6,7,8], Tree 2 uses [2,2,3,4,5,6,7,8]

2. Attribute Randomness (Random Forest addition)

At each node split, randomly select K attributes from all d attributes, then choose the best split from these K attributes (not from all d attributes).

Example: If d=10 features, randomly select K=√10≈3 features at each node, then choose best split from these 3 (not from all 10)

Why Random Forest Works Better

  • More diversity: Attribute randomness increases diversity between trees
  • Reduced correlation: Trees are less correlated, improving ensemble performance
  • Better generalization: Experimental results show lower test error than Bagging

Typical K Values

  • K = √d: Common choice for classification (d = number of features)
  • K = d/3: Common choice for regression
  • K = d: No attribute randomness; every split considers all features, so the method reduces to plain Bagging
  • K = log₂(d): Another common recommendation, useful for very high-dimensional data
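
For reference, these choices map onto scikit-learn's max_features parameter roughly as follows (a sketch; defaults differ across scikit-learn versions):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")     # K = √d, common for classification
reg = RandomForestRegressor(max_features=1/3)         # K = d/3 (a float means a fraction of d)
bag = RandomForestRegressor(max_features=None)        # K = d: no attribute randomness, i.e. plain Bagging of trees
high_dim = RandomForestClassifier(max_features="log2")  # K = log₂(d)
```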

Random Forest Algorithm

  1. For t = 1 to T:
    • a. Create bootstrap sample Dₜ
    • b. Train decision tree hₜ on Dₜ with attribute randomness:
      • At each node, randomly select K attributes
      • Choose best split from these K attributes
  2. Output: H(x) = (1/T) Σₜ hₜ(x) for regression, or majority vote for classification
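
The step that distinguishes Random Forest from plain Bagging is 1b. Here is a self-contained sketch of that restricted split search for regression, using weighted squared error as the impurity (the function names are illustrative):

```python
import numpy as np

def best_split_among(X, y, features):
    """Return the (feature, threshold) with the lowest weighted squared error,
    searching only the given candidate features."""
    best_f, best_t, best_score = None, None, np.inf
    for f in features:
        for t in np.unique(X[:, f])[:-1]:             # candidate thresholds
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            score = len(left) * left.var() + len(right) * right.var()
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def random_forest_split(X, y, K, rng):
    """At each node, search only K randomly chosen attributes instead of all d."""
    d = X.shape[1]
    candidates = rng.choice(d, size=min(K, d), replace=False)
    return best_split_among(X, y, candidates)
```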

How Bagging Reduces Variance

Bagging's primary benefit is variance reduction. Here's the mathematical intuition:

Variance Reduction Theory

For regression, if individual learners have variance σ² and correlation ρ:

Var(H) = (σ²/T) × (1 + (T-1)ρ)

Where H is the ensemble prediction, T is the number of learners

If ρ = 0 (independent learners): Var(H) = σ²/T

Variance falls as 1/T: with 10 independent learners, the ensemble variance is 1/10 of an individual learner's variance.

If ρ = 1 (perfectly correlated): Var(H) = σ²

No variance reduction. This is why diversity (low correlation) is crucial.

If ρ = 0.3 (moderate correlation): Var(H) ≈ 0.37σ² (with T=10)

Significant variance reduction, but not as much as with independent learners.
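
Plugging numbers into the formula above (a small arithmetic check, not additional theory):

```python
def ensemble_variance_ratio(T, rho):
    """Var(H) / σ² = (1 + (T - 1) * rho) / T, from the formula above."""
    return (1 + (T - 1) * rho) / T

for rho in (0.0, 0.3, 1.0):
    print(f"rho = {rho:.1f}, T = 10  ->  Var(H) = {ensemble_variance_ratio(10, rho):.2f} σ²")
# rho = 0.0 -> 0.10 σ²  (the 1/T reduction)
# rho = 0.3 -> 0.37 σ²  (matches the figure quoted above)
# rho = 1.0 -> 1.00 σ²  (no reduction at all)
```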

Why This Matters

  • Unstable learners benefit most: Decision trees have high variance (small data changes → different trees). Bagging averages this out.
  • Stable learners benefit less: Linear models have low variance, so Bagging offers little improvement (and it does not reduce bias).
  • Random Forest adds more diversity: Attribute randomness reduces correlation further, improving variance reduction.

Example: Housing Price Prediction with Bagging

A real estate company uses Bagging with 5 regression trees to predict house prices:

| ID | Sqft | Bedrooms | Bathrooms | Location (score) | Year | Price |
|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 1 | 7.0 | 2005 | $285,000 |
| 2 | 1800 | 3 | 2 | 8.5 | 2010 | $425,000 |
| 3 | 2400 | 4 | 2.5 | 9.0 | 2015 | $575,000 |
| 4 | 1500 | 3 | 2 | 6.0 | 2000 | $325,000 |
| 5 | 2100 | 3 | 2 | 8.0 | 2012 | $485,000 |
| 6 | 950 | 2 | 1 | 5.5 | 1998 | $225,000 |
| 7 | 2800 | 4 | 3 | 9.5 | 2018 | $695,000 |
| 8 | 1650 | 3 | 2 | 7.5 | 2008 | $385,000 |

Bagging Setup:

  • 5 regression trees, each trained on a different bootstrap sample
  • Each tree makes independent predictions
  • Final prediction = average of 5 tree predictions

Example Prediction (House ID 3):

Tree 1 prediction: $570,000

Tree 2 prediction: $580,000

Tree 3 prediction: $575,000

Tree 4 prediction: $565,000

Tree 5 prediction: $580,000

Ensemble prediction (average): $574,000

The ensemble prediction is more stable than any individual tree prediction
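
This setup can be sketched with scikit-learn's BaggingRegressor; the per-tree dollar figures above are illustrative rather than outputs of this exact code:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Columns: sqft, bedrooms, bathrooms, location score, year built
X = np.array([
    [1200, 2, 1.0, 7.0, 2005],
    [1800, 3, 2.0, 8.5, 2010],
    [2400, 4, 2.5, 9.0, 2015],
    [1500, 3, 2.0, 6.0, 2000],
    [2100, 3, 2.0, 8.0, 2012],
    [ 950, 2, 1.0, 5.5, 1998],
    [2800, 4, 3.0, 9.5, 2018],
    [1650, 3, 2.0, 7.5, 2008],
])
y = np.array([285_000, 425_000, 575_000, 325_000,
              485_000, 225_000, 695_000, 385_000])

# 5 trees (the default base learner is a decision tree), each on its own bootstrap sample
bag = BaggingRegressor(n_estimators=5, random_state=0).fit(X, y)

# The ensemble prediction is the average of the 5 trees' predictions
print(bag.predict(X[[2]]))   # prediction for house ID 3
```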

Example: Wine Quality Classification with Random Forest

A wine quality assessment system uses Random Forest with 10 trees to classify wine quality:

| ID | Alcohol (%) | Acidity | pH | Residual Sugar | Quality |
|---|---|---|---|---|---|
| 1 | 9.5 | 0.70 | 3.2 | 2.5 | Good |
| 2 | 11.2 | 0.50 | 3.4 | 1.8 | Excellent |
| 3 | 8.8 | 0.90 | 3.1 | 3.2 | Fair |
| 4 | 12.1 | 0.40 | 3.5 | 1.5 | Excellent |
| 5 | 9.2 | 0.80 | 3.3 | 2.8 | Good |
| 6 | 10.5 | 0.60 | 3.4 | 2.1 | Good |
| 7 | 11.8 | 0.45 | 3.5 | 1.6 | Excellent |
| 8 | 9.0 | 0.85 | 3.2 | 3.0 | Fair |

Random Forest Setup:

  • 10 decision trees, each on a different bootstrap sample
  • Attribute randomness: At each node, randomly select K=2 features from 4 total features
  • Final prediction: Majority vote across 10 trees

Example Prediction (Wine ID 2):

Votes for "Excellent": 7 trees

Votes for "Good": 2 trees

Votes for "Fair": 1 tree

Ensemble prediction (majority vote): Excellent

High confidence (7/10 trees agree) due to diverse trees making consistent predictions
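
A sketch of this setup with scikit-learn (10 trees, K = 2 of the 4 features per split); with only 8 training rows, the vote counts above should be read as illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: alcohol (%), acidity, pH, residual sugar
X = np.array([
    [ 9.5, 0.70, 3.2, 2.5],
    [11.2, 0.50, 3.4, 1.8],
    [ 8.8, 0.90, 3.1, 3.2],
    [12.1, 0.40, 3.5, 1.5],
    [ 9.2, 0.80, 3.3, 2.8],
    [10.5, 0.60, 3.4, 2.1],
    [11.8, 0.45, 3.5, 1.6],
    [ 9.0, 0.85, 3.2, 3.0],
])
y = np.array(["Good", "Excellent", "Fair", "Excellent",
              "Good", "Good", "Excellent", "Fair"])

rf = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0).fit(X, y)

# Each fitted tree outputs an encoded class index; mapping it through rf.classes_
# recovers that tree's vote. rf.predict aggregates the trees (scikit-learn averages
# class probabilities, which for fully grown trees typically matches the majority vote).
wine_2 = X[[1]]
votes = [rf.classes_[int(t.predict(wine_2)[0])] for t in rf.estimators_]
labels, counts = np.unique(votes, return_counts=True)
print(dict(zip(labels, counts)))   # vote tally across the 10 trees
print(rf.predict(wine_2))          # ensemble prediction for wine ID 2
```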

Frequently Asked Questions

Q: How many trees should I use in Bagging/Random Forest?

A: Typically 100-500 trees work well. More trees generally improve performance but with diminishing returns. Beyond 500 trees, improvements are usually marginal. Use OOB error to monitor when adding more trees stops helping.
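
One way to monitor this in practice, sketched with scikit-learn's warm_start so the forest grows incrementally while OOB error is tracked (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n in (50, 100, 200, 400):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)                          # only the newly added trees are trained
    print(f"{n:4d} trees: OOB error = {1 - rf.oob_score_:.4f}")
```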

Q: When should I use Bagging vs Random Forest?

A: Random Forest is almost always better than Bagging for decision trees because it adds attribute randomness for more diversity. Use Bagging if you want to ensemble non-tree models (e.g., neural networks) where attribute randomness doesn't apply.

Q: Can I use OOB error for model selection?

A: Yes! OOB error is an unbiased estimate of generalization error. You can use it to select the number of trees, tree depth, or other hyperparameters without needing a separate validation set.

Q: Why does Random Forest work better than a single deep tree?

A: A single deep tree overfits and has high variance. Random Forest averages many shallow trees, reducing variance while maintaining good fit. The ensemble is more robust and generalizes better to new data.