Master parallel ensemble learning: how Bagging reduces variance through bootstrap sampling and how Random Forest adds double randomness for even better performance
Bagging (Bootstrap Aggregating) is a parallel ensemble learning method where individual learners are generated independently and in parallel using different bootstrap samples of the training data. Unlike Boosting, there's no dependency between learners.
Bagging works by:
1. Drawing T bootstrap samples (random sampling with replacement) from the training set.
2. Training one base learner on each bootstrap sample, independently and in parallel.
3. Aggregating the predictions: majority voting for classification, averaging for regression.
Bagging primarily reduces variance by averaging out the variability across different training samples. This is especially effective for unstable learners like decision trees.
Since learners are independent, Bagging is highly parallelizable. All T learners can be trained simultaneously, making it computationally efficient.
Bootstrap sampling is the core mechanism that creates diversity in Bagging. Here's how it works:
Given a training set of m samples, each bootstrap sample is built by drawing m times with replacement. The probability that a specific sample is never selected in those m draws is:
P(not selected) = (1 - 1/m)^m
As m → ∞: (1 - 1/m)^m → 1/e ≈ 0.368
So roughly 36.8% of the original samples are out-of-bag (OOB) for any given learner, while about 63.2% of the distinct samples appear in its bootstrap sample.
For example, with a training set of 8 samples:
Tree 1 bootstrap sample: [1, 1, 3, 4, 5, 6, 7, 8]
Sample 2 is not in this bootstrap sample, so it is OOB for Tree 1
Tree 2 bootstrap sample: [2, 2, 3, 4, 5, 6, 7, 8]
Sample 1 is not in this bootstrap sample, so it is OOB for Tree 2
Tree 3 bootstrap sample: [1, 2, 3, 3, 4, 5, 6, 8]
Sample 7 is not in this bootstrap sample, so it is OOB for Tree 3
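The same process is easy to reproduce in code. Here is a minimal sketch using NumPy (my own illustration, not from the original text); the draws depend on the random seed, so they will not match the hand-picked samples above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8            # training set size, matching the example above
n_trees = 3

for t in range(n_trees):
    # One bootstrap sample per tree: draw m indices with replacement
    boot = np.sort(rng.integers(1, m + 1, size=m))
    oob = sorted(set(range(1, m + 1)) - set(boot.tolist()))
    print(f"Tree {t + 1}: bootstrap sample {boot.tolist()}, OOB samples {oob}")

# Empirical check of the ~36.8% OOB fraction for a large m
m = 100_000
boot = rng.integers(0, m, size=m)
print("OOB fraction:", 1 - len(np.unique(boot)) / m)   # close to 1/e ≈ 0.368
```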
The complete Bagging algorithm is straightforward: draw T bootstrap samples, train one base learner on each sample independently, and combine the learners' predictions by voting (classification) or averaging (regression).
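A compact from-scratch sketch of the training and voting steps (function names like bagging_fit are my own; in practice scikit-learn's BaggingClassifier implements the same idea):

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, random_state=0):
    """Train T decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(random_state)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap indices
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Aggregate the T predictions by majority vote."""
    votes = np.stack([tree.predict(X) for tree in learners])   # shape (T, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```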
One of Bagging's key advantages is that it provides a built-in validation mechanism without needing a separate validation set: each learner is evaluated on the roughly 36.8% of samples it never saw during training, and averaging these out-of-bag evaluations gives the OOB error, an estimate of generalization error.
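For instance, scikit-learn's BaggingClassifier can report this estimate directly (the dataset here is just a placeholder for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default base estimator is a decision tree; oob_score=True evaluates each
# sample using only the trees whose bootstrap sample did not contain it.
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print("OOB accuracy estimate:", bag.oob_score_)
```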
Random Forest is an extension of Bagging that adds an additional layer of randomness for even better performance:
1. Sample randomness (bootstrap sampling): each tree is trained on a different bootstrap sample of the training data.
   Example: Tree 1 uses samples [1,1,3,4,5,6,7,8], Tree 2 uses [2,2,3,4,5,6,7,8]
2. Attribute randomness (random feature selection): at each node split, randomly select K attributes from all d attributes, then choose the best split from these K attributes (not from all d attributes).
   Example: If d=10 features, randomly select K=√10≈3 features at each node, then choose the best split from these 3 (not from all 10)
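In scikit-learn both sources of randomness correspond to constructor arguments of RandomForestClassifier; the synthetic dataset below is only an assumption to make the sketch runnable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with d = 10 features, matching the example above
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # attribute randomness: consider K = sqrt(d) ≈ 3 features per split
    bootstrap=True,        # sample randomness: each tree gets its own bootstrap sample
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```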
Bagging's primary benefit is variance reduction. Here's the mathematical intuition:
For regression, if each individual learner has variance σ² and the pairwise correlation between learners is ρ:
Var(H) = (σ²/T) × (1 + (T-1)ρ)
Where H is the ensemble prediction, T is the number of learners
If ρ = 0 (independent learners): Var(H) = σ²/T
Variance shrinks in proportion to 1/T: with 10 independent learners, the ensemble variance is one tenth of an individual learner's variance.
If ρ = 1 (perfectly correlated): Var(H) = σ²
No variance reduction. This is why diversity (low correlation) is crucial.
If ρ = 0.3 (moderate correlation): Var(H) ≈ 0.37σ² (with T=10)
Significant variance reduction, but not as much as with independent learners.
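A quick Monte Carlo check of the formula for T = 10 equicorrelated learners with ρ = 0.3 (my own illustration; the learners are simulated as correlated Gaussian predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, T, n_runs = 1.0, 0.3, 10, 200_000

# Covariance matrix: sigma^2 on the diagonal, rho * sigma^2 off the diagonal
cov = sigma2 * (rho * np.ones((T, T)) + (1 - rho) * np.eye(T))
preds = rng.multivariate_normal(np.zeros(T), cov, size=n_runs)   # shape (n_runs, T)

H = preds.mean(axis=1)                          # ensemble = average of the T learners
print("Simulated Var(H):", H.var())             # ~0.37
print("Formula (σ²/T)(1 + (T-1)ρ):", sigma2 / T * (1 + (T - 1) * rho))   # 0.37
```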
A real estate company uses Bagging with 5 regression trees to predict house prices:
| ID | Sqft | Bedrooms | Bathrooms | Location Score | Year | Price |
|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 1 | 7 | 2005 | $285,000 |
| 2 | 1800 | 3 | 2 | 8.5 | 2010 | $425,000 |
| 3 | 2400 | 4 | 2.5 | 9 | 2015 | $575,000 |
| 4 | 1500 | 3 | 2 | 6 | 2000 | $325,000 |
| 5 | 2100 | 3 | 2 | 8 | 2012 | $485,000 |
| 6 | 950 | 2 | 1 | 5.5 | 1998 | $225,000 |
| 7 | 2800 | 4 | 3 | 9.5 | 2018 | $695,000 |
| 8 | 1650 | 3 | 2 | 7.5 | 2008 | $385,000 |
Tree 1 prediction: $570,000
Tree 2 prediction: $580,000
Tree 3 prediction: $575,000
Tree 4 prediction: $565,000
Tree 5 prediction: $580,000
Ensemble prediction (average): $574,000
The ensemble prediction is more stable than any individual tree prediction
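A hedged sketch of the same setup with scikit-learn's BaggingRegressor, trained on the table above; the query house and the resulting numbers are hypothetical and will not reproduce the illustrative predictions listed above:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Features from the table: [sqft, bedrooms, bathrooms, location score, year]
X = np.array([
    [1200, 2, 1.0, 7.0, 2005],
    [1800, 3, 2.0, 8.5, 2010],
    [2400, 4, 2.5, 9.0, 2015],
    [1500, 3, 2.0, 6.0, 2000],
    [2100, 3, 2.0, 8.0, 2012],
    [ 950, 2, 1.0, 5.5, 1998],
    [2800, 4, 3.0, 9.5, 2018],
    [1650, 3, 2.0, 7.5, 2008],
])
y = np.array([285_000, 425_000, 575_000, 325_000, 485_000, 225_000, 695_000, 385_000])

# Default base estimator is a regression tree; 5 trees to mirror the example
bag = BaggingRegressor(n_estimators=5, random_state=0).fit(X, y)

new_house = np.array([[2350, 4, 2.5, 9.0, 2014]])   # hypothetical house to price
per_tree = [est.predict(new_house[:, feats])[0]     # index features as each tree saw them
            for est, feats in zip(bag.estimators_, bag.estimators_features_)]
print("Individual tree predictions:", per_tree)
print("Ensemble (average) prediction:", bag.predict(new_house)[0])
```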
A wine quality assessment system uses Random Forest with 10 trees to classify wine quality:
| ID | Alcohol (%) | Acidity | pH | Residual Sugar | Quality |
|---|---|---|---|---|---|
| 1 | 9.5 | 0.7 | 3.2 | 2.5 | Good |
| 2 | 11.2 | 0.5 | 3.4 | 1.8 | Excellent |
| 3 | 8.8 | 0.9 | 3.1 | 3.2 | Fair |
| 4 | 12.1 | 0.4 | 3.5 | 1.5 | Excellent |
| 5 | 9.2 | 0.8 | 3.3 | 2.8 | Good |
| 6 | 10.5 | 0.6 | 3.4 | 2.1 | Good |
| 7 | 11.8 | 0.45 | 3.5 | 1.6 | Excellent |
| 8 | 9.0 | 0.85 | 3.2 | 3.0 | Fair |
Votes for "Excellent": 7 trees
Votes for "Good": 2 trees
Votes for "Fair": 1 tree
Ensemble prediction (majority vote): Excellent
High confidence (7/10 trees agree) due to diverse trees making consistent predictions
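A hedged sketch of the same idea with scikit-learn's RandomForestClassifier, trained on the table above; the wine being classified is hypothetical, and predict_proba (the average of the per-tree class probabilities) plays the role of the vote shares:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features from the table: [alcohol, acidity, pH, residual sugar]
X = np.array([
    [ 9.5, 0.70, 3.2, 2.5],
    [11.2, 0.50, 3.4, 1.8],
    [ 8.8, 0.90, 3.1, 3.2],
    [12.1, 0.40, 3.5, 1.5],
    [ 9.2, 0.80, 3.3, 2.8],
    [10.5, 0.60, 3.4, 2.1],
    [11.8, 0.45, 3.5, 1.6],
    [ 9.0, 0.85, 3.2, 3.0],
])
y = np.array(["Good", "Excellent", "Fair", "Excellent",
              "Good", "Good", "Excellent", "Fair"])

rf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0).fit(X, y)

new_wine = np.array([[11.5, 0.48, 3.45, 1.7]])   # hypothetical wine to classify
print("Ensemble prediction:", rf.predict(new_wine)[0])
print("Vote shares:", dict(zip(rf.classes_, rf.predict_proba(new_wine)[0])))
```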
Q: How many trees should I use?
A: Typically 100-500 trees work well. More trees generally improve performance but with diminishing returns. Beyond 500 trees, improvements are usually marginal. Use OOB error to monitor when adding more trees stops helping, as in the sketch below.
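One way to do that monitoring (a sketch on assumed synthetic data, following scikit-learn's warm_start pattern for growing a forest incrementally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Grow the same forest incrementally and watch the OOB error flatten out
rf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for n in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:4d} trees -> OOB error {1 - rf.oob_score_:.4f}")
```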
Q: Should I use Bagging or Random Forest?
A: Random Forest is almost always better than Bagging for decision trees because it adds attribute randomness for more diversity. Use Bagging if you want to ensemble non-tree models (e.g., neural networks) where attribute randomness doesn't apply.
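For example, Bagging can wrap any base learner. A minimal sketch with a small neural network as the base estimator (note: the keyword is estimator in scikit-learn 1.2+, base_estimator in older versions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Ten small neural networks, each trained on its own bootstrap sample
bag = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    n_estimators=10,
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy:", bag.oob_score_)
```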
Q: Can I use OOB error to tune hyperparameters?
A: Yes! OOB error is an unbiased estimate of generalization error. You can use it to select the number of trees, tree depth, or other hyperparameters without needing a separate validation set.
Q: Why does Random Forest beat a single decision tree?
A: A single deep tree overfits and has high variance. Random Forest averages many decorrelated trees, reducing variance while maintaining good fit. The ensemble is more robust and generalizes better to new data.
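A quick comparison sketch on synthetic data (illustrative only; exact scores depend on the dataset), where the forest typically scores noticeably higher than the single tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)               # one deep tree: high variance
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # many decorrelated trees

print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```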