
Pruning Techniques

Learn how to prevent overfitting and improve generalization through pre-pruning and post-pruning strategies for decision trees

Why Prune Decision Trees?

Overfitting Prevention

Without constraints, decision trees grow until they perfectly classify every training sample—memorizing noise, outliers, and irrelevant patterns. This leads to overfitting: excellent training accuracy but poor performance on new data. Pruning simplifies trees by removing branches that don't improve generalization, creating smaller, more robust models.

Unpruned Tree Problems

  • Overfitting: Learns training noise instead of real patterns
  • High variance: Small data changes cause large tree changes
  • Complex trees: Hundreds of nodes, hard to interpret
  • Slow prediction: Deep trees require many comparisons
  • Poor generalization: Fails on unseen data

Pruned Tree Benefits

  • Better generalization: Focuses on robust patterns
  • Lower variance: More stable across datasets
  • Simpler models: Easier to understand and visualize
  • Faster prediction: Fewer nodes to traverse
  • Higher test accuracy: Reduces overfitting gap

Two Pruning Approaches

Pre-Pruning (Early Stopping)

Stop growing the tree during construction based on criteria like max depth, minimum samples, or insufficient information gain.

Post-Pruning (Backward Pruning)

Grow a full tree first, then remove branches bottom-up if removal improves validation performance or a complexity penalty.

Pre-Pruning (Early Stopping)

Prevent overfitting during tree construction

Pre-pruning halts tree growth before it becomes too complex by setting thresholds on various criteria. Before expanding a node, the algorithm checks if splitting would violate any constraint. If so, the node becomes a leaf. This is computationally efficient but risks underfitting.

Common Pre-Pruning Criteria

1. Maximum Depth (max_depth)

Limits how deep the tree can grow from root to leaves. Prevents excessively deep trees that memorize training data.

Example:

max_depth = 5 → Tree stops growing after 5 levels
Typical range: 3-10 for interpretability, 10-30 for performance

Trade-off: Too shallow = underfitting; too deep = overfitting. Use cross-validation to tune.

2. Minimum Samples per Node (min_samples_split)

A node must contain at least this many samples to be eligible for splitting. Prevents creating splits on very small subsets.

Example:

min_samples_split = 20 → Node with 15 samples becomes a leaf
Typical: 2-50 absolute, or 0.01-0.1% of total samples

Benefit: Avoids overfitting to small, potentially noisy subsets of data.

3. Minimum Samples per Leaf (min_samples_leaf)

A split is only allowed if each resulting child node has at least this many samples. Prevents tiny leaves.

Example:

min_samples_leaf = 10 → Split creating a child with 8 samples is rejected
Typical: 1-20 absolute, or 0.001-0.05% of total

Smoothing effect: Ensures predictions are based on reasonable sample sizes.

4. Minimum Information Gain (min_impurity_decrease)

A split only occurs if it reduces impurity by at least this amount. Prevents splits that barely improve purity.

Example:

min_impurity_decrease = 0.01 → Split with gain 0.005 is rejected
Typical: 0.0-0.1 (relative to impurity scale)

Use case: Avoids splitting on noise when no feature provides meaningful separation.

5. Maximum Leaf Nodes (max_leaf_nodes)

Limits the total number of leaf nodes in the tree. Grows the tree in a best-first manner, prioritizing splits with highest impurity reduction.

Example:

max_leaf_nodes = 20 → Tree growth stops at 20 leaves
Typical: 10-100 depending on data complexity

Advantage: Direct control over model complexity; easier than tuning depth.
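All five criteria map directly onto constructor arguments of scikit-learn's DecisionTreeClassifier. Here is a minimal sketch using synthetic data in place of a real dataset; the specific parameter values are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=5,                 # 1. cap the tree at 5 levels
    min_samples_split=20,        # 2. a node needs >= 20 samples to be split
    min_samples_leaf=10,         # 3. every child must keep >= 10 samples
    min_impurity_decrease=0.01,  # 4. reject splits whose impurity gain is < 0.01
    max_leaf_nodes=20,           # 5. best-first growth, at most 20 leaves
    random_state=0,
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```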

Pre-Pruning Trade-offs

Advantages:

  • Fast training—less tree growth
  • Lower memory usage
  • Simple to implement and understand
  • Multiple hyperparameters for fine control

Disadvantages:

  • Risk of underfitting (stopping too early)
  • Greedy—can't undo early decisions
  • May miss useful deeper patterns
  • Requires careful hyperparameter tuning

Post-Pruning (Backward Pruning)

Remove branches after building the full tree

Post-pruning first grows a complete tree (often unpruned or lightly constrained), then removes branches in a bottom-up fashion. For each internal node, the algorithm considers replacing the entire subtree with a leaf. If this simplification improves validation performance or a cost-complexity measure, the subtree is pruned.

Post-Pruning Methods

1. Reduced Error Pruning (REP)

The simplest post-pruning method. Requires a validation set. For each internal node (bottom-up), calculate validation accuracy if the subtree is replaced by a leaf. If accuracy doesn't decrease (or improves), prune the subtree.

Algorithm:

  1. Grow a complete tree on training data
  2. For each internal node (leaves first, moving up):
     a. Temporarily replace the subtree with a leaf (majority class)
     b. Calculate validation accuracy before and after
     c. If accuracy doesn't decrease, make the pruning permanent
  3. Return the pruned tree
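A minimal sketch of this algorithm, written against a hypothetical dict-based tree (a leaf is {"label": c}; an internal node carries "feature", "threshold", "left", "right", and the training labels that reached it as "samples"). This representation is invented for illustration, not taken from any library:

```python
from collections import Counter

def predict(node, x):
    # Walk down the tree until a leaf is reached.
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(node, root, X_val, y_val):
    if "label" in node:
        return                                   # leaves cannot be pruned further
    # Bottom-up: prune the children before considering this node.
    reduced_error_prune(node["left"], root, X_val, y_val)
    reduced_error_prune(node["right"], root, X_val, y_val)

    before = accuracy(root, X_val, y_val)
    backup = dict(node)                          # remember the subtree
    majority = Counter(node["samples"]).most_common(1)[0][0]
    node.clear()
    node["label"] = majority                     # temporarily turn the node into a leaf
    if accuracy(root, X_val, y_val) >= before:
        return                                   # validation accuracy didn't drop: keep the prune
    node.clear()
    node.update(backup)                          # otherwise restore the original subtree
```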

Advantages:

  • Very simple to understand and implement
  • Fast computation
  • Direct validation-based approach

Disadvantages:

  • Requires separate validation set
  • Less training data available
  • Greedy (locally optimal decisions)

2. Cost-Complexity Pruning (Minimal Cost-Complexity Pruning)

Used by CART. Balances tree complexity with training error using a complexity parameter α (alpha). Generates a sequence of trees with increasing α, then selects the best via cross-validation.

Cost-Complexity Formula:

Rα(T) = R(T) + α × |T|

Where: R(T) = training error, |T| = number of leaf nodes, α = complexity penalty

Algorithm:

  1. Grow a complete tree T0
  2. For increasing values of α:
     a. Find the subtree that minimizes Rα(T)
     b. Prune that subtree → get T1, T2, ...
  3. Use cross-validation to select the optimal α
  4. Prune using the selected α on the full training set

Intuition: Small α = large trees (low penalty), Large α = small trees (high penalty). α controls the trade-off between fit and complexity.
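In scikit-learn, this procedure is exposed through cost_complexity_pruning_path and the ccp_alpha parameter. A sketch of the alpha-selection loop, again on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Effective alphas at which subtrees of the full tree T0 get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate one pruned tree per alpha and keep the best.
scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```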

3. Error-Based Pruning (Used by C4.5)

Estimates error rate of each node using a pessimistic estimate (accounts for limited data). Prunes subtrees where the estimated error of a leaf is no worse than the weighted average of its children's errors.

Uses statistical confidence intervals to estimate error rates. More sophisticated than REP, doesn't require a separate validation set. For each node, calculates:

error(subtree) vs. error(leaf with majority class)
If error(leaf) ≤ error(subtree), prune the subtree

Benefit: More training data available (no validation split), but more complex to implement.
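A hedged sketch of the pessimistic estimate: the function below uses the common textbook normal-approximation to the binomial upper confidence limit, and the z value and node counts are illustrative assumptions rather than figures from this lesson:

```python
import math

def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on a node's error rate (normal approximation to the
    binomial upper limit; z ≈ 0.69 roughly matches C4.5's default 25% confidence)."""
    f = errors / n                               # observed error rate at the node
    return (f + z * z / (2 * n)
            + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

# Illustrative counts (not from the lesson): a candidate leaf covering 100 samples
# with 12 training errors, versus its two children covering 60 and 40 samples.
leaf_errors    = 100 * pessimistic_error(12, 100)
subtree_errors = 60 * pessimistic_error(3, 60) + 40 * pessimistic_error(5, 40)
prune = leaf_errors <= subtree_errors            # prune if the leaf is no worse
print(leaf_errors, subtree_errors, prune)
```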

Post-Pruning Trade-offs

Advantages:

  • Lower underfitting risk—sees full tree first
  • Often better generalization than pre-pruning
  • Can reconsider earlier decisions
  • Theoretically more sound (cost-complexity)

Disadvantages:

  • Higher computational cost (grow then prune)
  • More memory needed for full tree
  • More complex to implement
  • REP requires validation set

Example: Credit Approval with Pruning

A bank builds a decision tree for credit approval. Let's see how pruning improves the model:

| ID | Income | Credit Score | Debt | Age | Employment | Approved |
|----|--------|--------------|------|-----|------------|----------|
| 1  | $75k   | 720          | $15k | 35  | 8 yr       | Yes      |
| 2  | $45k   | 650          | $25k | 28  | 3 yr       | No       |
| 3  | $95k   | 780          | $10k | 42  | 15 yr      | Yes      |
| 4  | $35k   | 590          | $30k | 25  | 1 yr       | No       |
| 5  | $65k   | 700          | $18k | 38  | 10 yr      | Yes      |
| 6  | $50k   | 680          | $22k | 32  | 5 yr       | No       |
| 7  | $85k   | 750          | $12k | 45  | 12 yr      | Yes      |
| 8  | $40k   | 620          | $28k | 30  | 2 yr       | No       |

Full dataset: 1000 loan applications. Training: 700 samples, Validation: 150, Test: 150.

Unpruned Tree

Depth: 12 levels

Leaf nodes: 87

Train accuracy: 99.7%

Test accuracy: 84.0%

Severe overfitting! Gap of 15.7% between train and test. Tree memorized training noise.

Pre-Pruned (max_depth=5)

Depth: 5 levels

Leaf nodes: 23

Train accuracy: 91.4%

Test accuracy: 88.7%

Better generalization! Small gap of 2.7%. Simpler tree, easier to interpret.

Post-Pruned (REP)

Depth: 7 levels

Leaf nodes: 31

Train accuracy: 93.1%

Test accuracy: 90.0%

Best test performance! Balanced complexity. Post-pruning found optimal tree depth automatically.

Key Observations

  • Post-pruning wins: 90.0% test accuracy vs 88.7% pre-pruned vs 84.0% unpruned
  • Complexity reduction: From 87 leaves to 31 (64% reduction) with better performance
  • Interpretability: Pruned trees are small enough for bank officers to understand
  • Production benefits: Faster prediction (fewer nodes), less memory, more robust

Pre-Pruning vs Post-Pruning: When to Use Which?

| Aspect            | Pre-Pruning                    | Post-Pruning                  |
|-------------------|--------------------------------|-------------------------------|
| Training Speed    | Fast (stops early)             | Slow (grow + prune)           |
| Memory Usage      | Low (smaller trees)            | High (full tree first)        |
| Test Accuracy     | Good (may underfit)            | Better (optimal balance)      |
| Complexity        | Simple to implement            | More complex                  |
| Hyperparameters   | Many (depth, samples, etc.)    | Fewer (often just α)          |
| Underfitting Risk | Higher (greedy stops)          | Lower (sees full tree)        |
| Best For          | Large datasets, fast iteration | Optimal performance, research |

Practical Recommendations

Use Pre-Pruning When:

  • Working with very large datasets (millions of samples)
  • Training time or memory is limited
  • You need fast experimentation and iteration
  • Simplicity and interpretability are priorities
  • Using scikit-learn (default approach)

Use Post-Pruning When:

  • You need the best possible test performance
  • Dataset is small/medium (computational cost is acceptable)
  • You want theoretically sound pruning (cost-complexity)
  • Research or competition setting (performance critical)
  • You have validation data available (for REP)

In practice: Most modern implementations (scikit-learn, XGBoost, LightGBM) use pre-pruning as the default because it scales better. Post-pruning is often used in research or when squeezing out the last few percent of performance matters. You can also combine both approaches!

Frequently Asked Questions

Q: How do I choose the right max_depth for pre-pruning?

A: Use cross-validation! Try a range (e.g., 3, 5, 7, 10, 15, 20, None) and plot validation score vs depth. Look for the "elbow" where performance plateaus or starts decreasing. For interpretability, prefer shallower trees (3-7). For performance, deeper (10-20) is often better. Start with 5-10 as a reasonable default.
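A sketch of that tuning loop with scikit-learn's cross_val_score on synthetic data, using the depth grid suggested above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Score each candidate depth with 5-fold cross-validation and look for the elbow.
for depth in [3, 5, 7, 10, 15, 20, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")
```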

Q: Can I combine pre-pruning and post-pruning?

A: Yes! A common strategy is light pre-pruning (e.g., min_samples_leaf=5) to prevent extreme overfitting, then post-pruning for fine-tuning. This reduces computational cost while maintaining quality. However, in practice, one or the other is usually sufficient—combining both adds complexity without major benefits.
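If you do combine them in scikit-learn, both knobs live on the same estimator; the ccp_alpha value below is purely illustrative and should be tuned with cross-validation as shown earlier:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Light pre-pruning (min_samples_leaf) plus cost-complexity post-pruning (ccp_alpha).
clf = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=0.005, random_state=0).fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```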

Q: What's the difference between min_samples_split and min_samples_leaf?

A: min_samples_split controls when you're allowed to split a node (must have at least N samples). min_samples_leaf controls what the children must have after splitting (each child needs at least M samples). They interact: if min_samples_split=10 and min_samples_leaf=5, you need ≥10 samples to consider splitting, and both children must get ≥5 each.

Q: Does pruning always improve test accuracy?

A: Not always, but usually. If the unpruned tree hasn't overfit (e.g., the dataset is simple or noisy features were excluded), pruning may hurt slightly. With typical real-world messy data, however, pruning improves test accuracy far more often than not. The key is using validation data or cross-validation to decide how much to prune—don't prune blindly.

Q: How does pruning affect Random Forests?

A: Random Forests typically use unpruned or lightly pruned trees! Each individual tree is allowed to overfit, but averaging many diverse overfitted trees reduces variance. The randomness (random feature subsets) provides the regularization instead of pruning. That said, setting reasonable min_samples_leaf (e.g., 2-5) is still common to prevent extreme memorization.
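For comparison, a minimal scikit-learn sketch with that light constraint (the n_estimators and min_samples_leaf values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Trees are left essentially unpruned; a small min_samples_leaf adds light regularization.
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=3, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```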

Key Takeaways

Pruning prevents overfitting by simplifying trees, improving generalization

Pre-pruning stops growth early using criteria like max_depth, min_samples

Post-pruning grows full tree then removes branches using validation or complexity cost

Pre-pruning: faster, simpler, risk of underfitting; Post-pruning: better accuracy, more complex

Reduced Error Pruning (REP): simple, validation-based bottom-up pruning

Cost-Complexity Pruning: CART's method balancing error and tree size with α parameter

Use cross-validation to tune pruning hyperparameters, avoid blindly guessing values

Modern libraries default to pre-pruning for scalability; post-pruning for optimal performance