Master techniques to improve rule generalization. Learn pre-pruning and post-pruning methods to remove redundant literals and prevent overfitting.
Pruning removes redundant literals from rules to improve generalization. The goal is to find the optimal trade-off between rule accuracy (on training data) and rule simplicity (number of literals), ensuring rules perform well on unseen data.
Pre-pruning sets constraints during rule generation to prevent over-complex rules from being learned in the first place. It stops rule specialization before rules become too specific.
- Maximum rule length: stop adding literals once the rule body contains more than a fixed maximum number of literals. Prevents overly complex rules.
- Accuracy threshold: stop adding literals when rule accuracy exceeds a threshold (e.g., 95%). Prevents unnecessary specialization.
- Minimum coverage: stop adding literals when rule coverage drops below a threshold (e.g., 10 examples). Ensures rules cover sufficient examples. A combined check is sketched below.
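To make the three constraints concrete, here is a minimal sketch of a combined pre-pruning check. The function name, thresholds, and argument layout are illustrative assumptions, not part of any specific library.

```python
def should_stop_growing(num_literals, accuracy, coverage,
                        max_literals=3, accuracy_threshold=0.95,
                        min_coverage=10):
    """Return True if rule specialization should stop, i.e. if any
    of the three pre-pruning constraints is triggered."""
    if num_literals >= max_literals:    # maximum rule length reached
        return True
    if accuracy >= accuracy_threshold:  # rule is already accurate enough
        return True
    if coverage < min_coverage:         # rule covers too few examples
        return True
    return False

# Example: 2 literals, 96% training accuracy, 40 covered examples
print(should_stop_growing(2, 0.96, 40))  # True (accuracy threshold hit)
```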
A statistical test (the $\chi^2$ significance test) to determine whether adding a literal improves rule quality beyond random chance:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency under the null hypothesis. If $\chi^2$ exceeds the critical value, the literal is significant and should be kept.
Example:
If adding the literal "contains 'free'" to a spam rule yields a $\chi^2$ value above 3.84 (the critical value for 95% confidence with one degree of freedom), the literal is statistically significant and should be kept.
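As a worked sketch, the statistic can be computed from the observed and expected class counts of the emails a rule covers; the counts below are invented for illustration.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for emails covered after adding "contains 'free'":
# observed spam/non-spam split vs. the split expected under the null
# hypothesis that the literal is uninformative.
observed = [45, 5]   # 45 spam, 5 non-spam actually covered
expected = [30, 20]  # split expected if the literal were random

stat = chi_squared(observed, expected)  # (225/30) + (225/20) = 18.75
print(stat > 3.84)  # True -> keep the literal (95% confidence, 1 dof)
```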
Post-pruning generates a complete rule first, then removes redundant literals based on validation set performance. This allows the rule to be fully specialized before simplification.
Reduced Error Pruning (REP) exhaustively tries all possible pruning operations (remove each literal, or remove the rule entirely) and selects the version with the highest validation-set accuracy.
Algorithm Steps:
1. Learn a complete (fully specialized) rule on the training set.
2. Generate all pruning candidates: remove each literal in turn, or remove the rule entirely.
3. Evaluate every candidate on the validation set.
4. Keep the candidate with the highest validation accuracy.
5. Repeat from step 2 until no candidate improves validation accuracy (a code sketch follows the example below).
Example:
Rule: spam ← (contains 'free') ∧ (contains 'click') ∧ (has_links > 5)
Candidates: Remove 'free' (accuracy: 85%), Remove 'click' (accuracy: 88%), Remove 'has_links' (accuracy: 82%), Keep all (accuracy: 80%). Selected: the version with 'click' removed, at 88% accuracy.
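The following is a minimal sketch of this loop. The rule representation (a list of literals) and the `validate` accuracy function are illustrative assumptions.

```python
def rep_prune(rule, validate):
    """Reduced Error Pruning: repeatedly apply the single pruning
    operation that most improves validation accuracy.

    rule     -- list of literals (e.g. callables: example -> bool)
    validate -- function mapping a rule to validation-set accuracy
    """
    best_rule, best_acc = rule, validate(rule)
    improved = True
    while improved and best_rule:
        improved = False
        # Candidates: drop each literal in turn; dropping the last
        # remaining literal amounts to removing the rule entirely.
        for i in range(len(best_rule)):
            candidate = best_rule[:i] + best_rule[i + 1:]
            acc = validate(candidate)
            if acc > best_acc:
                best_rule, best_acc, improved = candidate, acc, True
    return best_rule, best_acc
```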
Incremental Reduced Error Pruning (IREP) prunes each rule immediately after it is learned, during sequential covering, preventing the accumulation of overfitted rules. It is more efficient than REP.
Algorithm Steps:
1. Split the training data into a growing set (typically 2/3) and a pruning set (1/3).
2. Grow a rule on the growing set by greedily adding literals.
3. Immediately prune the rule by deleting final literals while performance on the pruning set does not decrease.
4. If the pruned rule is accurate enough on the pruning set, add it to the rule set and remove the examples it covers.
5. Repeat until the pruned rule's quality falls below the stopping threshold (see the sketch after the advantages below).
Advantages:
- More efficient than REP: each rule is pruned once, right after it is learned, rather than after the whole rule set is built.
- Overfitted rules are simplified before the covering step removes examples, so later rules are learned from data that reflects the pruned, more general rules.
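A condensed sketch of the IREP outer loop under the same assumptions; `grow_rule`, `prune_rule`, and `covers` are hypothetical helpers, and the 2/3 vs. 1/3 split follows the common convention.

```python
def irep(pos, neg, grow_rule, prune_rule, covers, min_precision=0.5):
    """IREP: grow, prune, and accept one rule at a time."""
    rules = []
    while pos:
        # 1. Split remaining data into growing (2/3) and pruning (1/3) sets.
        cut_p, cut_n = 2 * len(pos) // 3, 2 * len(neg) // 3
        grow_p, prune_p = pos[:cut_p], pos[cut_p:]
        grow_n, prune_n = neg[:cut_n], neg[cut_n:]
        # 2. Grow a rule on the growing set, then 3. prune it against
        #    the held-out pruning set.
        rule = prune_rule(grow_rule(grow_p, grow_n), prune_p, prune_n)
        # 4. Stop if the pruned rule is no better than chance on the
        #    pruning set.
        p = sum(covers(rule, x) for x in prune_p)
        n = sum(covers(rule, x) for x in prune_n)
        if p + n == 0 or p / (p + n) <= min_precision:
            break
        # 5. Accept the rule and remove the examples it covers.
        rules.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
        neg = [x for x in neg if not covers(rule, x)]
    return rules
```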
IREP* is an enhancement of IREP with an improved pruning metric and rule evaluation logic. It uses more robust criteria for deciding when to stop pruning.
Key Improvements:
- The pruning metric is replaced by $v^* = \frac{p - n}{p + n}$, which is less sensitive to the class distribution of the pruning set.
- Pruning may delete any final sequence of literals, not only the last one.
- Rule acceptance uses a description-length-based stopping condition instead of a fixed accuracy cutoff.
Pruning Criterion:

$$v^*(\text{rule}) = \frac{p - n}{p + n}$$

where $p$ and $n$ are the numbers of positive and negative examples the rule covers on the pruning set. The deletion that maximizes $v^*$ is applied; if the pruned rule's score falls below the threshold, the unpruned version is kept. A sketch follows below.
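The metric and the final-sequence pruning it guides can be sketched as follows; `count_coverage` (returning the pruning-set counts p and n for a rule) is a hypothetical helper.

```python
def irep_star_value(p, n):
    """IREP* rule-value metric v* = (p - n) / (p + n)."""
    return (p - n) / (p + n) if p + n else -1.0

def prune_final_sequence(rule, count_coverage):
    """Delete the final sequence of literals that maximizes v*.
    If no deletion improves the score, the unpruned rule is kept."""
    best_rule = rule
    best_v = irep_star_value(*count_coverage(rule))
    # Candidates: keep each non-empty prefix of the literal list,
    # i.e. drop a final sequence of 1 .. len(rule)-1 literals.
    for k in range(1, len(rule)):
        candidate = rule[:k]
        v = irep_star_value(*count_coverage(candidate))
        if v > best_v:
            best_rule, best_v = candidate, v
    return best_rule, best_v
```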
Apply pruning to spam detection rules: learn a rule, then compare the unpruned and pruned versions on a validation set.
| Method | Validation Accuracy | Rule Complexity | Generalization |
|---|---|---|---|
| No Pruning | 80% | High (5 literals) | Poor |
| Pre-Pruning | 82% | Medium (3 literals) | Good |
| REP | 88% | Low (2 literals) | Excellent |
| IREP* | 90% | Low (2 literals) | Excellent |
IREP* achieves the best balance: high validation accuracy (90%), low complexity (2 literals), and excellent generalization.
Unpruned rule:

spam ← (contains 'free') ∧ (contains 'click') ∧ (has_links > 5) ∧ (sender_unknown) ∧ (urgent_subject)

Training accuracy: 95%, validation accuracy: 80% (overfitting!)

Pruned rule:

spam ← (contains 'free') ∧ (has_links > 5)

Validation accuracy: 88% (up from 80%), rule complexity: 2 literals (down from 5), and better generalization on unseen emails.
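A comparison like this one can be reproduced by evaluating each rule as a conjunction of feature predicates over a labeled validation set. The feature names and the tiny data set below are invented for illustration.

```python
# Each email is a dict of features; the label True means spam.
validation = [
    ({"free": True,  "click": False, "links": 7, "unknown": True,  "urgent": False}, True),
    ({"free": True,  "click": True,  "links": 2, "unknown": False, "urgent": True},  False),
    ({"free": False, "click": False, "links": 9, "unknown": True,  "urgent": False}, False),
    ({"free": True,  "click": False, "links": 6, "unknown": False, "urgent": True},  True),
]

unpruned = [
    lambda e: e["free"], lambda e: e["click"], lambda e: e["links"] > 5,
    lambda e: e["unknown"], lambda e: e["urgent"],
]
pruned = [lambda e: e["free"], lambda e: e["links"] > 5]

def accuracy(rule, data):
    """Fraction of emails where the conjunction's prediction
    (all literals true -> spam) matches the label."""
    return sum(all(lit(e) for lit in rule) == label
               for e, label in data) / len(data)

print(accuracy(unpruned, validation))  # 0.5 on this toy set
print(accuracy(pruned, validation))    # 1.0 -> simpler rule generalizes better
```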