Master techniques to improve rule generalization. Learn pre-pruning and post-pruning methods to remove redundant literals and prevent overfitting.
Pruning removes redundant literals from rules to improve generalization. The goal is to find the optimal trade-off between rule accuracy (on training data) and rule simplicity (number of literals), ensuring rules perform well on unseen data.
Pre-pruning sets constraints during rule generation to prevent over-complex rules from being learned in the first place. It stops rule specialization before rules become too specific.
- Maximum rule length: stop adding literals once the rule body contains more than a fixed maximum number of literals. Prevents overly complex rules.
- Accuracy threshold: stop adding literals when rule accuracy exceeds a threshold (e.g., 95%). Prevents unnecessary specialization.
- Minimum coverage: stop adding literals when rule coverage drops below a threshold (e.g., 10 examples). Ensures rules cover sufficient examples. A combined check is sketched below.
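To make the three constraints concrete, here is a minimal sketch of a combined pre-pruning check. The function name, thresholds, and argument layout are illustrative assumptions, not part of any specific library.

```python
def should_stop_growing(num_literals, accuracy, coverage,
                        max_literals=3, accuracy_threshold=0.95,
                        min_coverage=10):
    """Return True if rule specialization should stop, i.e. if any
    of the three pre-pruning constraints is triggered."""
    if num_literals >= max_literals:    # maximum rule length reached
        return True
    if accuracy >= accuracy_threshold:  # rule is already accurate enough
        return True
    if coverage < min_coverage:         # rule covers too few examples
        return True
    return False

# Example: 2 literals, 96% training accuracy, 40 covered examples
print(should_stop_growing(2, 0.96, 40))  # True (accuracy threshold hit)
```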
A statistical test (the $\chi^2$ significance test) to determine whether adding a literal improves rule quality beyond random chance:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency under the null hypothesis. If $\chi^2$ exceeds the critical value, the literal is significant and should be kept.
Example:
If adding the literal "contains 'free'" to a spam rule yields a $\chi^2$ value above 3.84 (the critical value for 95% confidence with one degree of freedom), the literal is statistically significant and should be kept.
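As a worked sketch, the statistic can be computed from the observed and expected class counts of the emails a rule covers; the counts below are invented for illustration.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for emails covered after adding "contains 'free'":
# observed spam/non-spam split vs. the split expected under the null
# hypothesis that the literal is uninformative.
observed = [45, 5]   # 45 spam, 5 non-spam actually covered
expected = [30, 20]  # split expected if the literal were random

stat = chi_squared(observed, expected)  # (225/30) + (225/20) = 18.75
print(stat > 3.84)  # True -> keep the literal (95% confidence, 1 dof)
```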
Post-pruning generates a complete rule first, then removes redundant literals based on validation set performance. This allows the rule to be fully specialized before simplification.
Reduced Error Pruning (REP) exhaustively tries all possible pruning operations (remove each literal, or remove the rule entirely) and selects the version with the highest validation-set accuracy.
Algorithm Steps:
1. Learn a complete (fully specialized) rule on the training set.
2. Generate all pruning candidates: remove each literal in turn, or remove the rule entirely.
3. Evaluate every candidate on the validation set.
4. Keep the candidate with the highest validation accuracy.
5. Repeat from step 2 until no candidate improves validation accuracy (a code sketch follows the example below).
Example:
Rule: spam ← (contains 'free') ∧ (contains 'click') ∧ (has_links > 5)
Candidates: Remove 'free' (accuracy: 85%), Remove 'click' (accuracy: 88%), Remove 'has_links' (accuracy: 82%), Keep all (accuracy: 80%). Selected: the version with 'click' removed, at 88% accuracy.
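The following is a minimal sketch of this loop. The rule representation (a list of literals) and the `validate` accuracy function are illustrative assumptions.

```python
def rep_prune(rule, validate):
    """Reduced Error Pruning: repeatedly apply the single pruning
    operation that most improves validation accuracy.

    rule     -- list of literals (e.g. callables: example -> bool)
    validate -- function mapping a rule to validation-set accuracy
    """
    best_rule, best_acc = rule, validate(rule)
    improved = True
    while improved and best_rule:
        improved = False
        # Candidates: drop each literal in turn; dropping the last
        # remaining literal amounts to removing the rule entirely.
        for i in range(len(best_rule)):
            candidate = best_rule[:i] + best_rule[i + 1:]
            acc = validate(candidate)
            if acc > best_acc:
                best_rule, best_acc, improved = candidate, acc, True
    return best_rule, best_acc
```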
Incremental Reduced Error Pruning (IREP) prunes each rule immediately after it is learned, during sequential covering, preventing the accumulation of overfitted rules. It is more efficient than REP.
Algorithm Steps:
1. Split the training data into a growing set (typically 2/3) and a pruning set (1/3).
2. Grow a rule on the growing set by greedily adding literals.
3. Immediately prune the rule by deleting final literals while performance on the pruning set does not decrease.
4. If the pruned rule is accurate enough on the pruning set, add it to the rule set and remove the examples it covers.
5. Repeat until the pruned rule's quality falls below the stopping threshold (see the sketch after the advantages below).
Advantages:
- More efficient than REP: each rule is pruned once, right after it is learned, rather than after the whole rule set is built.
- Overfitted rules are simplified before the covering step removes examples, so later rules are learned from data that reflects the pruned, more general rules.
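A condensed sketch of the IREP outer loop under the same assumptions; `grow_rule`, `prune_rule`, and `covers` are hypothetical helpers, and the 2/3 vs. 1/3 split follows the common convention.

```python
def irep(pos, neg, grow_rule, prune_rule, covers, min_precision=0.5):
    """IREP: grow, prune, and accept one rule at a time."""
    rules = []
    while pos:
        # 1. Split remaining data into growing (2/3) and pruning (1/3) sets.
        cut_p, cut_n = 2 * len(pos) // 3, 2 * len(neg) // 3
        grow_p, prune_p = pos[:cut_p], pos[cut_p:]
        grow_n, prune_n = neg[:cut_n], neg[cut_n:]
        # 2. Grow a rule on the growing set, then 3. prune it against
        #    the held-out pruning set.
        rule = prune_rule(grow_rule(grow_p, grow_n), prune_p, prune_n)
        # 4. Stop if the pruned rule is no better than chance on the
        #    pruning set.
        p = sum(covers(rule, x) for x in prune_p)
        n = sum(covers(rule, x) for x in prune_n)
        if p + n == 0 or p / (p + n) <= min_precision:
            break
        # 5. Accept the rule and remove the examples it covers.
        rules.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
        neg = [x for x in neg if not covers(rule, x)]
    return rules
```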
IREP* is an enhancement of IREP with an improved pruning metric and rule evaluation logic. It uses more robust criteria for deciding when to stop pruning.
Key Improvements:
- The pruning metric is replaced by $v^* = \frac{p - n}{p + n}$, which is less sensitive to the class distribution of the pruning set.
- Pruning may delete any final sequence of literals, not only the last one.
- Rule acceptance uses a description-length-based stopping condition instead of a fixed accuracy cutoff.
Pruning Criterion:

$$v^*(\text{rule}) = \frac{p - n}{p + n}$$

where $p$ and $n$ are the numbers of positive and negative examples the rule covers on the pruning set. The deletion that maximizes $v^*$ is applied; if the pruned rule's score falls below the threshold, the unpruned version is kept. A sketch follows below.
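The metric and the final-sequence pruning it guides can be sketched as follows; `count_coverage` (returning the pruning-set counts p and n for a rule) is a hypothetical helper.

```python
def irep_star_value(p, n):
    """IREP* rule-value metric v* = (p - n) / (p + n)."""
    return (p - n) / (p + n) if p + n else -1.0

def prune_final_sequence(rule, count_coverage):
    """Delete the final sequence of literals that maximizes v*.
    If no deletion improves the score, the unpruned rule is kept."""
    best_rule = rule
    best_v = irep_star_value(*count_coverage(rule))
    # Candidates: keep each non-empty prefix of the literal list,
    # i.e. drop a final sequence of 1 .. len(rule)-1 literals.
    for k in range(1, len(rule)):
        candidate = rule[:k]
        v = irep_star_value(*count_coverage(candidate))
        if v > best_v:
            best_rule, best_v = candidate, v
    return best_rule, best_v
```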
Apply pruning to spam detection rules: learn a rule, then compare the unpruned and pruned versions on a validation set.
| Method | Validation Accuracy | Rule Complexity | Generalization |
|---|---|---|---|
| No Pruning | 80% | High (5 literals) | Poor |
| Pre-Pruning | 82% | Medium (3 literals) | Good |
| REP | 88% | Low (2 literals) | Excellent |
| IREP* | 90% | Low (2 literals) | Excellent |
IREP* achieves the best balance: high validation accuracy (90%), low complexity (2 literals), and excellent generalization.
Unpruned rule:

spam ← (contains 'free') ∧ (contains 'click') ∧ (has_links > 5) ∧ (sender_unknown) ∧ (urgent_subject)

Training accuracy: 95%, validation accuracy: 80% (overfitting!)

Pruned rule:

spam ← (contains 'free') ∧ (has_links > 5)

Validation accuracy: 88% (up from 80%), rule complexity: 2 literals (down from 5), and better generalization on unseen emails.
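A comparison like this one can be reproduced by evaluating each rule as a conjunction of feature predicates over a labeled validation set. The feature names and the tiny data set below are invented for illustration.

```python
# Each email is a dict of features; the label True means spam.
validation = [
    ({"free": True,  "click": False, "links": 7, "unknown": True,  "urgent": False}, True),
    ({"free": True,  "click": True,  "links": 2, "unknown": False, "urgent": True},  False),
    ({"free": False, "click": False, "links": 9, "unknown": True,  "urgent": False}, False),
    ({"free": True,  "click": False, "links": 6, "unknown": False, "urgent": True},  True),
]

unpruned = [
    lambda e: e["free"], lambda e: e["click"], lambda e: e["links"] > 5,
    lambda e: e["unknown"], lambda e: e["urgent"],
]
pruned = [lambda e: e["free"], lambda e: e["links"] > 5]

def accuracy(rule, data):
    """Fraction of emails where the conjunction's prediction
    (all literals true -> spam) matches the label."""
    return sum(all(lit(e) for lit in rule) == label
               for e, label in data) / len(data)

print(accuracy(unpruned, validation))  # 0.5 on this toy set
print(accuracy(pruned, validation))    # 1.0 -> simpler rule generalizes better
```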