Master strategies for learning individual rules. Learn top-down specialization, bottom-up generalization, evaluation metrics, and beam search for optimal rule discovery.
Learning a single rule requires searching through a space of possible logical literals. With $n$ attributes, each with $v$ possible values, there are on the order of $L = n \cdot v$ candidate literals, so the search space can be exponentially large. Efficient search strategies and evaluation metrics are essential to find good rules.
For a rule with up to $m$ literals, the number of possible rules is:

$$\sum_{i=0}^{m} \binom{L}{i} = O(L^m)$$

Where $L$ is the total number of possible literals. This grows exponentially with $m$, making exhaustive search infeasible.
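To get a feel for the growth, here is a short Python sketch; the choice of 10 attributes with 5 values each (so $L = 50$ literals) is purely illustrative:

```python
from math import comb

def rule_space_size(L: int, m: int) -> int:
    """Count distinct rule bodies that use up to m of L possible literals."""
    return sum(comb(L, i) for i in range(m + 1))

L = 50  # e.g. 10 attributes x 5 values each (illustrative)
for m in (1, 2, 3, 4):
    print(m, rule_space_size(L, m))
# 1 -> 51, 2 -> 1276, 3 -> 20876, 4 -> 251176
```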
There are two main approaches to searching the rule space: top-down (general to specific) and bottom-up (specific to general).
Start with the most general rule (empty body, covers all samples) and gradually add literals to specialize the rule, reducing coverage but increasing accuracy.
Algorithm Steps:
1. Initialize the rule with an empty body, so it covers all samples.
2. Generate candidate specializations by adding one literal to the body.
3. Evaluate each candidate with a metric such as accuracy or information gain.
4. Keep the best candidate (or the top-$k$ candidates when using beam search).
5. Repeat until the rule reaches an accuracy threshold or no literal improves it.
Example Progression:
`churn ← .` → `churn ← (age < 30)` → `churn ← (age < 30) ∧ (usage < 5hrs)`
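The following is a minimal sketch of greedy top-down specialization (beam width 1). It assumes examples are (features, label) pairs and literals are named predicate tests; the function names and toy data are illustrative, not a fixed API:

```python
def top_down(examples, literals, min_accuracy=0.9):
    """Greedy top-down specialization: add literals until accurate enough."""
    def acc(subset):
        return sum(y for _, y in subset) / len(subset) if subset else 0.0

    rule, covered, candidates = [], list(examples), list(literals)
    while covered and acc(covered) < min_accuracy and candidates:
        # score each remaining literal by the accuracy of the specialized rule
        scored = [(acc([e for e in covered if test(e[0])]), name, test)
                  for name, test in candidates]
        best_acc, best_name, best_test = max(scored, key=lambda s: s[0])
        if best_acc <= acc(covered):
            break  # no literal improves accuracy; stop specializing
        rule.append(best_name)
        covered = [e for e in covered if best_test(e[0])]
        candidates = [(n, t) for n, t in candidates if n != best_name]
    return rule, covered

data = [({"age": 25, "usage": 3}, 1), ({"age": 27, "usage": 4}, 1),
        ({"age": 45, "usage": 9}, 0), ({"age": 28, "usage": 8}, 0)]
tests = [("age<30", lambda x: x["age"] < 30),
         ("usage<5", lambda x: x["usage"] < 5)]
print(top_down(data, tests))  # one literal already reaches 100% on this toy data
```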
Start with a specific rule (covers a single positive example) and gradually remove literals to generalize the rule, increasing coverage while maintaining accuracy.
Algorithm Steps:
1. Initialize the rule body with the literals describing a single positive example (the seed).
2. Generate candidate generalizations by removing one literal from the body.
3. Evaluate each candidate: coverage should grow while accuracy stays above the threshold.
4. Keep the best candidate and repeat until no literal can be removed without losing accuracy.
Example Progression:
`churn ← (age < 30) ∧ (usage < 5hrs) ∧ (support_tickets > 2)` → `churn ← (age < 30) ∧ (usage < 5hrs)`
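A matching sketch of bottom-up generalization, assuming equality literals built from a seed example's attribute values (again, names and data are illustrative):

```python
def bottom_up(seed, examples, min_accuracy=0.9):
    """Greedy bottom-up generalization: drop literals while accuracy holds."""
    def stats(body):
        cov = [y for x, y in examples if all(x[a] == v for a, v in body)]
        return (sum(cov) / len(cov), len(cov)) if cov else (0.0, 0)

    body = list(seed.items())  # maximally specific: one literal per attribute
    improved = True
    while improved and len(body) > 1:
        improved = False
        for lit in list(body):
            trial = [l for l in body if l != lit]
            t_acc, t_cov = stats(trial)
            # drop the literal if coverage grows and accuracy stays acceptable
            if t_cov > stats(body)[1] and t_acc >= min_accuracy:
                body, improved = trial, True
                break
    return body

data = [({"age": 25, "usage": 3}, 1), ({"age": 25, "usage": 4}, 1),
        ({"age": 40, "usage": 3}, 0)]
print(bottom_up({"age": 25, "usage": 3}, data))  # -> [('age', 25)]
```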
Top-Down: starts from maximum coverage and trades it for accuracy. Each added literal is chosen using statistics over many covered samples, which makes the search comparatively robust to noise.

Bottom-Up: starts from maximum accuracy (a single positive example) and trades it for coverage. The result depends on the chosen seed example, so a noisy seed can mislead the search.
Evaluation metrics determine which candidate rule is best when adding or removing literals. Common metrics balance accuracy and coverage.
Proportion of covered samples that are positive examples:

$$\text{Accuracy}(r) = \frac{p}{p + n}$$

Where $p$ and $n$ are the numbers of positive and negative examples covered by rule $r$.
Higher accuracy means fewer negative examples are covered. A rule with 100% accuracy covers only positive examples.
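In code this is a one-liner; the counts (22, 3) are taken from the step 3 rule in the worked example below:

```python
def rule_accuracy(p: int, n: int) -> float:
    """Fraction of covered examples that are positive."""
    return p / (p + n)

print(rule_accuracy(22, 3))  # 0.88, matching the step 3 rule below
```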
Measures the reduction in entropy (uncertainty) when adding a literal to the rule:

$$\text{Gain}(r, l) = H(D_r) - H(D_{r \wedge l})$$

Where $H(D_r)$ is the entropy of the samples covered before adding literal $l$, and $H(D_{r \wedge l})$ is the entropy of the samples covered by the specialized rule $r \wedge l$.

Entropy Formula:

$$H(D) = -\sum_{c} p_c \log_2 p_c$$

Where $p_c$ is the proportion of class $c$ in dataset $D$.
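A small sketch of both formulas; the names are illustrative, and the sample counts come from this section's churn example:

```python
from math import log2

def entropy(pos: int, neg: int) -> float:
    """Binary entropy of a set with pos positive and neg negative examples."""
    h, total = 0.0, pos + neg
    for count in (pos, neg):
        if count:  # 0 * log(0) is taken as 0
            h -= (count / total) * log2(count / total)
    return h

def info_gain(before: tuple, after: tuple) -> float:
    """Entropy reduction when specialization shrinks coverage before -> after."""
    return entropy(*before) - entropy(*after)

print(round(entropy(60, 40), 3))  # 0.971 bits: the full 60/40 training set below
```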
Measures the impurity of the covered samples. Lower Gini means higher purity (more homogeneous):

$$\text{Gini}(D) = 1 - \sum_{c} p_c^2$$

Gini ranges from 0 (pure, all samples in the same class) to 0.5 (maximum impurity for two equally frequent classes). A rule with a low Gini coefficient is preferred.
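The same metric as a sketch, with the step 3 counts (22 churn, 3 stay) from the worked example below:

```python
def gini(pos: int, neg: int) -> float:
    """Gini impurity of the covered set: 0 = pure, 0.5 = 50/50 binary split."""
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

print(round(gini(22, 3), 4))  # 0.2112 -- fairly pure
print(gini(50, 50))           # 0.5    -- maximum impurity for two classes
```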
Instead of keeping only the best candidate rule at each step (greedy search), beam search maintains the top-$k$ candidate rules, expanding the portion of the search space explored and reducing the risk of getting stuck in local optima.
Advantages:
- Explores more of the rule space than greedy search at a modest extra cost ($k$ candidates evaluated per step instead of one).
- Less likely to get trapped by a single poor literal choice early in the search.
- With $k = 1$ it reduces to greedy search, so the beam width directly trades compute for search quality.
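A compact sketch of beam search over rule bodies, reusing the accuracy metric from above; the names and parameters are illustrative assumptions, not a fixed interface:

```python
def beam_search(examples, literals, k=3, max_literals=4):
    """Keep the k best candidate bodies per step instead of only the best one."""
    def acc(body):
        cov = [y for x, y in examples if all(test(x) for _, test in body)]
        return sum(cov) / len(cov) if cov else 0.0

    beam = [[]]  # start from the most general rule (empty body)
    for _ in range(max_literals):
        candidates = [body + [lit] for body in beam
                      for lit in literals if lit not in body]
        if not candidates:
            break
        candidates.sort(key=acc, reverse=True)
        beam = candidates[:k]  # prune everything below the top-k
    return max(beam, key=acc)
```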
Apply top-down search to learn a rule predicting customer churn. Training set: 100 customers (60 churn, 40 stay). Use beam search with $k = 3$.
| Step | Rule | Coverage | Accuracy | Description |
|---|---|---|---|---|
| 1 | churn ← . | All (100) | 60% | Most general rule |
| 2 | churn ← (age < 30) | 40 | 75% | Add age condition |
| 3 | churn ← (age < 30) ∧ (usage < 5hrs) | 25 | 88% | Add usage condition |
| 4 | churn ← (age < 30) ∧ (usage < 5hrs) ∧ (support_tickets > 2) | 15 | 93% | Final specialized rule |
At each step, beam search maintains the top-3 candidate rules. The step 4 rule achieves 93% accuracy with 15 samples covered, meeting the accuracy threshold.
Current rule: churn ← (age < 30) ∧ (usage < 5hrs)
Coverage: 25 samples (22 churn, 3 stay)
Information gain: $H(D_r) - H(D_{r \wedge l}) = 0.811 - 0.529 \approx 0.28$ bits, where $r$ is the step 2 rule (age < 30) and $l$ is the usage literal.

Adding the usage condition increases accuracy from 75% to 88% with a significant information gain.
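The gain can be double-checked from the entropy formula above, using the step 2 and step 3 counts in the table:

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy of a positive fraction p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

before = H(30 / 40)  # churn <- (age < 30): 30 of the 40 covered customers churn
after = H(22 / 25)   # after adding (usage < 5hrs): 22 of 25 covered churn
print(round(before, 3), round(after, 3), round(before - after, 2))
# 0.811 0.529 0.28
```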
This rule identifies young customers with low usage and high support tickets as likely to churn, with 93% accuracy covering 15% of the dataset.