
Single Rule Learning

Master strategies for learning individual rules. Learn top-down specialization, bottom-up generalization, evaluation metrics, and beam search for optimal rule discovery.

Module 3 of 7
Intermediate Level
120-150 min

Core Challenge: Combinatorial Explosion

Learning a single rule requires searching through a space of possible logical literals. With $d$ attributes, each with $v_i$ possible values, the search space can be exponentially large. Efficient search strategies and evaluation metrics are essential to find good rules.

Search Space Size

For a rule with up to $k$ literals, the number of possible rule bodies is:

$$\sum_{i=1}^{k} \binom{d \cdot v}{i}$$

where, assuming each of the $d$ attributes contributes $v$ candidate literals, $d \cdot v$ is the total number of possible literals. This sum grows exponentially with $k$, making exhaustive search infeasible.
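To get a feel for the scale, here is a minimal Python sketch that counts rule bodies with at most $k$ literals, assuming every one of the $d$ attributes contributes $v$ candidate literals (the function name and parameter values are illustrative, not taken from any library):

```python
from math import comb

def rule_space_size(d: int, v: int, k: int) -> int:
    """Number of rule bodies with at most k literals,
    assuming d attributes with v candidate literals each."""
    total_literals = d * v
    return sum(comb(total_literals, i) for i in range(1, k + 1))

# Even a modest problem explodes quickly:
for k in (1, 2, 3, 4):
    print(k, rule_space_size(d=20, v=5, k=k))
# 1 -> 100, 2 -> 5050, 3 -> 166750, 4 -> 4087975
```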

Search Strategies

Two main approaches to search the rule space: top-down (general to specific) and bottom-up (specific to general).

Top-Down Search (General to Specific)

Start with the most general rule (empty body, covers all samples) and gradually add literals to specialize the rule, reducing coverage but increasing accuracy.

Algorithm Steps:

  1. Initialize: Start with rule $r: \text{head} \leftarrow .$ (empty body, covers all samples)
  2. Generate candidates: For each possible literal $l$, create candidate rule $r \land l$
  3. Evaluate candidates: Compute an evaluation metric (accuracy, information gain, etc.) for each candidate
  4. Select best: Choose the candidate with the highest metric value
  5. Repeat: Continue until the accuracy threshold is met or no improvement is possible

Example Progression:

$$\text{approve} \leftarrow . \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \land (\text{credit\_score} \geq 700)$$
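The following is a minimal sketch of the greedy top-down loop described above. It assumes samples are dictionaries with a boolean `label` key and that candidate literals are `(attribute, operator, value)` tuples; all names (`covers`, `accuracy`, `specialize_top_down`) are illustrative rather than taken from any particular library.

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt, "==": operator.eq}

def covers(rule, sample):
    """A rule body is a list of (attribute, op, value) literals; an empty body covers everything."""
    return all(OPS[op](sample[attr], value) for attr, op, value in rule)

def accuracy(rule, samples):
    """Return (accuracy on covered samples, number of covered samples)."""
    covered = [s for s in samples if covers(rule, s)]
    if not covered:
        return 0.0, 0
    positives = sum(1 for s in covered if s["label"])
    return positives / len(covered), len(covered)

def specialize_top_down(samples, candidate_literals, target_accuracy=0.9):
    """Greedy top-down search: start from the empty body and repeatedly add
    the literal that most improves accuracy until the target is reached."""
    rule = []                                   # head <- .  (empty body)
    best_acc, _ = accuracy(rule, samples)
    while best_acc < target_accuracy:
        scored = []
        for lit in candidate_literals:
            if lit in rule:
                continue
            acc, cov = accuracy(rule + [lit], samples)
            if cov > 0:
                scored.append((acc, cov, lit))
        if not scored:
            break
        acc, cov, lit = max(scored, key=lambda t: (t[0], t[1]))  # best accuracy, ties by coverage
        if acc <= best_acc:
            break                               # no improvement possible
        rule, best_acc = rule + [lit], acc
    return rule, best_acc
```

On suitable data, a call like `specialize_top_down(samples, [("income", ">=", 50000), ("credit_score", ">=", 700)])` would reproduce the progression shown above.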

Bottom-Up Search (Specific to General)

Start with a specific rule (covers a single positive example) and gradually remove literals to generalize the rule, increasing coverage while maintaining accuracy.

Algorithm Steps:

  1. Initialize: Start with a rule covering a single positive example $x$: $r: \text{head} \leftarrow \bigwedge_{i} (a_i = x_i)$
  2. Generate candidates: For each literal $l$ in the current rule, create candidate $r \setminus \{l\}$ (remove the literal)
  3. Evaluate candidates: Check whether removing the literal maintains acceptable accuracy
  4. Select best: Choose the candidate with the best coverage-accuracy trade-off
  5. Repeat: Continue until the coverage threshold is met or accuracy drops too low

Example Progression:

$$\text{approve} \leftarrow (\text{income} = 55000) \land (\text{credit\_score} = 720) \land (\text{employment} = \text{stable}) \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \land (\text{credit\_score} \geq 700)$$
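A corresponding sketch of bottom-up generalization, reusing the hypothetical `covers`/`accuracy` helpers from the top-down sketch above; the seed rule conjoins one equality literal per attribute of a chosen positive example, and the loop drops whichever literal buys the most coverage without falling below a minimum accuracy.

```python
def seed_rule(example, attributes):
    """Most specific rule: one equality literal per attribute of the positive example."""
    return [(attr, "==", example[attr]) for attr in attributes]

def generalize_bottom_up(example, attributes, samples, min_accuracy=0.9):
    """Greedy bottom-up search: repeatedly drop the literal whose removal
    keeps accuracy >= min_accuracy and yields the largest coverage."""
    rule = seed_rule(example, attributes)
    improved = True
    while improved and len(rule) > 1:
        improved = False
        candidates = []
        for lit in rule:
            shorter = [l for l in rule if l != lit]
            acc, cov = accuracy(shorter, samples)   # helper from the top-down sketch
            if acc >= min_accuracy:
                candidates.append((cov, acc, shorter))
        if candidates:
            _, _, rule = max(candidates, key=lambda c: (c[0], c[1]))
            improved = True
    return rule
```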

Comparison

Top-Down:

  • ✓ Clear search direction
  • ✓ Less likely to miss important literals
  • ✗ May add redundant literals
  • ✗ Can get stuck in local optima

Bottom-Up:

  • ✓ Avoids redundant literals
  • ✓ More concise rules
  • ✗ Large initial search space
  • ✗ May cover negative examples

Evaluation Metrics

Evaluation metrics determine which candidate rule is best when adding or removing literals. Common metrics balance accuracy and coverage.

Accuracy

Proportion of covered samples that are positive examples:

$$\text{accuracy}(r) = \frac{|\text{covered\_positives}(r)|}{|\text{covered\_positives}(r)| + |\text{covered\_negatives}(r)|}$$

Higher accuracy means fewer negative examples are covered. A rule with 100% accuracy covers only positive examples.
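As a quick sketch, the metric reduces to a ratio of class counts over the covered samples (the function name is illustrative):

```python
def rule_accuracy(covered_pos: int, covered_neg: int) -> float:
    """Accuracy of a rule from the class counts of the samples it covers."""
    covered = covered_pos + covered_neg
    return covered_pos / covered if covered else 0.0

print(rule_accuracy(22, 3))   # 0.88 -- a rule covering 22 positives and 3 negatives
```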

Information Gain (Entropy Reduction)

Measures the reduction in entropy (uncertainty) when adding a literal to the rule:

$$\text{IG}(r, l) = H(D) - H(D \mid r \land l)$$

Where $H(D)$ is the entropy of the dataset before adding literal $l$, and $H(D \mid r \land l)$ is the entropy of the samples covered by rule $r \land l$.

Entropy Formula:

$$H(D) = -\sum_{i} p_i \log_2 p_i$$

Where $p_i$ is the proportion of class $i$ in dataset $D$.
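A minimal sketch of both quantities, estimating entropies from raw class counts (function names are illustrative):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(counts_dataset, counts_covered):
    """IG(r, l) = H(D) - H(D | r ∧ l), both estimated from class counts."""
    return entropy(counts_dataset) - entropy(counts_covered)

print(entropy([50, 50]))                               # 1.0 bit: maximally uncertain
print(round(information_gain([50, 50], [9, 1]), 3))    # ≈ 0.531: covered samples are much purer
```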

Gini Coefficient (Impurity)

Measures the impurity of covered samples. Lower Gini means higher purity (more homogeneous):

$$\text{Gini}(D) = 1 - \sum_{i} p_i^2$$

For two classes, Gini ranges from 0 (pure, all samples in the same class) to 0.5 (maximum impurity, equal classes). A rule with a low Gini coefficient is preferred.
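A corresponding sketch for Gini impurity from class counts:

```python
def gini(counts):
    """Gini impurity of a class distribution given raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> pure (all one class)
print(gini([5, 5]))    # 0.5 -> maximum impurity for two classes
```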

Avoiding Local Optima: Beam Search

Instead of keeping only the best candidate rule at each step (greedy), beam search maintains the top-$k$ candidate rules, exploring more of the search space and reducing the risk of getting stuck in local optima.

Beam Search Algorithm

  1. Initialize beam: Start with the top-$k$ most general rules (for top-down) or the top-$k$ most specific rules (for bottom-up)
  2. Expand all candidates: For each rule in the beam, generate all possible candidate rules (add/remove literals)
  3. Evaluate all candidates: Compute the evaluation metric for each candidate
  4. Select top-$k$: Keep the top-$k$ candidates with the highest metric values as the new beam
  5. Repeat: Continue until a termination condition is met (accuracy threshold, max depth, etc.)
  6. Return best: Select the best rule from the final beam
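Below is a minimal sketch of the top-down variant of beam search, reusing the hypothetical `accuracy` helper from the earlier top-down sketch; the beam width `k`, depth limit, and stopping rule are assumptions for illustration.

```python
def beam_search(samples, candidate_literals, k=3, max_literals=4, target_accuracy=0.9):
    """Top-down beam search: keep the k best partial rules at every depth."""
    beam = [[]]                                     # start from the empty body
    best_rule, best_acc = [], accuracy([], samples)[0]
    for _ in range(max_literals):
        expansions = []
        for rule in beam:
            for lit in candidate_literals:
                if lit in rule:
                    continue
                cand = rule + [lit]
                acc, cov = accuracy(cand, samples)
                if cov > 0:
                    expansions.append((acc, cov, cand))
        if not expansions:
            break
        expansions.sort(key=lambda e: (e[0], e[1]), reverse=True)
        beam = [cand for _, _, cand in expansions[:k]]          # new beam: top-k candidates
        if expansions[0][0] > best_acc:
            best_acc, best_rule = expansions[0][0], expansions[0][2]
        if best_acc >= target_accuracy:
            break
    return best_rule, best_acc
```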

Advantages:

  • Explores multiple search paths simultaneously
  • Less likely to miss optimal rules
  • Balances exploration and exploitation
  • Parameter $k$ controls search breadth

Customer Churn Prediction Example

Apply top-down search to learn a rule predicting customer churn. Training set: 100 customers (60 churn, 40 stay). Use beam search with $k = 3$.

Top-Down Search Progression

| Step | Rule | Coverage (samples) | Accuracy | Description |
|------|------|--------------------|----------|-------------|
| 1 | churn ← . | All (100) | 60% | Most general rule |
| 2 | churn ← (age < 30) | 40 | 75% | Add age condition |
| 3 | churn ← (age < 30) ∧ (usage < 5hrs) | 25 | 88% | Add usage condition |
| 4 | churn ← (age < 30) ∧ (usage < 5hrs) ∧ (support_tickets > 2) | 15 | 93% | Final specialized rule |

At each step, beam search maintains the top-3 candidate rules. The step-4 rule achieves 93% accuracy with 15 samples covered, meeting the accuracy threshold.

Evaluation Metric Calculation

Step 3: Adding usage condition

Current rule: $\text{churn} \leftarrow (\text{age} < 30)$; candidate literal: $\text{usage} < 5\,\text{hrs}$

Candidate coverage: 25 samples (22 churn, 3 stay)

$$\text{accuracy} = \frac{22}{25} = 0.88 = 88\%$$

Information gain: $\text{IG} = H(D) - H(D \mid r \land l) = 0.971 - 0.529 = 0.442$, where $0.971$ is the entropy of the full 60/40 churn/stay split and $0.529$ is the entropy of the 22/3 split covered by the candidate rule.

Adding the usage condition increases accuracy from 75% to 88% with a significant information gain.
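The stated numbers can be checked directly; this snippet recomputes them from the class counts (60/40 overall, 22/3 covered):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

H_D   = entropy([60, 40])   # full training set: 60 churn, 40 stay
H_cov = entropy([22, 3])    # samples covered by the candidate rule
print(round(22 / 25, 2))                                       # 0.88 accuracy
print(round(H_D, 3), round(H_cov, 3), round(H_D - H_cov, 3))   # 0.971 0.529 0.442
```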

Final Rule:

$$\text{churn} \leftarrow (\text{age} < 30) \land (\text{usage} < 5\,\text{hrs}) \land (\text{support\_tickets} > 2)$$

This rule identifies young customers with low usage and high support tickets as likely to churn, with 93% accuracy covering 15% of the dataset.

Advantages and Limitations

Advantages

  • Systematic search: Explores rule space systematically
  • Flexible metrics: Can use various evaluation criteria
  • Beam search: Reduces local optima risk
  • Interpretable process: Clear search progression

Limitations

  • Combinatorial explosion: Large search space for many attributes
  • Metric selection: Choice of metric affects results
  • Local optima: Even beam search may miss global optimum
  • Computational cost: Can be slow for large datasets