
Single Rule Learning

Master strategies for learning individual rules. Learn top-down specialization, bottom-up generalization, evaluation metrics, and beam search for optimal rule discovery.

Module 3 of 7
Intermediate Level
120-150 min

Core Challenge: Combinatorial Explosion

Learning a single rule requires searching through a space of possible logical literals. With $d$ attributes, each with $v_i$ possible values, the search space can be exponentially large. Efficient search strategies and evaluation metrics are essential to find good rules.

Search Space Size

For a rule with up to $k$ literals, the number of possible rule bodies is:

$$\sum_{i=1}^{k} \binom{d \cdot v}{i}$$

where, assuming each of the $d$ attributes contributes $v$ candidate literals, $d \cdot v$ is the total number of possible literals. This sum grows exponentially with $k$, making exhaustive search infeasible.
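To get a feel for the scale, here is a minimal Python sketch that counts rule bodies with at most $k$ literals, assuming every one of the $d$ attributes contributes $v$ candidate literals (the function name and parameter values are illustrative, not taken from any library):

```python
from math import comb

def rule_space_size(d: int, v: int, k: int) -> int:
    """Number of rule bodies with at most k literals,
    assuming d attributes with v candidate literals each."""
    total_literals = d * v
    return sum(comb(total_literals, i) for i in range(1, k + 1))

# Even a modest problem explodes quickly:
for k in (1, 2, 3, 4):
    print(k, rule_space_size(d=20, v=5, k=k))
# 1 -> 100, 2 -> 5050, 3 -> 166750, 4 -> 4087975
```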

Search Strategies

Two main approaches to search the rule space: top-down (general to specific) and bottom-up (specific to general).

Top-Down Search (General to Specific)

Start with the most general rule (empty body, covers all samples) and gradually add literals to specialize the rule, reducing coverage but increasing accuracy.

Algorithm Steps:

  1. Initialize: Start with rule $r: \text{head} \leftarrow .$ (empty body, covers all samples)
  2. Generate candidates: For each possible literal $l$, create candidate rule $r \land l$
  3. Evaluate candidates: Compute an evaluation metric (accuracy, information gain, etc.) for each candidate
  4. Select best: Choose the candidate with the highest metric value
  5. Repeat: Continue until the accuracy threshold is met or no improvement is possible

Example Progression:

$$\text{approve} \leftarrow . \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \land (\text{credit\_score} \geq 700)$$
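The following is a minimal sketch of the greedy top-down loop described above. It assumes samples are dictionaries with a boolean `label` key and that candidate literals are `(attribute, operator, value)` tuples; all names (`covers`, `accuracy`, `specialize_top_down`) are illustrative rather than taken from any particular library.

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt, "==": operator.eq}

def covers(rule, sample):
    """A rule body is a list of (attribute, op, value) literals; an empty body covers everything."""
    return all(OPS[op](sample[attr], value) for attr, op, value in rule)

def accuracy(rule, samples):
    """Return (accuracy on covered samples, number of covered samples)."""
    covered = [s for s in samples if covers(rule, s)]
    if not covered:
        return 0.0, 0
    positives = sum(1 for s in covered if s["label"])
    return positives / len(covered), len(covered)

def specialize_top_down(samples, candidate_literals, target_accuracy=0.9):
    """Greedy top-down search: start from the empty body and repeatedly add
    the literal that most improves accuracy until the target is reached."""
    rule = []                                   # head <- .  (empty body)
    best_acc, _ = accuracy(rule, samples)
    while best_acc < target_accuracy:
        scored = []
        for lit in candidate_literals:
            if lit in rule:
                continue
            acc, cov = accuracy(rule + [lit], samples)
            if cov > 0:
                scored.append((acc, cov, lit))
        if not scored:
            break
        acc, cov, lit = max(scored, key=lambda t: (t[0], t[1]))  # best accuracy, ties by coverage
        if acc <= best_acc:
            break                               # no improvement possible
        rule, best_acc = rule + [lit], acc
    return rule, best_acc
```

On suitable data, a call like `specialize_top_down(samples, [("income", ">=", 50000), ("credit_score", ">=", 700)])` would reproduce the progression shown above.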

Bottom-Up Search (Specific to General)

Start with a specific rule (covers a single positive example) and gradually remove literals to generalize the rule, increasing coverage while maintaining accuracy.

Algorithm Steps:

  1. Initialize: Start with a rule covering a single positive example $x$: $r: \text{head} \leftarrow \bigwedge_{i} (a_i = x_i)$
  2. Generate candidates: For each literal $l$ in the current rule, create candidate $r \setminus \{l\}$ (remove the literal)
  3. Evaluate candidates: Check whether removing the literal maintains acceptable accuracy
  4. Select best: Choose the candidate with the best coverage-accuracy trade-off
  5. Repeat: Continue until the coverage threshold is met or accuracy drops too low

Example Progression:

$$\text{approve} \leftarrow (\text{income} = 55000) \land (\text{credit\_score} = 720) \land (\text{employment} = \text{stable}) \;\Rightarrow\; \text{approve} \leftarrow (\text{income} \geq 50000) \land (\text{credit\_score} \geq 700)$$
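A corresponding sketch of bottom-up generalization, reusing the hypothetical `covers`/`accuracy` helpers from the top-down sketch above; the seed rule conjoins one equality literal per attribute of a chosen positive example, and the loop drops whichever literal buys the most coverage without falling below a minimum accuracy.

```python
def seed_rule(example, attributes):
    """Most specific rule: one equality literal per attribute of the positive example."""
    return [(attr, "==", example[attr]) for attr in attributes]

def generalize_bottom_up(example, attributes, samples, min_accuracy=0.9):
    """Greedy bottom-up search: repeatedly drop the literal whose removal
    keeps accuracy >= min_accuracy and yields the largest coverage."""
    rule = seed_rule(example, attributes)
    improved = True
    while improved and len(rule) > 1:
        improved = False
        candidates = []
        for lit in rule:
            shorter = [l for l in rule if l != lit]
            acc, cov = accuracy(shorter, samples)   # helper from the top-down sketch
            if acc >= min_accuracy:
                candidates.append((cov, acc, shorter))
        if candidates:
            _, _, rule = max(candidates, key=lambda c: (c[0], c[1]))
            improved = True
    return rule
```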

Comparison

Top-Down:

  • ✓ Clear search direction
  • ✓ Less likely to miss important literals
  • ✗ May add redundant literals
  • ✗ Can get stuck in local optima

Bottom-Up:

  • ✓ Avoids redundant literals
  • ✓ More concise rules
  • ✗ Large initial search space
  • ✗ May cover negative examples

Evaluation Metrics

Evaluation metrics determine which candidate rule is best when adding or removing literals. Common metrics balance accuracy and coverage.

Accuracy

Proportion of covered samples that are positive examples:

$$\text{accuracy}(r) = \frac{|\text{covered\_positives}(r)|}{|\text{covered\_positives}(r)| + |\text{covered\_negatives}(r)|}$$

Higher accuracy means fewer negative examples are covered. A rule with 100% accuracy covers only positive examples.
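As a quick sketch, the metric reduces to a ratio of class counts over the covered samples (the function name is illustrative):

```python
def rule_accuracy(covered_pos: int, covered_neg: int) -> float:
    """Accuracy of a rule from the class counts of the samples it covers."""
    covered = covered_pos + covered_neg
    return covered_pos / covered if covered else 0.0

print(rule_accuracy(22, 3))   # 0.88 -- a rule covering 22 positives and 3 negatives
```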

Information Gain (Entropy Reduction)

Measures the reduction in entropy (uncertainty) when adding a literal to the rule:

$$\text{IG}(r, l) = H(D) - H(D \mid r \land l)$$

Where $H(D)$ is the entropy of the dataset before adding literal $l$, and $H(D \mid r \land l)$ is the entropy of the samples covered by rule $r \land l$.

Entropy Formula:

$$H(D) = -\sum_{i} p_i \log_2 p_i$$

Where $p_i$ is the proportion of class $i$ in dataset $D$.
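A minimal sketch of both quantities, estimating entropies from raw class counts (function names are illustrative):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(counts_dataset, counts_covered):
    """IG(r, l) = H(D) - H(D | r ∧ l), both estimated from class counts."""
    return entropy(counts_dataset) - entropy(counts_covered)

print(entropy([50, 50]))                               # 1.0 bit: maximally uncertain
print(round(information_gain([50, 50], [9, 1]), 3))    # ≈ 0.531: covered samples are much purer
```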

Gini Coefficient (Impurity)

Measures the impurity of covered samples. Lower Gini means higher purity (more homogeneous):

$$\text{Gini}(D) = 1 - \sum_{i} p_i^2$$

For two classes, Gini ranges from 0 (pure, all samples in the same class) to 0.5 (maximum impurity, equal classes). A rule with a low Gini coefficient is preferred.
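A corresponding sketch for Gini impurity from class counts:

```python
def gini(counts):
    """Gini impurity of a class distribution given raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> pure (all one class)
print(gini([5, 5]))    # 0.5 -> maximum impurity for two classes
```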

Avoiding Local Optima: Beam Search

Instead of keeping only the best candidate rule at each step (greedy), beam search maintains the top-$k$ candidate rules, exploring more of the search space and reducing the risk of getting stuck in local optima.

Beam Search Algorithm

  1. Initialize beam: Start with the top-$k$ most general rules (for top-down) or the top-$k$ most specific rules (for bottom-up)
  2. Expand all candidates: For each rule in the beam, generate all possible candidate rules (add/remove literals)
  3. Evaluate all candidates: Compute the evaluation metric for each candidate
  4. Select top-$k$: Keep the top-$k$ candidates with the highest metric values as the new beam
  5. Repeat: Continue until a termination condition is met (accuracy threshold, max depth, etc.)
  6. Return best: Select the best rule from the final beam
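Below is a minimal sketch of the top-down variant of beam search, reusing the hypothetical `accuracy` helper from the earlier top-down sketch; the beam width `k`, depth limit, and stopping rule are assumptions for illustration.

```python
def beam_search(samples, candidate_literals, k=3, max_literals=4, target_accuracy=0.9):
    """Top-down beam search: keep the k best partial rules at every depth."""
    beam = [[]]                                     # start from the empty body
    best_rule, best_acc = [], accuracy([], samples)[0]
    for _ in range(max_literals):
        expansions = []
        for rule in beam:
            for lit in candidate_literals:
                if lit in rule:
                    continue
                cand = rule + [lit]
                acc, cov = accuracy(cand, samples)
                if cov > 0:
                    expansions.append((acc, cov, cand))
        if not expansions:
            break
        expansions.sort(key=lambda e: (e[0], e[1]), reverse=True)
        beam = [cand for _, _, cand in expansions[:k]]          # new beam: top-k candidates
        if expansions[0][0] > best_acc:
            best_acc, best_rule = expansions[0][0], expansions[0][2]
        if best_acc >= target_accuracy:
            break
    return best_rule, best_acc
```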

Advantages:

  • Explores multiple search paths simultaneously
  • Less likely to miss optimal rules
  • Balances exploration and exploitation
  • Parameter $k$ controls search breadth

Customer Churn Prediction Example

Apply top-down search to learn a rule predicting customer churn. Training set: 100 customers (60 churn, 40 stay). Use beam search with $k = 3$.

Top-Down Search Progression

| Step | Rule | Coverage (samples) | Accuracy | Description |
|------|------|--------------------|----------|-------------|
| 1 | churn ← . | All (100) | 60% | Most general rule |
| 2 | churn ← (age < 30) | 40 | 75% | Add age condition |
| 3 | churn ← (age < 30) ∧ (usage < 5hrs) | 25 | 88% | Add usage condition |
| 4 | churn ← (age < 30) ∧ (usage < 5hrs) ∧ (support_tickets > 2) | 15 | 93% | Final specialized rule |

At each step, beam search maintains the top-3 candidate rules. The step-4 rule achieves 93% accuracy with 15 samples covered, meeting the accuracy threshold.

Evaluation Metric Calculation

Step 3: Adding usage condition

Current rule: $\text{churn} \leftarrow (\text{age} < 30)$; candidate literal: $\text{usage} < 5\,\text{hrs}$

Candidate coverage: 25 samples (22 churn, 3 stay)

$$\text{accuracy} = \frac{22}{25} = 0.88 = 88\%$$

Information gain: $\text{IG} = H(D) - H(D \mid r \land l) = 0.971 - 0.529 = 0.442$, where $0.971$ is the entropy of the full 60/40 churn/stay split and $0.529$ is the entropy of the 22/3 split covered by the candidate rule.

Adding the usage condition increases accuracy from 75% to 88% with a significant information gain.
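The stated numbers can be checked directly; this snippet recomputes them from the class counts (60/40 overall, 22/3 covered):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

H_D   = entropy([60, 40])   # full training set: 60 churn, 40 stay
H_cov = entropy([22, 3])    # samples covered by the candidate rule
print(round(22 / 25, 2))                                       # 0.88 accuracy
print(round(H_D, 3), round(H_cov, 3), round(H_D - H_cov, 3))   # 0.971 0.529 0.442
```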

Final Rule:

$$\text{churn} \leftarrow (\text{age} < 30) \land (\text{usage} < 5\,\text{hrs}) \land (\text{support\_tickets} > 2)$$

This rule identifies young customers with low usage and high support tickets as likely to churn, with 93% accuracy covering 15% of the dataset.

Advantages and Limitations

Advantages

  • Systematic search: Explores rule space systematically
  • Flexible metrics: Can use various evaluation criteria
  • Beam search: Reduces local optima risk
  • Interpretable process: Clear search progression

Limitations

  • Combinatorial explosion: Large search space for many attributes
  • Metric selection: Choice of metric affects results
  • Local optima: Even beam search may miss global optimum
  • Computational cost: Can be slow for large datasets