Master decision tree algorithms including ID3, C4.5, CART, and pruning techniques
Entropy measures the disorder or uncertainty in a dataset:

H(S) = -Σ p_i log₂ p_i

• p_i = proportion of class i in the dataset
• H(S) = 0 when all samples belong to one class (pure)
• H(S) = 1 for binary classification with equal split (most impure)
Reduction in entropy after splitting on attribute A:

IG(S, A) = H(S) - Σ_v (|S_v|/|S|) H(S_v)
Choose attribute with maximum information gain for splitting
Dataset: 9 positive, 5 negative
Left: 7 pos, 1 neg; Right: 2 pos, 4 neg
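A minimal Python sketch that reproduces this calculation (the helper names `entropy` and `information_gain` are ours, not from any library):

```python
import math

def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Dataset: 9 positive, 5 negative; split into Left (7, 1) and Right (2, 4)
print(entropy([9, 5]))                              # ~0.940 bits
print(information_gain([9, 5], [[7, 1], [2, 4]]))   # ~0.236 bits
```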
Gini index measures the probability of misclassification:

Gini(S) = 1 - Σ p_i²

• Gini = 0 when all samples belong to one class
• Gini = 0.5 for binary classification with equal split
• Lower Gini = purer node

Example with 9 positive, 5 negative:
Gini = 1 - (9/14)² - (5/14)² ≈ 0.459
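The same style of sketch for the Gini index (the `gini` helper is illustrative, not a library function):

```python
def gini(counts):
    """Gini impurity of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # ~0.459
print(gini([14, 0]))   # 0.0  (pure node)
print(gini([7, 7]))    # 0.5  (maximally impure binary node)
```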
| Algorithm | Criterion | Splits | Notes |
|---|---|---|---|
| ID3 | Information Gain (Entropy) | Multi-way (one branch per value) | Limitations: biased toward attributes with many values; no native handling of continuous attributes or missing values |
| C4.5 | Gain Ratio | Multi-way (one branch per value) | Improvements: gain ratio, continuous attributes, missing values, pruning |
| CART | Gini Index | Binary (two branches only) | Features: supports classification and regression; cost-complexity pruning |
Gain ratio addresses ID3's bias toward attributes with many values:

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)

SplitInfo(S, A) = -Σ_v (|S_v|/|S|) log₂(|S_v|/|S|)

• SplitInfo measures how broadly and uniformly the attribute splits the data
• Dividing by it penalizes attributes that split the data into many small subsets
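A small sketch of the penalty (helper names are ours): `split_info` is just the entropy of the partition sizes themselves, so many tiny branches drive it up and pull the gain ratio down.

```python
import math

def split_info(subset_sizes):
    """Split information: entropy of the partition sizes themselves."""
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s > 0)

def gain_ratio(info_gain, subset_sizes):
    """Gain ratio = information gain / split information."""
    si = split_info(subset_sizes)
    return info_gain / si if si > 0 else 0.0

print(split_info([14]))        # 0.0   (no split at all)
print(split_info([5, 4, 5]))   # ~1.58 (a reasonable 3-way split)
print(split_info([1] * 14))    # ~3.81 (one branch per sample -> large penalty)
```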
Pruning simplifies the tree by removing branches with little predictive power, improving generalization.
Stop tree growth early based on criteria such as maximum depth, minimum samples per node, or a minimum gain threshold:
✓ Fast, no separate validation needed
✗ May stop too early
Grow full tree, then prune back:
✓ Better results, no premature stopping
✗ Computationally expensive
C_α(T) = R(T) + α|T|, where R(T) is the misclassification error, |T| is the number of leaves, and α balances complexity. Find the subtree that minimizes C_α(T) using cross-validation.
For continuous attributes, find the optimal threshold by testing binary splits of the form A ≤ t vs. A > t, with candidate thresholds at the midpoints between consecutive sorted values:
Example:
Age values: {10, 20, 30, 40}
Test thresholds: 15, 25, 35
Choose best split: Age ≤ 25 vs Age > 25
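As a rough sketch of this midpoint search (the Yes/No labels attached to the Age values are hypothetical, and the helper names are ours):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return the threshold
    with the highest information gain (simple, unoptimized sketch)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    parent = [sum(1 for _, y in pairs if y == c) for c in classes]
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                                # no threshold between equal values
        t = (v1 + v2) / 2                           # candidate threshold, e.g. 15, 25, 35
        left = [sum(1 for v, y in pairs if v <= t and y == c) for c in classes]
        right = [p - l for p, l in zip(parent, left)]
        n = len(pairs)
        ig = (entropy(parent)
              - (sum(left) / n) * entropy(left)
              - (sum(right) / n) * entropy(right))
        if ig > best[1]:
            best = (t, ig)
    return best

print(best_threshold([10, 20, 30, 40], ["Yes", "Yes", "No", "No"]))  # (25.0, 1.0)
```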
Three strategies for handling missing attribute values:
1. Most Common Value
Assign most frequent value among samples at node
2. Most Common by Class
Use most frequent value within the same class
3. Probability-based Split
Assign fractional samples to each branch based on split proportions (C4.5 approach)
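A brief sketch of strategies 1 and 3, using the Contract counts from the churn example that follows (variable names are illustrative):

```python
from collections import Counter

# How the non-missing samples at this node split on Contract:
branch_counts = Counter({"Month-to-Month": 5, "Annual": 3, "Biennial": 2})
total = sum(branch_counts.values())

# Strategy 1: send a sample with a missing value down the most common branch.
most_common_branch = branch_counts.most_common(1)[0][0]   # "Month-to-Month"

# Strategy 3 (C4.5-style): split the sample into fractional pieces,
# one per branch, weighted by the observed branch proportions.
fractional_weights = {b: c / total for b, c in branch_counts.items()}
print(most_common_branch, fractional_weights)
# Month-to-Month {'Month-to-Month': 0.5, 'Annual': 0.3, 'Biennial': 0.2}
```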
| Contract | Support Calls | Tenure (months) | Churned? |
|---|---|---|---|
| Month-to-Month | 5 | 3 | Yes |
| Annual | 1 | 24 | No |
| Month-to-Month | 4 | 2 | Yes |
| Annual | 2 | 18 | No |
| Month-to-Month | 3 | 6 | Yes |
| Biennial | 0 | 36 | No |
| Annual | 1 | 12 | No |
| Month-to-Month | 6 | 1 | Yes |
| Biennial | 1 | 48 | No |
| Month-to-Month | 2 | 8 | No |
Parent node: 5 churned, 5 didn't churn → H(parent) = 1.000 bit
Split on Contract:
• Month-to-Month: 5 customers (4 churn, 1 no) → H = 0.722 bits
• Annual: 3 customers (0 churn, 3 no) → H = 0 bits (pure)
• Biennial: 2 customers (0 churn, 2 no) → H = 0 bits (pure)
Weighted child entropy = (5/10)(0.722) + (3/10)(0) + (2/10)(0) = 0.361 bits, so IG(Contract) = 1.000 - 0.361 = 0.639 bits.
Contract Type
├── Month-to-Month → [Further split on Support Calls]
├── Annual → No Churn
└── Biennial → No Churn
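One way to reproduce this example in code, assuming scikit-learn and pandas are available. Note that scikit-learn implements CART, which uses binary splits, so its tree will not match the multi-way tree above exactly:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The churn table above, encoded as a DataFrame.
data = pd.DataFrame({
    "Contract": ["Month-to-Month", "Annual", "Month-to-Month", "Annual", "Month-to-Month",
                 "Biennial", "Annual", "Month-to-Month", "Biennial", "Month-to-Month"],
    "SupportCalls": [5, 1, 4, 2, 3, 0, 1, 6, 1, 2],
    "TenureMonths": [3, 24, 2, 18, 6, 36, 12, 1, 48, 8],
    "Churned":      ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

X = pd.get_dummies(data[["Contract", "SupportCalls", "TenureMonths"]])  # one-hot encode Contract
y = data["Churned"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```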
H(S) = -Σ p_i log₂ p_i, where p_i is the proportion of class i in set S. Measures average "surprise" or uncertainty.
Convention: 0 log 0 = 0 (limiting case as p → 0)
Property 1: Non-negativity
Proof: Since 0 ≤ p_i ≤ 1, we have log p_i ≤ 0, so -p_i log p_i ≥ 0. Sum of non-negative terms is non-negative.
Property 2: Maximum Entropy
Proof (using Lagrange multipliers):
Maximize H = -Σp_i log p_i subject to Σp_i = 1
Form Lagrangian: L = -Σp_i log p_i - λ(Σp_i - 1)
∂L/∂p_i = -log p_i - 1/ln(2) - λ = 0
This gives p_i = constant = 1/K (uniform distribution)
Maximum entropy: H_max = -Σ(1/K)log(1/K) = log K
Uniform distribution has maximum uncertainty!
Property 3: Concavity
Entropy is a concave function of the probability distribution. This means mixing distributions increases uncertainty.
Property 4: Additivity
For independent variables X and Y: H(X, Y) = H(X) + H(Y)
For binary classification with positive proportion p: H(p) = -p log₂ p - (1-p) log₂(1-p)
Maximum at p = 0.5 (50-50 split), minimum at p = 0 or p = 1 (pure).
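A quick numerical check of Properties 2 and 4 and the binary case (the `H` helper is ours):

```python
import math

def H(probs):
    """Entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Property 2: the uniform distribution over K classes maximizes entropy at log2(K).
K = 4
print(H([1 / K] * K), math.log2(K))     # 2.0 2.0

# Property 4: for independent X, Y the joint entropy is the sum of the marginals.
px, py = [0.7, 0.3], [0.5, 0.25, 0.25]
joint = [a * b for a in px for b in py]
print(H(joint), H(px) + H(py))          # both ~2.381

# Binary entropy peaks at p = 0.5 and vanishes at p = 0 or 1.
print(H([0.5, 0.5]), H([1.0]))          # 1.0 0.0
```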
IG(S, A) = H(S) - Σ_v (|S_v|/|S|) H(S_v), where S_v is the subset with attribute A = v.
Information Gain = Parent Entropy - Weighted Average of Child Entropies
Dataset: Play Tennis (14 examples, 9 Yes, 5 No)
Step 1: Parent Entropy
H(S) = -(9/14)log(9/14) - (5/14)log(5/14) = 0.940 bits
Step 2: Split on "Outlook" (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rain: 3 Yes / 2 No)
Step 3: Calculate Child Entropies
H(Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971 bits
H(Overcast) = 0 bits (pure!)
H(Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971 bits
Step 4: Weighted Average
H_children = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits
Step 5: Information Gain
IG(S, Outlook) = 0.940 - 0.693 = 0.247 bits
This split reduces uncertainty by 0.247 bits! Choose attribute with highest IG.
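The same five steps, reproduced as a short sketch (per-Outlook counts taken from Step 3 above; helper names are ours):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Play Tennis: (Yes, No) counts overall and per Outlook value.
parent = (9, 5)
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

n = sum(parent)
weighted = sum(sum(c) / n * entropy(c) for c in outlook.values())
print(entropy(parent))             # ~0.940
print(weighted)                    # ~0.69
print(entropy(parent) - weighted)  # ~0.247  -> IG(S, Outlook)
```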
Problem: IG favors attributes with many values (e.g., ID has perfect IG but no generalization).
Solution: Normalize by split information:

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -Σ_v (|S_v|/|S|) log₂(|S_v|/|S|)

SplitInfo penalizes attributes that create many small partitions.
Gini(S) = 1 - Σ p_i², where p_i is the proportion of class i. Measures probability of misclassifying a random sample.
Probabilistic Interpretation:
If we pick a random sample from S and randomly assign it a class according to the distribution in S, Gini(S) is the probability of incorrect classification.
Gini Index vs. Entropy
For binary case with p_1 = p, p_2 = 1 - p:
• Gini(p) = 2p(1 - p)
• H(p) = -p log₂ p - (1-p) log₂(1-p)
Taylor expansion of H around p = 0.5:
H(p) ≈ 1 - (2/ln 2)(p - 0.5)², compared with Gini(p) = 0.5 - 2(p - 0.5)².
Both are quadratic near p=0.5, explaining similar behavior in practice!
Choose split that maximizes Gini gain (minimizes weighted child impurity).
Problem: Find the best split among m features for n samples with K classes.
For each feature: evaluate candidate splits by counting how many samples of each class fall in each branch.
Total per feature: O(n × K)
For m features: O(m × n × K)
For continuous features, recounting the class distribution at every threshold would be quadratic. Better approach: Sort once, then scan for the best threshold.
Algorithm: sort the feature values in O(n log n), then sweep the candidate thresholds in sorted order, moving one sample at a time from the right partition to the left and updating the class counts incrementally.
Per feature: O(n log n + nK)
For m features: O(m(n log n + nK)) = O(m × n log n)
(assuming K < log n, which is typical)
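A sketch of the sort-once-then-scan idea (helper names are ours): the left/right class counts are updated in O(1) per sample instead of being recomputed for every threshold.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def best_threshold_sorted(values, labels):
    """Sort once (O(n log n)), then sweep thresholds left to right,
    updating class counts incrementally (O(nK)) instead of recounting."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left, right = Counter(), Counter(lab for _, lab in pairs)
    parent_h = entropy_from_counts(right)      # entropy before splitting
    best_t, best_gain = None, -1.0
    for i, (v, lab) in enumerate(pairs[:-1]):
        left[lab] += 1
        right[lab] -= 1                        # O(1) incremental update per sample
        if v == pairs[i + 1][0]:
            continue                           # no valid threshold between equal values
        t = (v + pairs[i + 1][0]) / 2
        gain = (parent_h
                - ((i + 1) / n) * entropy_from_counts(left)
                - ((n - i - 1) / n) * entropy_from_counts(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold_sorted([10, 20, 30, 40], ["Yes", "Yes", "No", "No"]))  # (25.0, 1.0)
```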
Worst case: completely imbalanced tree (one sample per leaf)
• Tree depth: O(n)
• At depth d: process (n-d) samples
• Total: Σ m(n-d) log(n-d) for d = 0 to n, which is O(m × n² × log n)
Best case: balanced tree
• Tree depth: O(log n)
• At each level: process all n samples
• Total: log n levels × O(mn log n) per level = O(m × n × log² n)
This is the expected case for reasonable data!
For a single sample: follow one path from root to leaf, so prediction costs O(depth), which is O(log n) for a balanced tree.
Very fast prediction, especially compared to O(n) for k-NN!
For a subtree T, define the cost-complexity measure: C_α(T) = R(T) + α|T|
• R(T) = misclassification rate (or sum of impurities) of tree T
• |T| = number of leaf nodes in T
• α ≥ 0 = complexity parameter (regularization)
Goal: Find subtree T_α that minimizes C_α(T) for given α.
For each internal node t, compute the effective α at which pruning t becomes worthwhile:

α_t = (R(t) - R(T_t)) / (|T_t| - 1)

where T_t is the subtree rooted at t, R(t) is the error if t is collapsed to a single leaf, and |T_t| is the number of leaves of T_t.
Find the node t* with the smallest α_t, prune its subtree (collapse it to a leaf), and repeat; this yields a nested sequence of subtrees T_0 ⊃ T_1 ⊃ … ⊃ {root}.
Optimal Subtree Theorem:
For each α, the tree T_α in this sequence minimizes C_α among all possible subtrees.
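In practice this procedure is exposed directly by scikit-learn; here is a sketch (assuming scikit-learn ≥ 0.22, and using its built-in breast-cancer dataset purely as a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# cost_complexity_pruning_path returns the effective alphas of the nested subtree sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated accuracy.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    model = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```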
Reduced-error pruning requires a separate validation set: consider each internal node in turn, replace its subtree with a leaf, and keep the change if validation error does not increase.
Pro: Simpler, guaranteed not to increase validation error.
Con: Requires separate validation set (less data for training).