Master the entropy-based approach to selecting optimal splitting attributes in decision trees with the classic ID3 algorithm
Information Gain measures the reduction in entropy (uncertainty) achieved by partitioning data according to a given attribute. It quantifies how much information about the class label we gain by knowing the value of that attribute. The ID3 (Iterative Dichotomiser 3) algorithm, developed by Ross Quinlan in 1986, uses information gain to build decision trees by selecting the attribute that provides the maximum information gain at each step.
Imagine you're trying to guess someone's class label and can ask questions about their attributes. Information gain tells you which question reduces your uncertainty the most. A good split creates child nodes that are more "pure" (homogeneous) than the parent node.
Example: If knowing someone's income instantly tells you whether they bought a house (high income → purchased, low income → didn't purchase), income has high information gain. If knowing their favorite color tells you nothing about house purchase, it has zero information gain.
High information gain: the attribute creates pure or nearly pure child nodes, where most samples in each child belong to one class. This attribute effectively separates the classes and is a good candidate for splitting.
Low information gain: the attribute creates mixed child nodes whose class distributions resemble the parent's. The split does little to distinguish between classes, so the attribute is less useful for splitting.
Measuring uncertainty in a dataset
Entropy is a concept from information theory that measures the impurity, disorder, or uncertainty in a dataset. Claude Shannon introduced it in 1948 as a measure of information content. In machine learning, entropy quantifies how mixed the class labels are in a set of samples.
For a dataset D with |Y| classes:
$$\mathrm{Ent}(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k$$
Where p_k is the proportion of samples belonging to class k.
Note: By convention, 0 log₂ 0 = 0 (if a class doesn't appear, it contributes nothing to entropy)
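To make the formula concrete, here is a minimal Python sketch of the entropy calculation; the function name `entropy` and the choice to pass a list of class labels are illustrative conventions, not part of ID3 itself.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / total
        # Counter only reports classes that actually occur, so p > 0 and the
        # 0 * log2(0) = 0 convention is handled implicitly.
        ent -= p * log2(p)
    return ent
```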
Entropy = 0: perfect purity, all samples belong to one class.
Example: 20 Yes, 0 No
p(Yes) = 1.0, p(No) = 0.0
Ent = -(1.0 × log₂ 1.0) = 0
Entropy = 1: maximum uncertainty, an equal class distribution (binary case).
Example: 10 Yes, 10 No
p(Yes) = 0.5, p(No) = 0.5
Ent = -(0.5 × log₂ 0.5 + 0.5 × log₂ 0.5) = 1.0
Entropy between 0 and 1: some impurity, the classes are imbalanced.
Example: 15 Yes, 5 No
p(Yes) = 0.75, p(No) = 0.25
Ent = -(0.75 × log₂ 0.75 + 0.25 × log₂ 0.25) ≈ 0.811
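Assuming the `entropy` helper sketched above, the three reference cases can be checked directly:

```python
print(entropy(["Yes"] * 20))                 # 0.0   (pure node)
print(entropy(["Yes"] * 10 + ["No"] * 10))   # 1.0   (balanced node)
print(entropy(["Yes"] * 15 + ["No"] * 5))    # ~0.811 (imbalanced node)
```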
Dataset: 100 house buyers: 60 purchased (Yes), 40 didn't purchase (No)
Step 1: Calculate proportions
p(Yes) = 60/100 = 0.6
p(No) = 40/100 = 0.4
Step 2: Calculate log values
log₂(0.6) ≈ -0.737
log₂(0.4) ≈ -1.322
Step 3: Apply entropy formula
Ent(D) = -(0.6 × -0.737 + 0.4 × -1.322)
Ent(D) = -(-0.442 - 0.529)
Ent(D) ≈ 0.971
Interpretation: Entropy of 0.971 (close to 1.0) indicates high uncertainty—the dataset is fairly mixed.
Information Gain is the reduction in entropy after splitting the dataset on an attribute. It compares the entropy before the split (parent node) with the weighted average entropy after the split (child nodes).
$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
Where:
a is the attribute being evaluated and V is the number of distinct values it takes
D^v is the subset of samples in D for which attribute a takes its v-th value
|D^v| / |D| weights each child node by the fraction of samples it receives
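A minimal Python sketch of this formula, reusing the hypothetical `entropy` helper from earlier and assuming each sample is a dict keyed by attribute name:

```python
from collections import defaultdict

def information_gain(rows, attribute, target="Purchased"):
    """Gain(D, a): entropy reduction from splitting `rows` on `attribute`.

    `rows` is a list of dicts mapping column names to values; `target`
    names the class-label column.
    """
    parent_labels = [row[target] for row in rows]
    # Partition the class labels into one subset D^v per attribute value.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted_ent = sum(
        len(labels) / len(rows) * entropy(labels) for labels in subsets.values()
    )
    return entropy(parent_labels) - weighted_ent
```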
High gain: the split significantly reduces entropy, and the child nodes are much purer than the parent. This attribute is an excellent choice for splitting; a gain of 0.8 or more indicates a very strong split.
Low gain: the split barely reduces entropy, and the child nodes have class distributions similar to the parent's. This attribute provides little information; a gain below 0.1 is typically not worth splitting on.
A real estate analytics company wants to predict house purchases. Here are the first 10 records from their dataset of 200 potential buyers:
| ID | Income | Credit | Debt | Employment | Location | Purchased |
|---|---|---|---|---|---|---|
| 1 | High | Excellent | Low | Stable | Urban | Yes |
| 2 | High | Good | Low | Stable | Suburban | Yes |
| 3 | Medium | Good | Medium | Stable | Urban | Yes |
| 4 | Low | Fair | High | Unstable | Rural | No |
| 5 | Low | Poor | High | Unstable | Suburban | No |
| 6 | Medium | Fair | Medium | Contract | Urban | Yes |
| 7 | High | Excellent | Low | Stable | Suburban | Yes |
| 8 | Low | Fair | High | Contract | Rural | No |
| 9 | Medium | Good | Low | Stable | Urban | Yes |
| 10 | Low | Poor | High | Unstable | Rural | No |
Full dataset: 200 samples total—110 purchased (Yes), 90 didn't purchase (No). Let's calculate information gain for the 'Income' attribute to decide if it's a good split.
Root node contains all 200 samples: 110 Yes, 90 No
p(Yes) = 110/200 = 0.55
p(No) = 90/200 = 0.45
Ent(D) = -(0.55 × log₂ 0.55 + 0.45 × log₂ 0.45)
Ent(D) = -(0.55 × -0.863 + 0.45 × -1.152)
Ent(D) = -(-0.475 - 0.518)
Ent(D) ≈ 0.993
Split data by Income attribute (High, Medium, Low):
Income = High (80 samples): 70 purchased (Yes), 10 didn't (No)
p(Yes) = 70/80 = 0.875, p(No) = 10/80 = 0.125
Ent(D_High) = -(0.875 × log₂ 0.875 + 0.125 × log₂ 0.125)
Ent(D_High) ≈ 0.544
Income = Medium (70 samples): 35 purchased (Yes), 35 didn't (No)
p(Yes) = 35/70 = 0.5, p(No) = 35/70 = 0.5
Ent(D_Medium) = -(0.5 × log₂ 0.5 + 0.5 × log₂ 0.5)
Ent(D_Medium) = 1.0 (maximum uncertainty)
Income = Low (50 samples): 5 purchased (Yes), 45 didn't (No)
p(Yes) = 5/50 = 0.1, p(No) = 45/50 = 0.9
Ent(D_Low) = -(0.1 × log₂ 0.1 + 0.9 × log₂ 0.9)
Ent(D_Low) ≈ 0.469
Weight each child's entropy by the proportion of samples it contains:
Weighted_Ent = (80/200) × 0.544 + (70/200) × 1.0 + (50/200) × 0.469
Weighted_Ent = 0.4 × 0.544 + 0.35 × 1.0 + 0.25 × 0.469
Weighted_Ent = 0.218 + 0.350 + 0.117
Weighted_Ent = 0.685
Gain(D, Income) = Ent(D) - Weighted_Ent
Gain(D, Income) = 0.993 - 0.685
Gain(D, Income) = 0.308
Interpretation: Information gain of 0.308 is moderate. Splitting on Income reduces entropy by about 31%, making it a reasonable choice. We would calculate gain for other attributes (Credit, Debt, Employment, Location) and choose the one with highest gain.
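The arithmetic above can be replicated with a short script; the per-branch counts (80, 70, and 50 samples) are taken directly from the example, and `entropy_from_counts` is an illustrative helper, not a library function:

```python
from math import log2

def entropy_from_counts(yes, no):
    """Entropy (bits) of a node given its Yes/No class counts."""
    total = yes + no
    ent = 0.0
    for count in (yes, no):
        if count:  # skip empty classes: 0 * log2(0) is taken as 0
            p = count / total
            ent -= p * log2(p)
    return ent

parent_ent = entropy_from_counts(110, 90)
branches = [(80, 70, 10), (70, 35, 35), (50, 5, 45)]  # (size, yes, no) per Income value
weighted_ent = sum(size / 200 * entropy_from_counts(y, n) for size, y, n in branches)

print(round(parent_ent, 3))                  # 0.993
print(round(weighted_ent, 3))                # 0.685
print(round(parent_ent - weighted_ent, 3))   # 0.308
```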
Building decision trees with information gain
ID3 (Iterative Dichotomiser 3), developed by Ross Quinlan in 1986, was one of the first decision tree algorithms. It uses information gain to select the best attribute at each step and recursively builds the tree until all samples are classified or no attributes remain.
Step 1: Place all training samples in the root node. If all samples belong to the same class, make it a leaf and stop.
Step 2: For each unused attribute, compute Gain(D, attribute) using the information gain formula. This requires calculating entropy for the current node and for all potential child nodes.
Step 3: Choose a* = argmax_a Gain(D, a). This attribute becomes the splitting criterion for the current node.
Step 4: For each possible value v of the selected attribute, create a branch and a child node, and partition the samples into subsets D^v based on their attribute values.
Step 5: For each child node with subset D^v, repeat steps 2-5 using the remaining unused attributes. Continue until a stopping condition is met (a pure node, no attributes left, or an empty subset).
Step 6: When a stopping condition is met, create a leaf node and assign it the majority class label of the samples reaching that node. (A minimal code sketch of this procedure follows the steps.)
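A compact recursive sketch of these steps, assuming the hypothetical `information_gain` helper from earlier and samples represented as dicts; this illustrates the control flow rather than reproducing Quinlan's original implementation:

```python
from collections import Counter

def majority_class(rows, target):
    """Most common class label among `rows` (used for leaf nodes)."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def id3(rows, attributes, target="Purchased"):
    labels = {row[target] for row in rows}
    # Stopping conditions: pure node, or no attributes left to split on.
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return majority_class(rows, target)
    # Steps 2-3: pick the attribute with the maximum information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Steps 4-5: branch on each observed value of `best` and recurse.
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```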
A major limitation of information gain is its inherent bias toward attributes with many distinct values. Attributes that split data into many small subsets will naturally have lower weighted entropy and thus higher information gain—even if they don't actually provide useful information for prediction.
Suppose we add "Customer ID" (unique for each customer) as an attribute to our housing dataset:
Scenario:
200 samples, each with a unique Customer ID (ID1, ID2, ..., ID200)
What happens if we split on Customer ID:
Creates 200 child nodes, each with exactly 1 sample. Each child is perfectly pure (entropy = 0).
Weighted_Ent = (1/200) × 0 + (1/200) × 0 + ... = 0
Gain(D, CustomerID) = 0.993 - 0 = 0.993 (maximum possible!)
Problem:
Customer ID gets selected as the best attribute, even though it's completely useless for prediction! It perfectly "memorizes" training data but has zero generalization ability. A new customer will never match any existing ID.
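The effect is easy to reproduce with a synthetic dataset mirroring the 110/90 class split above and a unique ID per row; using the hypothetical `information_gain` helper from earlier, the gain collapses to the full parent entropy:

```python
# 200 synthetic rows mirroring the example: unique CustomerID, 110 Yes / 90 No.
rows = [
    {"CustomerID": f"ID{i + 1}", "Purchased": "Yes" if i < 110 else "No"}
    for i in range(200)
]

# Every child node holds exactly one sample, so the weighted child entropy is 0
# and the gain equals the parent entropy.
print(round(information_gain(rows, "CustomerID", "Purchased"), 3))  # 0.993
```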
Date attribute: if recorded as "YYYY-MM-DD", it could have hundreds of unique values, producing high information gain that is unlikely to be useful unless there's a strong temporal pattern.
Better: Extract useful features like "Month" (12 values), "Season" (4 values), or "Day of Week" (7 values), as in the sketch after these examples.
Zip code attribute: with roughly 40,000 zip codes in the US, the attribute has very high information gain simply because of its many values, but it tends to overfit to specific locations rather than capture generalizable patterns.
Better: Use "Region" (Northeast, South, etc.) or "Urban/Suburban/Rural" instead.
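One way to derive those coarser date features with Python's standard library; the function and feature names here are illustrative:

```python
from datetime import date

SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}

def coarse_date_features(d: date) -> dict:
    """Replace a raw 'YYYY-MM-DD' value with a few low-cardinality features."""
    return {
        "Month": d.strftime("%B"),       # 12 possible values
        "Season": SEASONS[d.month],      # 4 possible values
        "DayOfWeek": d.strftime("%A"),   # 7 possible values
    }

print(coarse_date_features(date(2024, 7, 15)))
# {'Month': 'July', 'Season': 'Summer', 'DayOfWeek': 'Monday'}
```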
Information gain doesn't penalize attributes for having many values. It measures purity gain without considering how that gain is achieved. Splitting into many tiny subsets artificially inflates information gain.
Solution: Use Gain Ratio (C4.5 algorithm) instead, which normalizes information gain by the "intrinsic information" of the attribute, penalizing attributes with many values. We'll cover this in the next module.
Q: Why does entropy use log base 2?
A: Log base 2 comes from information theory, where information is measured in "bits." One bit represents a choice between two equally likely options. Using log₂ makes the maximum entropy equal to 1.0 for binary classification with balanced classes. However, you can use natural log (ln) or log₁₀; it only changes the scale, not which attribute has the highest gain. What matters is consistency.
Q: What happens if two attributes have exactly the same information gain?
A: It's rare due to floating-point precision, but if it happens, the choice is arbitrary. Different implementations handle this differently: some pick the first attribute encountered, others choose randomly, and some use a secondary criterion like the attribute with fewer values. In practice, tiny differences exist, so exact ties are uncommon.
Q: Can information gain be negative?
A: No, information gain is always ≥ 0. The entropy after splitting (the weighted average of child entropies) can never exceed the parent entropy, due to the properties of entropy. At worst, if a split doesn't help at all, the gain is 0 (the child nodes have the same entropy as the parent). A gain of 0 means the attribute provides no information about the class label.
Q: How does ID3 handle continuous features?
A: The original ID3 doesn't handle continuous features; they must be discretized (binned) into categorical ranges first (e.g., age → "18-25", "26-35", etc.). This is a manual preprocessing step. Later algorithms like C4.5 and CART handle continuous features natively by finding optimal split thresholds (e.g., "age ≤ 30" vs "age > 30").
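A minimal sketch of such manual binning; the bin edges and labels are arbitrary illustrations:

```python
def discretize_age(age: int) -> str:
    """Map a continuous age onto categorical bins for an ID3-style tree."""
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    if age <= 50:
        return "36-50"
    return "51+"

print([discretize_age(a) for a in (22, 30, 47, 63)])
# ['18-25', '26-35', '36-50', '51+']
```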
Q: Is ID3 still used in practice today?
A: Not really. ID3 is primarily of historical and educational importance. Its successor C4.5 (also by Quinlan, 1993) addressed most of ID3's limitations. Modern libraries like scikit-learn use CART (Classification and Regression Trees) by default, which uses the Gini index instead of information gain and handles continuous features. However, understanding ID3 and information gain is essential for learning decision tree fundamentals.
Entropy measures uncertainty/impurity: 0 (pure) to 1 (maximum uncertainty for binary)
Information gain = reduction in entropy after splitting on an attribute
ID3 algorithm selects attribute with maximum information gain at each node
High gain means split creates pure child nodes; low gain means mixed children
Calculate weighted average entropy across all child nodes after split
Major limitation: biased toward attributes with many distinct values
ID3 only handles categorical features; continuous features must be discretized
Modern alternatives: C4.5 (gain ratio), CART (Gini index) address ID3's limitations