
Information Gain & ID3 Algorithm

Master the entropy-based approach to selecting optimal splitting attributes in decision trees with the classic ID3 algorithm

What is Information Gain?

Splitting Criterion

Information Gain measures the reduction in entropy (uncertainty) achieved by partitioning data according to a given attribute. It quantifies how much information about the class label we gain by knowing the value of that attribute. The ID3 (Iterative Dichotomiser 3) algorithm, developed by Ross Quinlan in 1986, uses information gain to build decision trees by selecting the attribute that provides the maximum information gain at each step.

Key Intuition

Imagine you're trying to guess someone's class label and can ask questions about their attributes. Information gain tells you which question reduces your uncertainty the most. A good split creates child nodes that are more "pure" (homogeneous) than the parent node.

Example: If knowing someone's income instantly tells you whether they bought a house (high income → purchased, low income → didn't purchase), income has high information gain. If knowing their favorite color tells you nothing about house purchase, it has zero information gain.

High Information Gain

Attribute creates pure or nearly-pure child nodes. Most samples in each child belong to one class. This attribute effectively separates classes and should be used for splitting.

Low Information Gain

Attribute creates mixed child nodes with similar class distributions. The split doesn't help distinguish between classes. This attribute is less useful for splitting.

Information Entropy

Measuring uncertainty in a dataset

Entropy is a concept from information theory that measures the impurity, disorder, or uncertainty in a dataset. Claude Shannon introduced it in 1948 as a measure of information content. In machine learning, entropy quantifies how mixed the class labels are in a set of samples.

Entropy Formula

For a dataset D with |Y| classes:

Ent(D) = -∑_{k=1}^{|Y|} p_k log₂(p_k)

Where p_k is the proportion of samples belonging to class k.

Note: By convention, 0 log₂ 0 = 0 (if a class doesn't appear, it contributes nothing to entropy)
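
A minimal Python sketch of this formula; the helper name entropy and the use of collections.Counter are illustrative choices, not part of any particular library:

import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum over classes of p_k * log2(p_k), with 0 log2 0 taken as 0."""
    total = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / total
        ent -= p * math.log2(p)   # Counter only yields classes that appear, so p > 0
    return ent

print(entropy(["Yes"] * 20))                 # 0.0 (pure node)
print(entropy(["Yes"] * 10 + ["No"] * 10))   # 1.0 (balanced binary node)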

Entropy Properties & Examples

Minimum: 0

Perfect purity—all samples belong to one class

Example: 20 Yes, 0 No

p_yes = 1.0, p_no = 0.0
Ent = -(1.0 × log₂(1.0)) = 0

Maximum: 1.0

Maximum uncertainty—equal distribution (binary case)

Example: 10 Yes, 10 No

p_yes = 0.5, p_no = 0.5
Ent = -(0.5 × log₂(0.5) + 0.5 × log₂(0.5)) = 1.0

Moderate: 0-1

Some impurity—classes are imbalanced

Example: 15 Yes, 5 No

p_yes = 0.75, p_no = 0.25
Ent ≈ 0.811

Detailed Calculation Example

Dataset: 100 house buyers; 60 purchased (Yes), 40 didn't purchase (No)

Step 1: Calculate proportions

p_yes = 60/100 = 0.6
p_no = 40/100 = 0.4

Step 2: Calculate log values

log₂(0.6) ≈ -0.737
log₂(0.4) ≈ -1.322

Step 3: Apply entropy formula

Ent(D) = -(0.6 × -0.737 + 0.4 × -1.322)
Ent(D) = -(-0.442 - 0.529)
Ent(D) ≈ 0.971

Interpretation: Entropy of 0.971 (close to 1.0) indicates high uncertainty—the dataset is fairly mixed.
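
Reusing the entropy helper sketched earlier, the same value falls out directly:

labels = ["Yes"] * 60 + ["No"] * 40     # the 100 buyers above
print(round(entropy(labels), 3))        # 0.971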

Information Gain Calculation

Information Gain is the reduction in entropy after splitting the dataset on an attribute. It compares the entropy before the split (parent node) with the weighted average entropy after the split (child nodes).

Information Gain Formula

Gain(D, a) = Ent(D) - ∑_{v=1}^{V} (|D_v| / |D|) × Ent(D_v)

Where:

  • D is the dataset (parent node)
  • a is the attribute being evaluated
  • V is the number of possible values for attribute a
  • D_v is the subset of D where attribute a takes value v
  • |D_v| / |D| is the weight (proportion) of samples with value v
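
A matching Python sketch, building on the entropy helper above; representing the split as a dict from attribute value to the class labels in that subset is just one convenient encoding, not a required interface:

def information_gain(parent_labels, splits):
    """Gain(D, a) = Ent(D) - sum_v (|D_v| / |D|) * Ent(D_v).

    splits maps each value v of attribute a to the list of class labels
    of the samples in the subset D_v.
    """
    total = len(parent_labels)
    weighted = sum(len(subset) / total * entropy(subset)
                   for subset in splits.values())
    return entropy(parent_labels) - weighted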

Interpretation

High Gain (Close to 1.0)

The split significantly reduces entropy. Child nodes are much purer than the parent. This attribute is an excellent choice for splitting. Gain of 0.8+ indicates a very strong split.

Low Gain (Close to 0)

The split barely reduces entropy. Child nodes have similar class distributions to the parent. This attribute provides little information. Gain below 0.1 is typically not worth splitting on.

Complete Example: Housing Purchase Decision

A real estate analytics company wants to predict house purchases. Here's their dataset of 200 potential buyers:

ID | Income | Credit    | Debt   | Employment | Location | Purchased
1  | High   | Excellent | Low    | Stable     | Urban    | Yes
2  | High   | Good      | Low    | Stable     | Suburban | Yes
3  | Medium | Good      | Medium | Stable     | Urban    | Yes
4  | Low    | Fair      | High   | Unstable   | Rural    | No
5  | Low    | Poor      | High   | Unstable   | Suburban | No
6  | Medium | Fair      | Medium | Contract   | Urban    | Yes
7  | High   | Excellent | Low    | Stable     | Suburban | Yes
8  | Low    | Fair      | High   | Contract   | Rural    | No
9  | Medium | Good      | Low    | Stable     | Urban    | Yes
10 | Low    | Poor      | High   | Unstable   | Rural    | No

Full dataset: 200 samples total—110 purchased (Yes), 90 didn't purchase (No). Let's calculate information gain for the 'Income' attribute to decide if it's a good split.

Step 1: Calculate Parent Entropy

Root node contains all 200 samples: 110 Yes, 90 No

p_yes = 110/200 = 0.55
p_no = 90/200 = 0.45

Ent(D) = -(0.55 × log₂(0.55) + 0.45 × log₂(0.45))
Ent(D) = -(0.55 × -0.863 + 0.45 × -1.152)
Ent(D) = -(-0.475 - 0.518)
Ent(D) ≈ 0.993

Step 2: Calculate Entropy for Each Income Level

Split data by Income attribute (High, Medium, Low):

Income = High (80 samples)

70 purchased (Yes), 10 didn't (No)

p_yes = 70/80 = 0.875, p_no = 10/80 = 0.125
Ent(D_high) = -(0.875 × log₂(0.875) + 0.125 × log₂(0.125))
Ent(D_high) ≈ 0.544

Income = Medium (70 samples)

35 purchased (Yes), 35 didn't (No)

p_yes = 35/70 = 0.5, p_no = 35/70 = 0.5
Ent(D_medium) = -(0.5 × log₂(0.5) + 0.5 × log₂(0.5))
Ent(D_medium) = 1.0 (maximum uncertainty)

Income = Low (50 samples)

5 purchased (Yes), 45 didn't (No)

p_yes = 5/50 = 0.1, p_no = 45/50 = 0.9
Ent(D_low) = -(0.1 × log₂(0.1) + 0.9 × log₂(0.9))
Ent(D_low) ≈ 0.469

Step 3: Calculate Weighted Average Entropy After Split

Weight each child's entropy by the proportion of samples it contains:

Weighted_Ent = (80/200) × 0.544 + (70/200) × 1.0 + (50/200) × 0.469
Weighted_Ent = 0.4 × 0.544 + 0.35 × 1.0 + 0.25 × 0.469
Weighted_Ent = 0.218 + 0.350 + 0.117
Weighted_Ent = 0.685

Step 4: Calculate Information Gain

Gain(D, Income) = Ent(D) - Weighted_Ent
Gain(D, Income) = 0.993 - 0.685
Gain(D, Income) = 0.308

Interpretation: Information gain of 0.308 is moderate. Splitting on Income reduces entropy by about 31%, making it a reasonable choice. We would calculate gain for other attributes (Credit, Debt, Employment, Location) and choose the one with highest gain.
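
The whole worked example can be checked with the two helpers sketched earlier, using the per-income class counts from Step 2:

parent = ["Yes"] * 110 + ["No"] * 90              # 200 samples at the root
income_splits = {
    "High":   ["Yes"] * 70 + ["No"] * 10,         # Ent ≈ 0.544
    "Medium": ["Yes"] * 35 + ["No"] * 35,         # Ent = 1.0
    "Low":    ["Yes"] * 5  + ["No"] * 45,         # Ent ≈ 0.469
}
print(round(entropy(parent), 3))                          # 0.993
print(round(information_gain(parent, income_splits), 3))  # 0.308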

The ID3 Algorithm

Building decision trees with information gain

ID3 (Iterative Dichotomiser 3), developed by Ross Quinlan in 1986, was one of the first decision tree algorithms. It uses information gain to select the best attribute at each step and recursively builds the tree until all samples are classified or no attributes remain.

ID3 Algorithm Steps

1

Start with root node

Place all training samples in the root node. If all samples belong to the same class, make it a leaf and stop.

2

Calculate information gain for all attributes

For each unused attribute, compute Gain(D, attribute) using the information gain formula. This requires calculating entropy for the current node and all potential child nodes.

3

Select attribute with maximum information gain

Choose a* = argmax_a Gain(D, a). This attribute becomes the splitting criterion for the current node.

4

Create branches for each attribute value

For each possible value v of the selected attribute, create a branch and a child node. Partition samples into subsets D_v based on attribute values.

5

Recursively build subtrees

For each child node with subset D_v, repeat steps 2-5 using the remaining unused attributes. Continue until stopping conditions are met (pure node, no attributes left, or empty subset).

6

Create leaf nodes

When a stopping condition is met, create a leaf node. Assign it the majority class label from the samples reaching that node.
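
A compact recursive sketch of these six steps, reusing the entropy and information_gain helpers from earlier. It assumes each sample is a dict of categorical attribute values plus a class label, and for brevity it only branches on values that actually occur in the current subset (full ID3 would also create majority-class leaves for attribute values that don't appear); names like Node and build_id3 are illustrative.

from collections import Counter

class Node:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label (leaf nodes only)
        self.attribute = attribute  # splitting attribute (internal nodes only)
        self.children = {}          # attribute value -> child Node

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def build_id3(samples, attributes, target):
    labels = [s[target] for s in samples]
    if len(set(labels)) == 1:                      # step 1/6: pure node -> leaf
        return Node(label=labels[0])
    if not attributes:                             # step 6: no attributes left -> majority leaf
        return Node(label=majority_class(labels))

    def gain(attr):                                # step 2: Gain(D, attr)
        splits = {}
        for s in samples:
            splits.setdefault(s[attr], []).append(s[target])
        return information_gain(labels, splits)

    best = max(attributes, key=gain)               # step 3: a* = argmax_a Gain(D, a)
    node = Node(attribute=best)
    remaining = [a for a in attributes if a != best]
    for value in {s[best] for s in samples}:       # steps 4-5: branch and recurse
        subset = [s for s in samples if s[best] == value]
        node.children[value] = build_id3(subset, remaining, target)
    return node

For the housing data, the initial call would pass the full sample list, the attribute names Income, Credit, Debt, Employment, and Location, and "Purchased" as the target.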

Key Characteristics of ID3

  • Greedy approach: Makes locally optimal choices without backtracking
  • Top-down construction: Builds tree from root to leaves
  • Only handles categorical features: Continuous features must be discretized first
  • No pruning: Original ID3 grows complete trees, often leading to overfitting
  • Can't handle missing values: Requires complete data for all attributes

Problem: Bias Toward Multi-Valued Attributes

Critical Limitation

A major limitation of information gain is its inherent bias toward attributes with many distinct values. Attributes that split data into many small subsets will naturally have lower weighted entropy and thus higher information gain—even if they don't actually provide useful information for prediction.

Extreme Example: Customer ID

Suppose we add "Customer ID" (unique for each customer) as an attribute to our housing dataset:

Scenario:

200 samples, each with a unique Customer ID (ID1, ID2, ..., ID200)

What happens if we split on Customer ID:

Creates 200 child nodes, each with exactly 1 sample. Each child is perfectly pure (entropy = 0).

Weighted_Ent = (1/200) × 0 + (1/200) × 0 + ... = 0
Gain(D, CustomerID) = 0.993 - 0 = 0.993 (maximum possible!)

Problem:

Customer ID gets selected as the best attribute, even though it's completely useless for prediction! It perfectly "memorizes" training data but has zero generalization ability. A new customer will never match any existing ID.
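
The effect is easy to reproduce with the helpers sketched earlier: a unique ID drives every child entropy to zero, so the weighted term vanishes and the gain equals the parent entropy.

parent = ["Yes"] * 110 + ["No"] * 90
# Each Customer ID is unique, so every child subset holds exactly one sample
id_splits = {f"ID{i}": [label] for i, label in enumerate(parent, start=1)}
print(round(information_gain(parent, id_splits), 3))   # 0.993, equal to Ent(D)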

More Realistic Examples

Date of Purchase

If recorded as "YYYY-MM-DD", could have hundreds of unique values. High information gain, but likely not useful unless there's a strong temporal pattern.

Better: Extract useful features like "Month" (12 values), "Season" (4 values), or "Day of Week" (7 values).

Zip Code

In the US, ~40,000 zip codes. Very high information gain due to many values, but may overfit to specific locations rather than generalizable patterns.

Better: Use "Region" (Northeast, South, etc.) or "Urban/Suburban/Rural" instead.

Why This Matters

Information gain doesn't penalize attributes for having many values. It measures purity gain without considering how that gain is achieved. Splitting into many tiny subsets artificially inflates information gain.

Solution: Use Gain Ratio (C4.5 algorithm) instead, which normalizes information gain by the "intrinsic information" of the attribute, penalizing attributes with many values. We'll cover this in the next module.

Practical Considerations

When to Use Information Gain / ID3

  • Educational purposes—simplest decision tree algorithm to understand
  • All features are categorical with similar numbers of values
  • Quick prototyping with small, clean datasets
  • No missing values in data
  • Interpretability is more important than performance

Modern Alternatives

  • C4.5: Uses gain ratio, handles continuous & missing values
  • CART: Uses Gini index, binary trees, supports regression
  • Random Forest: Ensemble of trees for better performance
  • XGBoost/LightGBM: Gradient boosting for production use
  • Most libraries (scikit-learn) use CART by default

Frequently Asked Questions

Q: Why do we use log base 2 in entropy calculations?

A: Log base 2 comes from information theory where information is measured in "bits." One bit represents a choice between two equally likely options. Using log₂ makes entropy maximum = 1.0 for binary classification with balanced classes. However, you can use natural log (ln) or log₁₀—it only changes the scale, not which attribute has highest gain. What matters is consistency.
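
A quick check that the base only rescales entropy; the variant entropy_base below is an illustrative tweak of the earlier helper, and the constant factor between bits and nats is ln 2:

import math
from collections import Counter

def entropy_base(labels, log=math.log2):
    # Same entropy definition, but with a pluggable logarithm
    total = len(labels)
    return -sum((c / total) * log(c / total) for c in Counter(labels).values())

labels = ["Yes"] * 60 + ["No"] * 40
print(entropy_base(labels))               # ≈ 0.971 (bits, log base 2)
print(entropy_base(labels, math.log))     # ≈ 0.673 (nats) = 0.971 × ln 2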

Q: What if two attributes have the same information gain?

A: Exact ties are uncommon in practice, since computed gains rarely match exactly, but if one occurs the choice is arbitrary. Different implementations handle this differently—some pick the first attribute encountered, others choose randomly, and some use a secondary criterion such as the attribute with fewer values.

Q: Can information gain be negative?

A: No, information gain is always ≥ 0. The weighted average of child entropies (the conditional entropy given the attribute) can never exceed the parent entropy—a consequence of the concavity of the entropy function. At worst, if a split doesn't help at all, gain = 0 (child nodes have the same class distribution as the parent). A gain of 0 means the attribute provides no information about the class label.

Q: How does ID3 handle continuous features like age or income in dollars?

A: Original ID3 doesn't handle continuous features—they must be discretized (binned) into categorical ranges first (e.g., age → "18-25", "26-35", etc.). This is a manual preprocessing step. Later algorithms like C4.5 and CART can handle continuous features natively by finding optimal split thresholds (e.g., "age ≤ 30" vs "age > 30").
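
A minimal sketch of the kind of manual binning this preprocessing step involves; the bin edges beyond the ranges named in the answer are arbitrary choices for illustration:

def bin_age(age):
    # Map a continuous age to a categorical range ID3 can split on
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    if age <= 50:
        return "36-50"
    return "51+"

print([bin_age(a) for a in (22, 30, 47, 63)])   # ['18-25', '26-35', '36-50', '51+']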

Q: Is ID3 still used in practice today?

A: Not really. ID3 is primarily of historical and educational importance. Its successor C4.5 (also by Quinlan, 1993) addressed most of ID3's limitations. Modern libraries like scikit-learn use CART (Classification and Regression Trees) by default, which uses Gini index instead of information gain and handles continuous features. However, understanding ID3 and information gain is essential for learning decision tree fundamentals.

Key Takeaways

Entropy measures uncertainty/impurity: 0 (pure) to 1 (maximum uncertainty for binary)

Information gain = reduction in entropy after splitting on an attribute

ID3 algorithm selects attribute with maximum information gain at each node

High gain means split creates pure child nodes; low gain means mixed children

Calculate weighted average entropy across all child nodes after split

Major limitation: biased toward attributes with many distinct values

ID3 only handles categorical features; continuous features must be discretized

Modern alternatives: C4.5 (gain ratio), CART (Gini index) address ID3's limitations