Decision Trees

Master decision tree algorithms including ID3, C4.5, CART, and pruning techniques

Entropy & Information Gain

Measuring Impurity

Entropy

Entropy measures the disorder or uncertainty in a dataset:

H(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)

• p_k = proportion of class k in the dataset

• H(D) = 0 when all samples belong to one class (pure)

• H(D) = 1 for binary classification with an equal split (most impure)
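
A minimal Python sketch of this entropy formula computed from raw class counts (the function name and example counts are illustrative):

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0·log(0) is treated as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([10, 0]))           # 0.0  -> pure node
print(entropy([5, 5]))            # 1.0  -> maximally impure binary split
print(round(entropy([9, 5]), 3))  # 0.94
```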

Information Gain

Reduction in entropy after splitting on attribute A:

\text{Gain}(D, A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D^v|}{|D|} H(D^v)

Choose attribute with maximum information gain for splitting

Example: Binary Split

Dataset: 9 positive, 5 negative

H(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940 \text{ bits}

After Split on A

Left: 7 pos, 1 neg; Right: 2 pos, 4 neg

H(D^{\text{left}}) = 0.544, \quad H(D^{\text{right}}) = 0.918

\text{Gain} = 0.940 - \left(\frac{8}{14} \times 0.544 + \frac{6}{14} \times 0.918\right) = 0.236
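
The numbers above can be checked with a short, self-contained script (the entropy helper is redefined inline for illustration):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                         # 0.940
left, right = entropy([7, 1]), entropy([2, 4])   # 0.544, 0.918
gain = parent - (8 / 14 * left + 6 / 14 * right)
print(round(gain, 3))                            # 0.236
```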

Gini Index

Alternative Impurity Measure (CART Algorithm)

Gini index measures the probability of misclassification:

\text{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2

• Gini(D) = 0 when all samples belong to one class

• Gini(D) = 0.5 for binary classification with an equal split

• Lower Gini = purer node

Gini Gain

\text{Gini\_Gain}(D, A) = \text{Gini}(D) - \sum_{v} \frac{|D^v|}{|D|} \text{Gini}(D^v)

Entropy vs Gini

  • Both measure impurity
  • Gini: faster to compute (no logarithm)
  • Entropy: information-theoretic interpretation
  • Results usually similar in practice

Example Calculation

9 positive, 5 negative:

\text{Gini} = 1 - \left[\left(\frac{9}{14}\right)^2 + \left(\frac{5}{14}\right)^2\right] = 1 - (0.413 + 0.128) = 0.459
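
A minimal sketch of the same calculation in Python (the function name is illustrative):

```python
def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))            # 0.0 -> pure node
print(gini([5, 5]))             # 0.5 -> binary equal split
print(round(gini([9, 5]), 3))   # 0.459
```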

Decision Tree Algorithms

ID3

Criterion

Information Gain (Entropy)

Splits

Multi-way (one branch per value)

Limitations

  • Only categorical features
  • Bias toward many-valued attributes
  • No pruning

C4.5

Criterion

Gain Ratio

\frac{\text{Gain}(D,A)}{\text{SplitInfo}(D,A)}

Improvements

  • Handles continuous attributes
  • Handles missing values
  • Post-pruning
  • Fixes multi-value bias

CART

Criterion

Gini Index

Splits

Binary (two branches only)

Features

  • Classification & regression
  • Cost-complexity pruning
  • Handles continuous values

Gain Ratio (C4.5)

Addresses ID3's bias toward attributes with many values:

Split Information

\text{SplitInfo}(D,A) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2\left(\frac{|D^v|}{|D|}\right)

Measures how broadly and uniformly the attribute splits data

Gain Ratio

\text{GainRatio}(D,A) = \frac{\text{Gain}(D,A)}{\text{SplitInfo}(D,A)}

Penalizes attributes that split data into many small subsets

Pruning

Preventing Overfitting

Pruning simplifies the tree by removing branches with little predictive power, improving generalization.

Pre-Pruning (Early Stopping)

Stop tree growth early based on criteria:

  • Max depth: Limit tree depth
  • Min samples split: Don't split if node has < N samples
  • Min gain: Stop if information gain < threshold
  • Max leaves: Limit number of leaf nodes

✓ Fast, no separate validation needed

✗ May stop too early
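
In scikit-learn, these pre-pruning criteria map onto constructor parameters of DecisionTreeClassifier. A minimal sketch on synthetic data (the dataset and parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each keyword argument corresponds to one of the early-stopping criteria above.
clf = DecisionTreeClassifier(
    max_depth=4,                  # limit tree depth
    min_samples_split=20,         # don't split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,   # stop if the impurity reduction is below this threshold
    max_leaf_nodes=16,            # cap the number of leaves
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```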

Post-Pruning

Grow full tree, then prune back:

  • Reduced-error: Remove if validation error doesn't increase
  • Cost-complexity (CART): Balance tree size and error
  • Pessimistic error: Estimate error with confidence intervals

✓ Better results, no premature stopping

✗ Computationally expensive

Cost-Complexity Pruning (CART)

R_\alpha(T) = R(T) + \alpha |T|

where R(T) is the misclassification error, |T| is the number of leaves, and α balances accuracy against tree complexity. Find the subtree that minimizes R_α(T), selecting α using cross-validation.

Handling Special Cases

Continuous Attributes

For continuous attributes, find optimal threshold:

Algorithm:

  1. Sort samples by attribute value
  2. Candidate thresholds: midpoints between consecutive values
  3. For each threshold t: split into A ≤ t and A > t
  4. Choose the threshold t that maximizes information gain

Example:

Age values: {10, 20, 30, 40}
Test thresholds: 15, 25, 35
Choose best split: Age ≤ 25 vs Age > 25
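
A minimal Python sketch of this threshold search using entropy-based information gain (the function names and toy data are illustrative):

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for a numeric feature."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

ages = [10, 20, 30, 40]
churn = ["yes", "yes", "no", "no"]
print(best_threshold(ages, churn))   # (25.0, 1.0): Age <= 25 vs Age > 25
```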

Missing Values

Three strategies for handling missing attribute values:

1. Most Common Value

Assign most frequent value among samples at node

2. Most Common by Class

Use most frequent value within the same class

3. Probability-based Split

Assign fractional samples to each branch based on split proportions (C4.5 approach)
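
A sketch of the C4.5-style fractional approach: samples with a known value determine the branch proportions, and each sample with a missing value is sent down every branch with a correspondingly reduced weight. All names and the toy data below are illustrative:

```python
from collections import defaultdict

def split_with_missing(samples, attr):
    """samples: list of (feature_dict, weight) pairs.
    Returns {attribute value: list of (feature_dict, weight)}, with
    missing-value samples assigned fractionally to every branch."""
    known = [(s, w) for s, w in samples if s.get(attr) is not None]
    missing = [(s, w) for s, w in samples if s.get(attr) is None]
    total_known = sum(w for _, w in known)

    branches = defaultdict(list)
    for s, w in known:
        branches[s[attr]].append((s, w))

    # Branch proportions from the samples whose value is known...
    fractions = {v: sum(w for _, w in members) / total_known
                 for v, members in branches.items()}
    # ...then send each missing-value sample down every branch with reduced weight.
    for value, frac in fractions.items():
        for s, w in missing:
            branches[value].append((s, w * frac))
    return branches

data = [({"Contract": "Annual"}, 1.0),
        ({"Contract": "Annual"}, 1.0),
        ({"Contract": "Month-to-Month"}, 1.0),
        ({"Contract": None}, 1.0)]            # one sample with a missing value
for value, members in split_with_missing(data, "Contract").items():
    print(value, [round(w, 2) for _, w in members])
# Annual [1.0, 1.0, 0.67]
# Month-to-Month [1.0, 0.33]
```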

Example: Customer Churn Prediction

Building a Decision Tree

Dataset (10 customers)

| Contract | Support Calls | Tenure (months) | Churned? |
|----------------|---|----|-----|
| Month-to-Month | 5 | 3  | Yes |
| Annual         | 1 | 24 | No  |
| Month-to-Month | 4 | 2  | Yes |
| Annual         | 2 | 18 | No  |
| Month-to-Month | 3 | 6  | Yes |
| Biennial       | 0 | 36 | No  |
| Annual         | 1 | 12 | No  |
| Month-to-Month | 6 | 1  | Yes |
| Biennial       | 1 | 48 | No  |
| Month-to-Month | 2 | 8  | No  |

Step 1: Calculate Root Entropy

5 churned, 5 didn't churn

H(D) = -\frac{5}{10}\log_2\left(\frac{5}{10}\right) - \frac{5}{10}\log_2\left(\frac{5}{10}\right) = 1.0 \text{ bit}

Step 2: Calculate Information Gain for Contract

• Month-to-Month: 5 customers (4 churn, 1 no) → H = 0.722

• Annual: 3 customers (0 churn, 3 no) → H = 0

• Biennial: 2 customers (0 churn, 2 no) → H = 0

\text{Gain} = 1.0 - \left(\frac{5}{10} \times 0.722 + \frac{3}{10} \times 0 + \frac{2}{10} \times 0\right) = 0.639
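
These steps can be reproduced directly from the table above (a self-contained sketch; the entropy helper is redefined for illustration):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Contract, Churned?) pairs from the 10-customer table above
data = [("Month-to-Month", "Yes"), ("Annual", "No"), ("Month-to-Month", "Yes"),
        ("Annual", "No"), ("Month-to-Month", "Yes"), ("Biennial", "No"),
        ("Annual", "No"), ("Month-to-Month", "Yes"), ("Biennial", "No"),
        ("Month-to-Month", "No")]

root = entropy([5, 5])                           # 1.0 bit
weighted = 0.0
for contract in {"Month-to-Month", "Annual", "Biennial"}:
    labels = [c for ct, c in data if ct == contract]
    counts = [labels.count("Yes"), labels.count("No")]
    weighted += len(labels) / len(data) * entropy(counts)

print(round(root - weighted, 3))                 # 0.639
```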

Resulting Tree

Contract Type

├── Month-to-Month → [Further split on Support Calls]

├── Annual → No Churn

└── Biennial → No Churn

Entropy: Mathematical Foundation

Information Theory and Impurity Measures

Entropy Definition

H(S) = -\sum_{i=1}^{K} p_i \log_2 p_i

where p_i is the proportion of class i in set S. Measures average "surprise" or uncertainty.

Convention: 0 log 0 = 0 (limiting case as p → 0)

Key Properties of Entropy

Property 1: Non-negativity

H(S) \geq 0

Proof: Since 0 ≤ p_i ≤ 1, we have log p_i ≤ 0, so -p_i log p_i ≥ 0. Sum of non-negative terms is non-negative.

Property 2: Maximum Entropy

H(S) \leq \log_2 K

Proof (using Lagrange multipliers):

Maximize H = -Σp_i log p_i subject to Σp_i = 1

Form Lagrangian: L = -Σp_i log p_i - λ(Σp_i - 1)

∂L/∂p_i = -log p_i - 1/ln(2) - λ = 0

This gives p_i = constant = 1/K (uniform distribution)

Maximum entropy: H_max = -Σ(1/K)log(1/K) = log K

Uniform distribution has maximum uncertainty!
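
A quick numeric sanity check of this property (illustrative sketch): randomly generated distributions over K classes never exceed the entropy of the uniform distribution, log_2 K.

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

K = 4
uniform = entropy([1 / K] * K)                 # log2(4) = 2.0 bits
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(K)]
    probs = [r / sum(raw) for r in raw]        # arbitrary distribution over K classes
    assert entropy(probs) <= uniform + 1e-9    # never exceeds log2(K)
print("maximum entropy:", uniform)             # 2.0
```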

Property 3: Concavity

H(\alpha S_1 + (1-\alpha)S_2) \geq \alpha H(S_1) + (1-\alpha)H(S_2)

Entropy is a concave function of the probability distribution. This means mixing distributions increases uncertainty.

Property 4: Additivity

For independent variables X and Y:

H(X,Y) = H(X) + H(Y)

Binary Entropy Function

For binary classification with positive proportion p:

H(p) = -p\log_2 p - (1-p)\log_2(1-p)

Maximum at p = 0.5 (50-50 split), minimum at p = 0 or p = 1 (pure).

H(0.5) = 1 \text{ bit}, \quad H(0) = H(1) = 0 \text{ bits}

Information Gain Detailed Derivation

Measuring Impurity Reduction from Splits

Information Gain Formula

IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset with attribute A = v.

Information Gain = Parent Entropy - Weighted Average of Child Entropies

Worked Example: Complete Calculation

Dataset: Play Tennis (14 examples, 9 Yes, 5 No)

Step 1: Parent Entropy

H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = -0.643(-0.637) - 0.357(-1.485) = 0.940 \text{ bits}

Step 2: Split on "Outlook" (Sunny/Overcast/Rain)

  • Sunny: 5 examples (2 Yes, 3 No)
  • Overcast: 4 examples (4 Yes, 0 No)
  • Rain: 5 examples (3 Yes, 2 No)

Step 3: Calculate Child Entropies

H(Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971 bits

H(Overcast) = 0 bits (pure!)

H(Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971 bits

Step 4: Weighted Average

H_{\text{split}} = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.693 \text{ bits}

Step 5: Information Gain

IG(S, \text{Outlook}) = 0.940 - 0.693 = 0.247 \text{ bits}

This split reduces uncertainty by 0.247 bits! Choose attribute with highest IG.
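
The full calculation can be reproduced in a few lines (a sketch; the per-value class counts come from the split above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                                    # 0.940 bits
outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
weighted = sum(sum(c) / 14 * entropy(c) for c in outlook.values())
print(round(parent - weighted, 3))                          # 0.247
```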

Information Gain Ratio (C4.5 Improvement)

Problem: IG favors attributes with many values (e.g., ID has perfect IG but no generalization).

Solution: Normalize by split information:

\text{GainRatio}(S,A) = \frac{IG(S,A)}{\text{SplitInfo}(S,A)}

\text{SplitInfo}(S,A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

SplitInfo penalizes attributes that create many small partitions.
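
Continuing the Play Tennis example, a short sketch of the gain-ratio computation for Outlook (the 5/4/5 partition and IG = 0.247 come from the section above):

```python
import math

branch_sizes = [5, 4, 5]                     # Sunny / Overcast / Rain partition
n = sum(branch_sizes)
split_info = -sum(s / n * math.log2(s / n) for s in branch_sizes)
gain_ratio = 0.247 / split_info              # IG(S, Outlook) from above
print(round(split_info, 3), round(gain_ratio, 3))   # 1.577 0.157
```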

Gini Index Mathematical Analysis

Alternative Impurity Measure Used in CART

Gini Impurity Definition

\text{Gini}(S) = 1 - \sum_{i=1}^{K} p_i^2

where p_i is the proportion of class i. Measures probability of misclassifying a random sample.

Probabilistic Interpretation:

If we pick a random sample from S and randomly assign it a class according to the distribution in S, Gini(S) is the probability of incorrect classification.

Gini vs Entropy: Mathematical Comparison

Gini Index

\text{Gini} = 1 - \sum p_i^2

  • Range: [0, 1 - 1/K]
  • Binary max: 0.5 at p = 0.5
  • Quadratic penalty
  • Computationally faster (no logarithms)
  • Used in CART; sklearn default

Entropy

H = -\sum p_i \log_2 p_i

  • Range: [0, log_2 K]
  • Binary max: 1 at p = 0.5
  • Logarithmic penalty
  • More expensive computation
  • Used in ID3, C4.5

Taylor Expansion Shows Similarity

For binary case with p_1 = p, p_2 = 1-p:

\text{Gini}(p) = 2p(1-p)

H(p) = -p\log_2 p - (1-p)\log_2(1-p)

Taylor expansion of H around p=0.5:

H(p) \approx 1 - \frac{(2p-1)^2}{2\ln 2} = 1 - 2.885(p-0.5)^2

Both are quadratic near p=0.5, explaining similar behavior in practice!
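
A small numeric comparison around p = 0.5 shows how closely the two measures and the quadratic approximation track each other (illustrative sketch; the values are computed by the script):

```python
import math

def gini(p):
    return 2 * p * (1 - p)

def H(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.3, 0.4, 0.5, 0.6, 0.7):
    taylor = 1 - (2 * p - 1) ** 2 / (2 * math.log(2))   # quadratic approximation of H
    print(p, round(gini(p), 3), round(H(p), 3), round(taylor, 3))
# p    Gini   H      approx
# 0.3  0.42   0.881  0.885
# 0.4  0.48   0.971  0.971
# 0.5  0.5    1.0    1.0
# 0.6  0.48   0.971  0.971
# 0.7  0.42   0.881  0.885
```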

Gini Gain for Splits

\Delta\text{Gini}(S, A) = \text{Gini}(S) - \sum_{v} \frac{|S_v|}{|S|} \text{Gini}(S_v)

Choose split that maximizes Gini gain (minimizes weighted child impurity).

Computational Complexity Analysis

Time Complexity of Decision Tree Construction

Splitting a Node: Categorical Features

Problem: Find best split among m features for n samples.

For each feature:

  • Scan all n samples: O(n)
  • Count classes in each partition: O(n)
  • Calculate impurity: O(K), where K = number of classes

Total per feature: O(n × K)

For m features:

T_{\text{categorical}} = O(m \times n \times K)

Splitting: Continuous Features (Sorted)

Better approach: Sort once, then scan for best threshold.

Algorithm:

  1. Sort samples by feature value: O(n log n)
  2. Initialize counts for left (empty) and right (all samples)
  3. Scan sorted list, moving samples left to right:
    • Update counts: O(1) per sample
    • Calculate impurity: O(K)
    • n possible thresholds: O(n × K)

Per feature: O(n log n + nK)

For m features:

T_{\text{continuous}} = O(m \times n \log n)

(assuming K < log n, which is typical)

Full Tree Construction

Worst case: Completely imbalanced tree (one sample per leaf)

• Tree depth: O(n)

• At depth d: process (n-d) samples

• Total: Σ_{d=0}^{n} m × (n-d) log(n-d)

T_{\text{worst}} = O(m \times n^2 \log n)

Best case: Balanced tree

• Tree depth: O(log n)

• At each level: process all n samples

• log n levels × (mn log n) per level

T_{\text{best}} = O(m \times n \log^2 n)

This is the expected case for reasonable data!

Prediction Complexity

For a single sample:

  • Start at root, follow path to leaf
  • Each node: O(1) comparison
  • Tree depth: d

T_{\text{predict}} = O(\text{depth}) = O(\log n) \text{ average}, \; O(n) \text{ worst}

Very fast prediction, especially compared to O(n) for k-NN!

Pruning Algorithms & Theory

Cost-Complexity Pruning (CART Method)

Cost-Complexity Criterion

For a subtree T, define the cost-complexity measure:

C_\alpha(T) = R(T) + \alpha|T|

• R(T) = misclassification rate (or sum of impurities) of tree T

• |T| = number of leaf nodes in T

• α ≥ 0 = complexity parameter (regularization)

Goal: Find subtree T_α that minimizes C_α(T) for given α.

Weakest Link Pruning Algorithm

  1. Grow full tree T_0 (no pruning)
  2. For each internal node t with leaves L_t:

    Compute effective α for pruning t:

    \alpha_t = \frac{R(t) - R(T_t)}{|T_t| - 1}

    where T_t is subtree rooted at t

  3. Find node t* with smallest α_t:

    \alpha^* = \min_t \alpha_t
  4. Prune subtree T_t* (replace with leaf)
  5. Repeat steps 2-4 until only root remains
  6. Result: Sequence T_0 ⊃ T_1 ⊃ ... ⊃ [root]

Optimal Subtree Theorem:

For each α, the tree T_α in this sequence minimizes C_α among all possible subtrees.
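
scikit-learn's cost_complexity_pruning_path computes this weakest-link sequence, returning the effective α values at which successive subtrees are pruned away. A minimal sketch (the dataset and the choice of α are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)   # weakest-link pruning sequence
print(path.ccp_alphas[:5])                      # increasing effective alphas (first is 0)
print(path.impurities[:5])                      # total leaf impurity of each subtree

# Refitting with a chosen alpha yields the corresponding pruned subtree.
mid_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=mid_alpha).fit(X, y)
print(pruned.get_n_leaves())
```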

Cross-Validation for α Selection

  1. Split data into K folds
  2. For each fold k:
    • Train full tree on other K-1 folds
    • Generate pruning sequence with α values
    • Evaluate each T_α on validation fold k
  3. Average errors across folds for each α
  4. Select α that minimizes average validation error
  5. Apply selected α to full training set
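
A sketch of this selection loop using scikit-learn's ccp_alpha parameter and cross_val_score (the dataset and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the weakest-link pruning path of the full tree.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# 5-fold cross-validated accuracy for each alpha; keep the best and refit on all data.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
final = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, final.get_n_leaves())
```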

Reduced Error Pruning (Simpler Alternative)

Requires separate validation set:

  1. Grow full tree on training set
  2. For each internal node:
    • Temporarily prune (make it a leaf)
    • Measure error on validation set
    • If error doesn't increase, keep pruned
  3. Repeat until no beneficial pruning remains

Pro: Simpler, guaranteed not to increase validation error.
Con: Requires separate validation set (less data for training).

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is entropy in the context of decision trees?
2. For a binary classification with 8 positive and 2 negative samples, what is the entropy?
3. Information gain is calculated as:
4. What is the Gini index for a dataset with equal distribution of 3 classes?
5. Why does ID3 have a bias toward attributes with many values?
6. C4.5 uses gain ratio instead of information gain. Gain ratio is:
7. What is the main purpose of pruning in decision trees?
8. For a continuous attribute like 'Age', how do decision trees typically handle splits?
9. How does CART (Classification and Regression Trees) differ from ID3?
10. What is reduced-error pruning?