Decision Trees

Master decision tree algorithms including ID3, C4.5, CART, and pruning techniques

Entropy & Information Gain

Measuring Impurity

Entropy

Entropy measures the disorder or uncertainty in a dataset:

H(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)

• p_k = proportion of class k in the dataset

• H(D) = 0 when all samples belong to one class (pure)

• H(D) = 1 for binary classification with an equal split (most impure)
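
A minimal Python sketch of this entropy formula computed from raw class counts (the function name and example counts are illustrative):

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0·log(0) is treated as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([10, 0]))           # 0.0  -> pure node
print(entropy([5, 5]))            # 1.0  -> maximally impure binary split
print(round(entropy([9, 5]), 3))  # 0.94
```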

Information Gain

Reduction in entropy after splitting on attribute A:

\text{Gain}(D, A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D^v|}{|D|} H(D^v)

Choose attribute with maximum information gain for splitting

Example: Binary Split

Dataset: 9 positive, 5 negative

H(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940 \text{ bits}

After Split on A

Left: 7 pos, 1 neg; Right: 2 pos, 4 neg

H(D^{\text{left}}) = 0.544, \quad H(D^{\text{right}}) = 0.918

\text{Gain} = 0.940 - \left(\frac{8}{14} \times 0.544 + \frac{6}{14} \times 0.918\right) = 0.236
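
The numbers above can be checked with a short, self-contained script (the entropy helper is redefined inline for illustration):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                         # 0.940
left, right = entropy([7, 1]), entropy([2, 4])   # 0.544, 0.918
gain = parent - (8 / 14 * left + 6 / 14 * right)
print(round(gain, 3))                            # 0.236
```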

Gini Index

Alternative Impurity Measure (CART Algorithm)

Gini index measures the probability of misclassification:

\text{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2

• Gini(D) = 0 when all samples belong to one class

• Gini(D) = 0.5 for binary classification with an equal split

• Lower Gini = purer node

Gini Gain

\text{Gini\_Gain}(D, A) = \text{Gini}(D) - \sum_{v} \frac{|D^v|}{|D|} \text{Gini}(D^v)

Entropy vs Gini

  • Both measure impurity
  • Gini: faster to compute (no logarithm)
  • Entropy: information-theoretic interpretation
  • Results usually similar in practice

Example Calculation

9 positive, 5 negative:

\text{Gini} = 1 - \left[\left(\frac{9}{14}\right)^2 + \left(\frac{5}{14}\right)^2\right] = 1 - (0.413 + 0.128) = 0.459
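
A minimal sketch of the same calculation in Python (the function name is illustrative):

```python
def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))            # 0.0 -> pure node
print(gini([5, 5]))             # 0.5 -> binary equal split
print(round(gini([9, 5]), 3))   # 0.459
```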

Decision Tree Algorithms

ID3

Criterion

Information Gain (Entropy)

Splits

Multi-way (one branch per value)

Limitations

  • Only categorical features
  • Bias toward many-valued attributes
  • No pruning

C4.5

Criterion

Gain Ratio

\frac{\text{Gain}(D,A)}{\text{SplitInfo}(D,A)}

Improvements

  • Handles continuous attributes
  • Handles missing values
  • Post-pruning
  • Fixes multi-value bias

CART

Criterion

Gini Index

Splits

Binary (two branches only)

Features

  • Classification & regression
  • Cost-complexity pruning
  • Handles continuous values

Gain Ratio (C4.5)

Addresses ID3's bias toward attributes with many values:

Split Information

\text{SplitInfo}(D,A) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2\left(\frac{|D^v|}{|D|}\right)

Measures how broadly and uniformly the attribute splits data

Gain Ratio

\text{GainRatio}(D,A) = \frac{\text{Gain}(D,A)}{\text{SplitInfo}(D,A)}

Penalizes attributes that split data into many small subsets

Pruning

Preventing Overfitting

Pruning simplifies the tree by removing branches with little predictive power, improving generalization.

Pre-Pruning (Early Stopping)

Stop tree growth early based on criteria:

  • Max depth: Limit tree depth
  • Min samples split: Don't split if node has < N samples
  • Min gain: Stop if information gain < threshold
  • Max leaves: Limit number of leaf nodes

✓ Fast, no separate validation needed

✗ May stop too early
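
In scikit-learn, these pre-pruning criteria map onto constructor parameters of DecisionTreeClassifier. A minimal sketch on synthetic data (the dataset and parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each keyword argument corresponds to one of the early-stopping criteria above.
clf = DecisionTreeClassifier(
    max_depth=4,                  # limit tree depth
    min_samples_split=20,         # don't split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,   # stop if the impurity reduction is below this threshold
    max_leaf_nodes=16,            # cap the number of leaves
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```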

Post-Pruning

Grow full tree, then prune back:

  • Reduced-error: Remove if validation error doesn't increase
  • Cost-complexity (CART): Balance tree size and error
  • Pessimistic error: Estimate error with confidence intervals

✓ Better results, no premature stopping

✗ Computationally expensive

Cost-Complexity Pruning (CART)

R_\alpha(T) = R(T) + \alpha |T|

where R(T) is the misclassification error, |T| is the number of leaves, and α balances accuracy against tree complexity. Find the subtree that minimizes R_α(T), selecting α using cross-validation.

Handling Special Cases

Continuous Attributes

For continuous attributes, find optimal threshold:

Algorithm:

  1. Sort samples by attribute value
  2. Candidate thresholds: midpoints between consecutive values
  3. For each threshold t: split into A ≤ t and A > t
  4. Choose the threshold t that maximizes information gain

Example:

Age values: {10, 20, 30, 40}
Test thresholds: 15, 25, 35
Choose best split: Age ≤ 25 vs Age > 25
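
A minimal Python sketch of this threshold search using entropy-based information gain (the function names and toy data are illustrative):

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for a numeric feature."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

ages = [10, 20, 30, 40]
churn = ["yes", "yes", "no", "no"]
print(best_threshold(ages, churn))   # (25.0, 1.0): Age <= 25 vs Age > 25
```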

Missing Values

Three strategies for handling missing attribute values:

1. Most Common Value

Assign most frequent value among samples at node

2. Most Common by Class

Use most frequent value within the same class

3. Probability-based Split

Assign fractional samples to each branch based on split proportions (C4.5 approach)
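
A sketch of the C4.5-style fractional approach: samples with a known value determine the branch proportions, and each sample with a missing value is sent down every branch with a correspondingly reduced weight. All names and the toy data below are illustrative:

```python
from collections import defaultdict

def split_with_missing(samples, attr):
    """samples: list of (feature_dict, weight) pairs.
    Returns {attribute value: list of (feature_dict, weight)}, with
    missing-value samples assigned fractionally to every branch."""
    known = [(s, w) for s, w in samples if s.get(attr) is not None]
    missing = [(s, w) for s, w in samples if s.get(attr) is None]
    total_known = sum(w for _, w in known)

    branches = defaultdict(list)
    for s, w in known:
        branches[s[attr]].append((s, w))

    # Branch proportions from the samples whose value is known...
    fractions = {v: sum(w for _, w in members) / total_known
                 for v, members in branches.items()}
    # ...then send each missing-value sample down every branch with reduced weight.
    for value, frac in fractions.items():
        for s, w in missing:
            branches[value].append((s, w * frac))
    return branches

data = [({"Contract": "Annual"}, 1.0),
        ({"Contract": "Annual"}, 1.0),
        ({"Contract": "Month-to-Month"}, 1.0),
        ({"Contract": None}, 1.0)]            # one sample with a missing value
for value, members in split_with_missing(data, "Contract").items():
    print(value, [round(w, 2) for _, w in members])
# Annual [1.0, 1.0, 0.67]
# Month-to-Month [1.0, 0.33]
```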

Example: Customer Churn Prediction

Building a Decision Tree

Dataset (10 customers)

| Contract | Support Calls | Tenure (months) | Churned? |
|----------------|---|----|-----|
| Month-to-Month | 5 | 3  | Yes |
| Annual         | 1 | 24 | No  |
| Month-to-Month | 4 | 2  | Yes |
| Annual         | 2 | 18 | No  |
| Month-to-Month | 3 | 6  | Yes |
| Biennial       | 0 | 36 | No  |
| Annual         | 1 | 12 | No  |
| Month-to-Month | 6 | 1  | Yes |
| Biennial       | 1 | 48 | No  |
| Month-to-Month | 2 | 8  | No  |

Step 1: Calculate Root Entropy

5 churned, 5 didn't churn

H(D) = -\frac{5}{10}\log_2\left(\frac{5}{10}\right) - \frac{5}{10}\log_2\left(\frac{5}{10}\right) = 1.0 \text{ bit}

Step 2: Calculate Information Gain for Contract

• Month-to-Month: 5 customers (4 churn, 1 no) → H = 0.722

• Annual: 3 customers (0 churn, 3 no) → H = 0

• Biennial: 2 customers (0 churn, 2 no) → H = 0

\text{Gain} = 1.0 - \left(\frac{5}{10} \times 0.722 + \frac{3}{10} \times 0 + \frac{2}{10} \times 0\right) = 0.639
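
These steps can be reproduced directly from the table above (a self-contained sketch; the entropy helper is redefined for illustration):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Contract, Churned?) pairs from the 10-customer table above
data = [("Month-to-Month", "Yes"), ("Annual", "No"), ("Month-to-Month", "Yes"),
        ("Annual", "No"), ("Month-to-Month", "Yes"), ("Biennial", "No"),
        ("Annual", "No"), ("Month-to-Month", "Yes"), ("Biennial", "No"),
        ("Month-to-Month", "No")]

root = entropy([5, 5])                           # 1.0 bit
weighted = 0.0
for contract in {"Month-to-Month", "Annual", "Biennial"}:
    labels = [c for ct, c in data if ct == contract]
    counts = [labels.count("Yes"), labels.count("No")]
    weighted += len(labels) / len(data) * entropy(counts)

print(round(root - weighted, 3))                 # 0.639
```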

Resulting Tree

Contract Type

├── Month-to-Month → [Further split on Support Calls]

├── Annual → No Churn

└── Biennial → No Churn

Entropy: Mathematical Foundation

Information Theory and Impurity Measures

Entropy Definition

H(S) = -\sum_{i=1}^{K} p_i \log_2 p_i

where p_i is the proportion of class i in set S. Measures average "surprise" or uncertainty.

Convention: 0 log 0 = 0 (limiting case as p → 0)

Key Properties of Entropy

Property 1: Non-negativity

H(S) \geq 0

Proof: Since 0 ≤ p_i ≤ 1, we have log p_i ≤ 0, so -p_i log p_i ≥ 0. Sum of non-negative terms is non-negative.

Property 2: Maximum Entropy

H(S) \leq \log_2 K

Proof (using Lagrange multipliers):

Maximize H = -Σp_i log p_i subject to Σp_i = 1

Form Lagrangian: L = -Σp_i log p_i - λ(Σp_i - 1)

∂L/∂p_i = -log p_i - 1/ln(2) - λ = 0

This gives p_i = constant = 1/K (uniform distribution)

Maximum entropy: H_max = -Σ(1/K)log(1/K) = log K

Uniform distribution has maximum uncertainty!
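
A quick numeric sanity check of this property (illustrative sketch): randomly generated distributions over K classes never exceed the entropy of the uniform distribution, log_2 K.

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

K = 4
uniform = entropy([1 / K] * K)                 # log2(4) = 2.0 bits
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(K)]
    probs = [r / sum(raw) for r in raw]        # arbitrary distribution over K classes
    assert entropy(probs) <= uniform + 1e-9    # never exceeds log2(K)
print("maximum entropy:", uniform)             # 2.0
```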

Property 3: Concavity

H(\alpha S_1 + (1-\alpha)S_2) \geq \alpha H(S_1) + (1-\alpha)H(S_2)

Entropy is a concave function of the probability distribution. This means mixing distributions increases uncertainty.

Property 4: Additivity

For independent variables X and Y:

H(X,Y) = H(X) + H(Y)

Binary Entropy Function

For binary classification with positive proportion p:

H(p) = -p\log_2 p - (1-p)\log_2(1-p)

Maximum at p = 0.5 (50-50 split), minimum at p = 0 or p = 1 (pure).

H(0.5) = 1 \text{ bit}, \quad H(0) = H(1) = 0 \text{ bits}

Information Gain Detailed Derivation

Measuring Impurity Reduction from Splits

Information Gain Formula

IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset with attribute A = v.

Information Gain = Parent Entropy - Weighted Average of Child Entropies

Worked Example: Complete Calculation

Dataset: Play Tennis (14 examples, 9 Yes, 5 No)

Step 1: Parent Entropy

H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = -0.643(-0.637) - 0.357(-1.485) = 0.940 \text{ bits}

Step 2: Split on "Outlook" (Sunny/Overcast/Rain)

  • Sunny: 5 examples (2 Yes, 3 No)
  • Overcast: 4 examples (4 Yes, 0 No)
  • Rain: 5 examples (3 Yes, 2 No)

Step 3: Calculate Child Entropies

H(Sunny) = -(2/5)log(2/5) - (3/5)log(3/5) = 0.971 bits

H(Overcast) = 0 bits (pure!)

H(Rain) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971 bits

Step 4: Weighted Average

H_{\text{split}} = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.693 \text{ bits}

Step 5: Information Gain

IG(S, \text{Outlook}) = 0.940 - 0.693 = 0.247 \text{ bits}

This split reduces uncertainty by 0.247 bits! Choose attribute with highest IG.
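
The full calculation can be reproduced in a few lines (a sketch; the per-value class counts come from the split above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                                    # 0.940 bits
outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
weighted = sum(sum(c) / 14 * entropy(c) for c in outlook.values())
print(round(parent - weighted, 3))                          # 0.247
```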

Information Gain Ratio (C4.5 Improvement)

Problem: IG favors attributes with many values (e.g., ID has perfect IG but no generalization).

Solution: Normalize by split information:

\text{GainRatio}(S,A) = \frac{IG(S,A)}{\text{SplitInfo}(S,A)}

\text{SplitInfo}(S,A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

SplitInfo penalizes attributes that create many small partitions.
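
Continuing the Play Tennis example, a short sketch of the gain-ratio computation for Outlook (the 5/4/5 partition and IG = 0.247 come from the section above):

```python
import math

branch_sizes = [5, 4, 5]                     # Sunny / Overcast / Rain partition
n = sum(branch_sizes)
split_info = -sum(s / n * math.log2(s / n) for s in branch_sizes)
gain_ratio = 0.247 / split_info              # IG(S, Outlook) from above
print(round(split_info, 3), round(gain_ratio, 3))   # 1.577 0.157
```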

Gini Index Mathematical Analysis

Alternative Impurity Measure Used in CART

Gini Impurity Definition

\text{Gini}(S) = 1 - \sum_{i=1}^{K} p_i^2

where p_i is the proportion of class i. Measures probability of misclassifying a random sample.

Probabilistic Interpretation:

If we pick a random sample from S and randomly assign it a class according to the distribution in S, Gini(S) is the probability of incorrect classification.

Gini vs Entropy: Mathematical Comparison

Gini Index

\text{Gini} = 1 - \sum p_i^2

  • Range: [0, 1 - 1/K]
  • Binary max: 0.5 at p = 0.5
  • Quadratic penalty
  • Computationally faster (no logarithms)
  • Used in CART; sklearn default

Entropy

H = -\sum p_i \log_2 p_i

  • Range: [0, log_2 K]
  • Binary max: 1 at p = 0.5
  • Logarithmic penalty
  • More expensive computation
  • Used in ID3, C4.5

Taylor Expansion Shows Similarity

For binary case with p_1 = p, p_2 = 1-p:

\text{Gini}(p) = 2p(1-p)

H(p) = -p\log_2 p - (1-p)\log_2(1-p)

Taylor expansion of H around p=0.5:

H(p) \approx 1 - \frac{(2p-1)^2}{2\ln 2} = 1 - 2.885(p-0.5)^2

Both are quadratic near p=0.5, explaining similar behavior in practice!
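
A small numeric comparison around p = 0.5 shows how closely the two measures and the quadratic approximation track each other (illustrative sketch; the values are computed by the script):

```python
import math

def gini(p):
    return 2 * p * (1 - p)

def H(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.3, 0.4, 0.5, 0.6, 0.7):
    taylor = 1 - (2 * p - 1) ** 2 / (2 * math.log(2))   # quadratic approximation of H
    print(p, round(gini(p), 3), round(H(p), 3), round(taylor, 3))
# p    Gini   H      approx
# 0.3  0.42   0.881  0.885
# 0.4  0.48   0.971  0.971
# 0.5  0.5    1.0    1.0
# 0.6  0.48   0.971  0.971
# 0.7  0.42   0.881  0.885
```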

Gini Gain for Splits

\Delta\text{Gini}(S, A) = \text{Gini}(S) - \sum_{v} \frac{|S_v|}{|S|} \text{Gini}(S_v)

Choose split that maximizes Gini gain (minimizes weighted child impurity).

Computational Complexity Analysis

Time Complexity of Decision Tree Construction

Splitting a Node: Categorical Features

Problem: Find best split among m features for n samples.

For each feature:

  • Scan all n samples: O(n)
  • Count classes in each partition: O(n)
  • Calculate impurity: O(K), where K = number of classes

Total per feature: O(n × K)

For m features:

T_{\text{categorical}} = O(m \times n \times K)

Splitting: Continuous Features (Sorted)

Better approach: Sort once, then scan for best threshold.

Algorithm:

  1. Sort samples by feature value: O(n log n)
  2. Initialize counts for left (empty) and right (all samples)
  3. Scan sorted list, moving samples left to right:
    • Update counts: O(1) per sample
    • Calculate impurity: O(K)
    • n possible thresholds: O(n × K)

Per feature: O(n log n + nK)

For m features:

T_{\text{continuous}} = O(m \times n \log n)

(assuming K < log n, which is typical)

Full Tree Construction

Worst case: Completely imbalanced tree (one sample per leaf)

• Tree depth: O(n)

• At depth d: process (n-d) samples

• Total: Σ_{d=0}^{n} m × (n-d) log(n-d)

T_{\text{worst}} = O(m \times n^2 \log n)

Best case: Balanced tree

• Tree depth: O(log n)

• At each level: process all n samples

• log n levels × (mn log n) per level

T_{\text{best}} = O(m \times n \log^2 n)

This is the expected case for reasonable data!

Prediction Complexity

For a single sample:

  • Start at root, follow path to leaf
  • Each node: O(1) comparison
  • Tree depth: d

T_{\text{predict}} = O(\text{depth}) = O(\log n) \text{ average}, \; O(n) \text{ worst}

Very fast prediction, especially compared to O(n) for k-NN!

Pruning Algorithms & Theory

Cost-Complexity Pruning (CART Method)

Cost-Complexity Criterion

For a subtree T, define the cost-complexity measure:

C_\alpha(T) = R(T) + \alpha|T|

• R(T) = misclassification rate (or sum of impurities) of tree T

• |T| = number of leaf nodes in T

• α ≥ 0 = complexity parameter (regularization)

Goal: Find subtree T_α that minimizes C_α(T) for given α.

Weakest Link Pruning Algorithm

  1. Grow full tree T_0 (no pruning)
  2. For each internal node t with leaves L_t:

    Compute effective α for pruning t:

    \alpha_t = \frac{R(t) - R(T_t)}{|T_t| - 1}

    where T_t is subtree rooted at t

  3. Find node t* with smallest α_t:

    \alpha^* = \min_t \alpha_t
  4. Prune subtree T_t* (replace with leaf)
  5. Repeat steps 2-4 until only root remains
  6. Result: Sequence T_0 ⊃ T_1 ⊃ ... ⊃ [root]

Optimal Subtree Theorem:

For each α, the tree T_α in this sequence minimizes C_α among all possible subtrees.
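
scikit-learn's cost_complexity_pruning_path computes this weakest-link sequence, returning the effective α values at which successive subtrees are pruned away. A minimal sketch (the dataset and the choice of α are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)   # weakest-link pruning sequence
print(path.ccp_alphas[:5])                      # increasing effective alphas (first is 0)
print(path.impurities[:5])                      # total leaf impurity of each subtree

# Refitting with a chosen alpha yields the corresponding pruned subtree.
mid_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=mid_alpha).fit(X, y)
print(pruned.get_n_leaves())
```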

Cross-Validation for α Selection

  1. Split data into K folds
  2. For each fold k:
    • Train full tree on other K-1 folds
    • Generate pruning sequence with α values
    • Evaluate each T_α on validation fold k
  3. Average errors across folds for each α
  4. Select α that minimizes average validation error
  5. Apply selected α to full training set
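
A sketch of this selection loop using scikit-learn's ccp_alpha parameter and cross_val_score (the dataset and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the weakest-link pruning path of the full tree.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# 5-fold cross-validated accuracy for each alpha; keep the best and refit on all data.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
final = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, final.get_n_leaves())
```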

Reduced Error Pruning (Simpler Alternative)

Requires separate validation set:

  1. Grow full tree on training set
  2. For each internal node:
    • Temporarily prune (make it a leaf)
    • Measure error on validation set
    • If error doesn't increase, keep pruned
  3. Repeat until no beneficial pruning remains

Pro: Simpler, guaranteed not to increase validation error.
Con: Requires separate validation set (less data for training).

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is entropy in the context of decision trees?
2. For a binary classification with 8 positive and 2 negative samples, what is the entropy?
3. Information gain is calculated as:
4. What is the Gini index for a dataset with equal distribution of 3 classes?
5. Why does ID3 have a bias toward attributes with many values?
6. C4.5 uses gain ratio instead of information gain. Gain ratio is:
7. What is the main purpose of pruning in decision trees?
8. For a continuous attribute like 'Age', how do decision trees typically handle splits?
9. How does CART (Classification and Regression Trees) differ from ID3?
10. What is reduced-error pruning?