Decision trees might be the most intuitive algorithm in machine learning. They work much the way you make decisions: through a series of questions that progressively narrow down the answer.
Imagine you're a bank loan officer. Faced with a loan application, you might ask:
- Does the applicant have a stable job?
- What's their credit score?
- Is their debt-to-income ratio too high?
Each question divides applicants into different groups. Eventually, you can give a clear judgment for each group: approve or reject.
This is the essence of decision trees—using a series of questions to "split" data into increasingly pure chunks.
But there's a key question: which question should we ask first?
Ask the right one, and the data becomes clear immediately. Ask the wrong one, and things remain a mess.
Entropy: The First "Purity Meter"
In a previous article, we introduced Entropy, which uses a logarithmic formula to measure how "chaotic" your data is:

Entropy = -Σ (p_k) log₂(p_k)

where p_k is the proportion of samples in class k. Higher entropy means more mixed data; lower entropy means purer data. The ID3 and C4.5 decision tree algorithms use entropy to select the best splitting feature.
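As a quick refresher, here is a minimal sketch of that calculation in plain Python (the `entropy` helper and the toy label lists are illustrative, not part of the original example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A 50/50 mix is maximally "chaotic"; a single-class group is perfectly pure.
print(entropy(["Yes", "Yes", "No", "No"]))  # 1.0
print(entropy(["No", "No", "No", "No"]))    # -0.0 (effectively 0)
```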
But entropy has a small drawback: logarithm operations add computational overhead.
Is there a simpler alternative?
Gini Index: Another Purity Measure
The Gini Index is the purity metric used by CART decision trees. The idea behind it is more intuitive:

If you randomly pick two samples (with replacement) from a group, what's the probability that they belong to different classes?

A higher probability means the group is more impure; a lower probability means it is purer.
The formula:

Gini = 1 - Σ (p_k)²

- p_k: proportion of samples in class k
- Σ (p_k)²: the probability that two random picks land in the same class
- 1 - Σ (p_k)²: the probability that the two picks belong to different classes
Gini Index range (for binary classification):

- Gini = 0: perfectly pure (100% one class)
- Gini = 0.5: maximum impurity (50/50 split)
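The same idea in code, as a minimal sketch (the `gini` helper and the toy labels are illustrative only):

```python
from collections import Counter

def gini(labels):
    """Gini Index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes"] * 10))              # 0.0 -> perfectly pure
print(gini(["Yes"] * 5 + ["No"] * 5))  # 0.5 -> maximum impurity for two classes
```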
Compared to entropy, Gini has no logarithm operations and computes faster—which is why Scikit-learn uses it by default.
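If you have scikit-learn installed, you can confirm the default directly:

```python
from sklearn.tree import DecisionTreeClassifier

# The criterion parameter defaults to "gini"; "entropy" is also available.
print(DecisionTreeClassifier().criterion)  # gini
```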
Complete Example: Building a Decision Tree with Gini
Let's walk through a complete dataset and demonstrate how CART uses Gini to select splitting features.
Scenario: Bank loan approval. The goal is to predict whether an applicant will default.
Dataset (20 records)
| ID | Age | Home Owner | Credit | Income | Default? |
|---|---|---|---|---|---|
| 1 | Young | No | Fair | Low | Yes |
| 2 | Young | No | Fair | Medium | Yes |
| 3 | Young | Yes | Fair | Low | No |
| 4 | Young | Yes | Good | Low | No |
| 5 | Young | No | Good | Medium | No |
| 6 | Middle | No | Fair | Low | Yes |
| 7 | Middle | No | Fair | Medium | No |
| 8 | Middle | Yes | Fair | Medium | No |
| 9 | Middle | Yes | Good | High | No |
| 10 | Middle | Yes | Good | Medium | No |
| 11 | Senior | Yes | Good | High | No |
| 12 | Senior | No | Good | Medium | Yes |
| 13 | Senior | No | Fair | High | Yes |
| 14 | Senior | No | Fair | Medium | Yes |
| 15 | Senior | Yes | Fair | Low | Yes |
| 16 | Senior | No | Good | High | No |
| 17 | Senior | No | Good | Medium | No |
| 18 | Young | No | Fair | High | Yes |
| 19 | Middle | No | Good | High | No |
| 20 | Young | Yes | Good | Medium | No |
Summary: 20 records, 8 defaults (Yes), 12 non-defaults (No)
Step 1: Calculate Initial Gini Index
Before any split, what's the impurity of the entire dataset?
Initial Gini = 1 - (8/20)² - (12/20)² = 1 - 0.16 - 0.36 = 0.48 (moderate impurity)
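The same number falls out of a two-line computation using the 8/20 and 12/20 proportions from the summary above:

```python
# Initial impurity of the full dataset: 8 "Yes" and 12 "No" out of 20.
p_yes, p_no = 8 / 20, 12 / 20
print(round(1 - p_yes**2 - p_no**2, 2))  # 0.48
```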
Step 2: Evaluate Each Feature
We'll test all 4 features to find which produces the lowest weighted Gini Index after splitting.
Feature 1: Age
| Age | Count | Yes | No | Gini |
|---|---|---|---|---|
| Young | 7 | 3 | 4 | 0.49 |
| Middle | 6 | 1 | 5 | 0.28 |
| Senior | 7 | 4 | 3 | 0.49 |
Weighted average: (7/20) × 0.49 + (6/20) × 0.28 + (7/20) × 0.49 ≈ 0.43
Feature 2: Home Owner
| Home Owner | Count | Yes | No | Gini |
|---|---|---|---|---|
| Yes | 8 | 1 | 7 | 0.22 |
| No | 12 | 7 | 5 | 0.49 |
Weighted average: (8/20) × 0.22 + (12/20) × 0.49 ≈ 0.38
Feature 3: Credit Rating
| Credit | Count | Yes | No | Gini |
|---|---|---|---|---|
| Fair | 10 | 7 | 3 | 0.42 |
| Good | 10 | 1 | 9 | 0.18 |
Weighted average: (10/20) × 0.42 + (10/20) × 0.18 = 0.30
Feature 4: Income Level
| Income | Count | Yes | No | Gini |
|---|---|---|---|---|
| Low | 5 | 3 | 2 | 0.48 |
| Medium | 9 | 3 | 6 | 0.44 |
| High | 6 | 2 | 4 | 0.44 |
Weighted average: (5/20) × 0.48 + (9/20) × 0.44 + (6/20) × 0.44 = 0.45
Step 3: Select the Best Feature
| Feature | Weighted Gini |
|---|---|
| Age | 0.43 |
| Home Owner | 0.38 |
| Credit Rating ✓ | 0.30 |
| Income | 0.45 |
Credit Rating has the lowest weighted Gini (0.30), so it becomes the root node.
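As a sanity check, this sketch recomputes the weighted Gini for all four features directly from the (Yes, No) counts in the tables above; the dictionary layout and helper name are just one way to organize it:

```python
def gini(yes, no):
    """Gini Index of a branch containing `yes` defaults and `no` non-defaults."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

# (Yes, No) counts per branch, taken from the feature tables above.
features = {
    "Age":        {"Young": (3, 4), "Middle": (1, 5), "Senior": (4, 3)},
    "Home Owner": {"Yes": (1, 7), "No": (7, 5)},
    "Credit":     {"Fair": (7, 3), "Good": (1, 9)},
    "Income":     {"Low": (3, 2), "Medium": (3, 6), "High": (2, 4)},
}

n_total = 20
for name, branches in features.items():
    weighted = sum((yes + no) / n_total * gini(yes, no) for yes, no in branches.values())
    print(f"{name}: {weighted:.2f}")

# Output (lowest wins):
# Age: 0.43
# Home Owner: 0.38
# Credit: 0.30
# Income: 0.45
```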
Step 4: Recursive Splitting
Next, for the "Credit = Fair" branch (10 samples) and "Credit = Good" branch (10 samples), we repeat the same process to select the next best splitting feature...
This continues until:
- Leaf nodes are completely pure (Gini = 0), or
- A stopping condition is reached (max depth, min samples, etc.)
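These stopping conditions map directly onto hyperparameters of scikit-learn's DecisionTreeClassifier. A minimal sketch, using tiny placeholder arrays rather than the loan table (scikit-learn trees expect numeric features, so the categorical columns above would first need encoding):

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder numeric features and labels, just to make the snippet runnable.
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 2], [0, 2]]
y = [1, 0, 0, 1, 0, 1]

clf = DecisionTreeClassifier(
    criterion="gini",     # CART-style impurity (the default)
    max_depth=3,          # stop: limit the depth of the tree
    min_samples_split=2,  # stop: a node needs at least this many samples to split
    min_samples_leaf=1,   # stop: every leaf must keep at least this many samples
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```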
Gini Index vs. Entropy
| Metric | Formula | Range | Notes |
|---|---|---|---|
| Entropy | -Σ p_k log₂(p_k) | 0 ~ 1 (binary) | Information-theory basis, measured in bits |
| Gini Index | 1 - Σ (p_k)² | 0 ~ 0.5 (binary) | No logarithms, faster to compute |
In practice, they produce nearly identical trees. Unless you have specific theoretical requirements, either works well.
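One quick way to see why: for a binary split with a "Yes" proportion of p, both metrics are zero at p = 0 or p = 1 and peak at p = 0.5, so they rank candidate splits in much the same order. A small illustrative script:

```python
from math import log2

def entropy(p):
    """Binary entropy in bits; taken as 0 at p = 0 or p = 1."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    """Binary Gini Index: 1 - p² - (1 - p)²."""
    return 1 - p**2 - (1 - p) ** 2

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
```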
Key Takeaways
- Decision trees split data using a series of questions, aiming for increasingly pure groups
- Gini Index measures the probability of picking two samples of different classes
- Feature selection: choose the feature that minimizes weighted Gini after splitting
- Gini vs. Entropy: different formulas, nearly identical results in practice