
Entropy and Information Gain in Decision Trees: A Practical Guide

How decision tree algorithms decide which questions to ask first

2026-01-18
Decision Trees
Entropy
Information Gain
Machine Learning
Classification

Here's a question that kept early machine learning researchers up at night: if you're building an algorithm to sort data into categories, which question should it ask first? Ask the wrong one, and you waste precious splits. Ask the right one, and suddenly the problem becomes trivially easy.

This isn't just an academic puzzle. It's the foundation of how decision trees actually work—and the answer comes down to two concepts borrowed from information theory: entropy and information gain.

What Decision Trees Are Really Doing

Before we get into the math, let's be clear about what we're trying to accomplish. A decision tree is essentially a flowchart that asks a series of simple questions about your data (often yes-or-no, sometimes one branch per category). Each question splits the data into smaller groups. The end goal? Get to a point where each group contains only one type of thing.

Think about how you'd organize a messy garage. You might start by separating "things I use regularly" from "things I haven't touched in years." That's a high-value split—it immediately makes both piles more manageable. Separating items by color? Probably not as useful.

The algorithm faces the same challenge. Given a dozen possible questions it could ask, it needs a systematic way to figure out which one will clean up the data fastest.

Entropy: Measuring How Messy Your Data Is

Entropy, in the context of machine learning, measures impurity. A dataset with high entropy is messy—mixed up, unpredictable. A dataset with low entropy is clean—mostly or entirely one category.

Here's the intuition: imagine you have a bag of marbles. If every marble is red, and someone asks you to guess the color of the next marble you'll pull out, that's easy. Zero uncertainty. But if the bag is half red and half blue? Now you're genuinely unsure. That uncertainty is what entropy captures.

The entropy formula for binary classification:

$$H(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$$

Where p₁ is the proportion of positive cases and p₀ is the proportion of negative cases. The log base 2 gives us units in bits—handy because one bit corresponds to one yes-or-no question.

A few things worth noting about this formula:

  • Entropy hits its maximum at a 50/50 split. When you have equal numbers of both classes, uncertainty is as high as it gets. H = 1 bit.
  • Entropy drops to zero when data is pure. If 100% of your samples belong to one class, there's nothing to predict. H = 0.
  • The logarithm matters. It isn't arbitrary: logarithms turn products of probabilities into sums, which is why the information from independent observations simply adds up.
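
To see the formula in action, here is a minimal Python sketch; the `binary_entropy` name and the example proportions are just for illustration, not taken from any particular library:

```python
import math

def binary_entropy(p_positive):
    """Entropy in bits of a two-class set, given the proportion of positive cases."""
    # Probabilities of 0 or 1 contribute nothing, matching the convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in (p_positive, 1 - p_positive) if 0 < p < 1)

print(binary_entropy(0.5))   # 1.0 -- maximum uncertainty at a 50/50 split
print(binary_entropy(1.0))   # 0   -- a pure set has nothing left to predict
print(binary_entropy(0.8))   # ~0.72 -- mostly one class, so much less uncertainty
```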

Information Gain: Finding Questions That Actually Help

Now here's where it gets practical. We know how to measure messiness. The next step is measuring how much a specific question reduces that messiness.

That's information gain. It's calculated by comparing the entropy before a split to the weighted average entropy after the split. If asking "Is X greater than 5?" takes your entropy from 0.9 down to 0.4, you've gained 0.5 bits of information.

Information Gain formula:

$$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot H(S_v)$$

In plain English: take the original entropy, then subtract the weighted average of the child entropies. Whatever's left is how much information you gained.
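
Here is a small sketch of that calculation in Python. The `entropy` and `information_gain` names are ours, and the inputs are assumed to be plain Python lists of equal length, one label and one feature value per sample:

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy in bits of a list of class labels (works for any number of classes)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """H(S) minus the weighted average entropy of the subsets the feature creates."""
    n = len(labels)
    weighted_children = 0.0
    for value in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == value]
        weighted_children += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted_children

# Toy check: a feature that splits four mixed labels into two pure halves
# recovers the full 1 bit of starting entropy.
print(information_gain(["Yes", "Yes", "No", "No"], ["A", "A", "B", "B"]))  # 1.0
```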

Working Through a Real Example

Let's make this concrete. We'll use a classic dataset from machine learning textbooks: predicting whether someone will play tennis based on the weather.

| Day | Outlook | Temp | Humidity | Wind | Play? |
|-----|----------|------|----------|--------|-------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

We have 14 observations: 9 days where tennis was played, 5 days where it wasn't. That's roughly a 64/36 split.

Calculating the Starting Entropy

First, let's figure out how much uncertainty we're dealing with before making any splits:

$$p_{Yes} = 9/14 = 0.643$$

$$p_{No} = 5/14 = 0.357$$

$$H(S) = -(0.643)(\log_2 0.643) - (0.357)(\log_2 0.357)$$

$$H(S) = 0.940 \text{ bits}$$

So we're starting with 0.94 bits of entropy. Not quite at the maximum of 1.0, but pretty darn mixed.
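
A quick check in Python confirms the hand calculation:

```python
import math

p_yes, p_no = 9 / 14, 5 / 14
print(-(p_yes * math.log2(p_yes) + p_no * math.log2(p_no)))  # ~0.940 bits
```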

Testing "Outlook" as Our First Split

What happens if we split on the Outlook feature? It has three possible values: Sunny, Overcast, and Rain. Let's see how each branch looks:

  • Sunny (5 days): 2 Yes, 3 No → H = 0.971
  • Overcast (4 days): 4 Yes, 0 No — pure! → H = 0
  • Rain (5 days): 3 Yes, 2 No → H = 0.971

Notice something interesting? The Overcast branch is completely pure. Every single overcast day in our dataset resulted in tennis being played. That branch is done—no further splitting needed.

The weighted average entropy after this split:

$$H(S|Outlook) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.694$$

And our information gain:

$$Gain(S, Outlook) = 0.940 - 0.694 = 0.246 \text{ bits}$$
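
The same arithmetic is easy to reproduce in Python, reusing the `binary_entropy` sketch from earlier; the branch counts come straight from the table above:

```python
h_sunny    = binary_entropy(2 / 5)   # 2 Yes, 3 No  -> ~0.971
h_overcast = binary_entropy(4 / 4)   # 4 Yes, 0 No  -> 0 (pure)
h_rain     = binary_entropy(3 / 5)   # 3 Yes, 2 No  -> ~0.971

weighted = (5 / 14) * h_sunny + (4 / 14) * h_overcast + (5 / 14) * h_rain
gain = binary_entropy(9 / 14) - weighted
print(f"weighted = {weighted:.3f}, gain = {gain:.3f}")
# weighted = 0.694, gain = 0.247 (the 0.246 above comes from rounding the
# intermediate entropies to three decimals before subtracting)
```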

Comparing All the Features

We'd repeat this calculation for every feature—Temperature, Humidity, Wind—and compare the results:

| Feature | Information Gain (bits) |
|-------------|-------|
| Outlook | 0.246 |
| Humidity | 0.152 |
| Wind | 0.048 |
| Temperature | 0.029 |

Outlook wins. It provides the most information, so it becomes our root node. The algorithm then recursively applies the same logic to each non-pure branch, building out the tree until every leaf node contains only one class.
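
For completeness, here is the whole comparison reproduced in Python, reusing the `entropy` and `information_gain` helpers from the earlier sketch; the dataset is just the 14-row table typed out by hand:

```python
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

features = {
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
}

ranked = sorted(features, key=lambda name: information_gain(play, features[name]),
                reverse=True)
for name in ranked:
    print(f"{name:12s} {information_gain(play, features[name]):.3f}")
# Outlook      0.247   (0.246 above, up to rounding of the intermediate values)
# Humidity     0.152
# Wind         0.048
# Temperature  0.029
```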

Why This Matters Beyond the Textbook

Understanding entropy and information gain isn't just about passing a machine learning exam. These concepts show up everywhere:

  • Feature selection: Information gain helps identify which variables actually matter in your dataset and which ones are just noise.
  • Model interpretability: Unlike black-box models, decision trees give you a clear, auditable trail of logic. You can explain why it made a particular prediction.
  • Data exploration: High information gain on a feature tells you it has strong predictive power. That's useful even if you end up using a different algorithm.

Common Questions About Entropy Splitting

What about Gini impurity? Many implementations (including scikit-learn by default) use Gini impurity instead of entropy. The two measures are similar—both penalize impurity—but Gini is slightly faster to compute since it doesn't involve logarithms. In practice, they usually produce nearly identical trees.
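
If you want to see how close the two measures really are, a small self-contained sketch compares them at a few class proportions (function names ours):

```python
import math

def binary_entropy(p):   # same helper as in the first sketch
    return -sum(x * math.log2(x) for x in (p, 1 - p) if 0 < x < 1)

def gini(p):
    """Gini impurity of a binary split: 1 minus the sum of squared class proportions."""
    return 1 - (p ** 2 + (1 - p) ** 2)

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p = {p:.2f}   entropy = {binary_entropy(p):.3f}   gini = {gini(p):.3f}")
```

In scikit-learn, the choice is exposed through the `criterion` parameter of `DecisionTreeClassifier`: `"gini"` is the default, with `"entropy"` available as an alternative.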

Does the highest gain always give the best tree? Not necessarily. The algorithm is greedy—it picks the best split at each step without looking ahead. Sometimes a locally suboptimal split leads to a better overall tree. That said, for most practical purposes, greedy splitting works remarkably well.

Can entropy handle continuous variables? Yes, but you need to test different threshold values. For a feature like "age," the algorithm would evaluate splits like "age < 30," "age < 35," and so on, calculating information gain for each potential cutoff.
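
Here is a rough sketch of that threshold search, reusing the `information_gain` helper from earlier. The `ages` data and the `best_threshold` helper are made up purely for illustration:

```python
ages   = [22, 25, 28, 31, 35, 40, 47, 52]                       # hypothetical feature
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]  # hypothetical labels

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values and keep the best cutoff."""
    points = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(points, points[1:])]
    scored = []
    for t in midpoints:
        # Turn the continuous feature into a two-way split: <= t vs > t.
        binarized = ["low" if v <= t else "high" for v in values]
        scored.append((information_gain(labels, binarized), t))
    return max(scored)  # (gain, threshold) with the highest gain

gain, threshold = best_threshold(ages, labels)
print(f"best cutoff: age <= {threshold}  (gain = {gain:.3f} bits)")
# best cutoff: age <= 29.5  (gain = 0.954 bits)
```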

Wrapping Up

The elegance of entropy-based splitting lies in its simplicity. Ask "how messy is my data?", make a split, ask "how much cleaner did it get?", and repeat. That's the entire decision tree algorithm at its core.

Of course, practical implementations add complexity—pruning to prevent overfitting, handling missing values, supporting regression tasks—but the fundamental idea remains: ask questions that reduce uncertainty as quickly as possible.

Ready to learn more?

Our full Decision Trees course covers advanced topics like pruning strategies, handling overfitting, and when to choose decision trees over other algorithms.
