Learn advanced techniques for handling continuous attributes and missing data in decision trees: essential skills for messy, real-world datasets
Basic decision tree algorithms (like ID3) assume categorical features with complete data. However, real-world datasets contain continuous attributes (age, price, temperature) and missing values (incomplete surveys, sensor failures, data entry errors). Advanced algorithms like C4.5 and CART include sophisticated methods to handle both challenges.
Features with infinite possible values within a range (e.g., income: $20k-$200k, age: 18-100 years).
Challenges: Can't create branches for every value; need thresholds or discretization.
Samples where some feature values are unknown, marked as NULL, ?, NaN, or empty.
Challenges: Can't test missing attributes; need strategies for splitting and prediction.
Discretization and threshold-based splitting
There are two main approaches to handle continuous features: (1) Pre-discretization—convert to categorical before tree building, or (2) Dynamic threshold selection—find optimal split points during tree construction. Modern algorithms use the second approach for better performance.
Convert continuous attributes into categorical ones by defining ranges (bins). This is done before tree construction.
Divide the range into k equal-sized intervals.
Example: Age 0-100 → 5 bins: [0-20), [20-40), [40-60), [60-80), [80-100]
Simple, but sensitive to outliers
Each bin contains approximately the same number of samples.
Example: 100 samples → 4 bins with 25 samples each
Better handles skewed distributions
Use meaningful thresholds based on domain expertise.
Example: BMI → Underweight (<18.5), Normal (18.5-25), Overweight (25-30), Obese (>30)
Most interpretable, requires expertise
Trade-offs: Equal-width binning is the simplest but is sensitive to outliers; equal-frequency binning handles skewed distributions better; domain-based bins are the most interpretable but require expertise. All three fix the bins before training, which can lose information compared with dynamic thresholds (a code sketch of the three strategies follows below).
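The three binning strategies can be sketched in a few lines with pandas/NumPy; the age and BMI arrays here are made-up illustrative data:

```python
import numpy as np
import pandas as pd

ages = np.array([23, 35, 41, 52, 67, 29, 74, 88, 19, 45])

# 1. Equal-width: 5 bins of width 20 over the 0-100 range, left-closed like [0, 20)
equal_width = pd.cut(ages, bins=np.arange(0, 101, 20), right=False)

# 2. Equal-frequency: 4 bins, each holding roughly a quarter of the samples
equal_freq = pd.qcut(ages, q=4)

# 3. Domain-based: thresholds taken from expert knowledge (the BMI categories above)
bmi = np.array([17.8, 22.4, 26.1, 31.5, 24.0])
bmi_cat = pd.cut(bmi, bins=[0, 18.5, 25, 30, np.inf],
                 labels=["Underweight", "Normal", "Overweight", "Obese"])

print(equal_width, equal_freq, bmi_cat, sep="\n\n")
```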
During tree construction, dynamically find the optimal split threshold for each continuous attribute at each node. This creates binary splits: (attribute ≤ threshold) vs (attribute > threshold).
Sorted mileage values: 28000, 32000, 41000, 45000, 52000, 61000, 78000, 95000
Candidate 1: mileage ≤ 30000 (1 sample) vs > 30000 (7 samples)
Gain = 0.12 (very imbalanced, low gain)
Candidate 2: mileage ≤ 43000 (3 samples) vs > 43000 (5 samples)
Gain = 0.38 (good balance, high gain) ✓ SELECTED
Candidate 3: mileage ≤ 70000 (6 samples) vs > 70000 (2 samples)
Gain = 0.21 (imbalanced again)
Result: Split becomes "mileage ≤ 43000" (Low) vs "mileage > 43000" (High).
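A minimal sketch of this search, assuming candidate thresholds at the midpoints between adjacent sorted values and entropy-based information gain. The price-band labels below are made up for illustration (the gain figures above are also illustrative), so only the mechanics matter:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_threshold(values, labels):
    """Try every midpoint between adjacent sorted values as a binary split
    (value <= threshold vs. value > threshold) and keep the highest-gain one."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best_thr, best_gain = None, -1.0
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                      # no split point between equal values
        thr = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - weighted > best_gain:
            best_thr, best_gain = thr, base - weighted
    return best_thr, best_gain

mileage = np.array([28000, 32000, 41000, 45000, 52000, 61000, 78000, 95000])
band = np.array(["High", "High", "High", "Low", "Low", "Low", "Low", "Low"])  # toy labels
print(best_threshold(mileage, band))  # -> (43000.0, ~0.95): midpoint of 41000 and 45000
```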
Advantages: Thresholds are learned from the data at each node rather than fixed in advance, every split is a simple binary test that is easy to interpret, and no manual binning (with its attendant information loss) is required.
Unlike categorical attributes (which are "used up" after splitting), continuous attributes can be split multiple times at different thresholds in different parts of the tree. Example: Root splits on "age ≤ 30", later a subtree splits on "age ≤ 50". This allows modeling complex non-linear relationships.
Predicting car price categories (Low/Mid/High) using mixed features including continuous mileage and year:
| ID | Make | Mileage | Year | Price | Condition | Price Cat |
|---|---|---|---|---|---|---|
| 1 | Honda | 45,000 | 2018 | $18,500 | Good | Mid |
| 2 | Toyota | 32,000 | 2019 | $22,000 | Excellent | High |
| 3 | Ford | 78,000 | 2015 | $12,000 | Fair | Low |
| 4 | BMW | 52,000 | 2017 | $25,000 | Good | High |
| 5 | Chevrolet | 95,000 | 2013 | $8,500 | Fair | Low |
| 6 | Honda | 28,000 | 2020 | $24,000 | Excellent | High |
| 7 | Ford | 61,000 | 2016 | $14,500 | Good | Mid |
| 8 | Toyota | 41,000 | 2018 | $19,500 | Good | Mid |
The algorithm evaluates every attribute: Make (categorical), Mileage (continuous), Year (continuous), and Condition (categorical).
Best split found: Mileage ≤ 50000 (Gain = 0.42)
Samples: IDs 1, 2, 6, 8 (all Mid and High prices)
Next best split: Year ≤ 2018 (separates Mid from High)
Samples: IDs 3, 4, 5, 7 (mostly Low and Mid prices)
Next split: Condition = Fair vs others
Notice: Mileage used once at root, Year used once in subtree. Both are continuous but treated naturally by the algorithm—no manual binning required!
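As a rough illustration with scikit-learn (whose trees are CART-based), the table above can be used directly after one-hot encoding the categorical columns; Mileage and Year need no binning. The thresholds in the printed tree may differ from the illustrative gain figures in the walkthrough above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 8 cars from the table above
cars = pd.DataFrame({
    "Make":      ["Honda", "Toyota", "Ford", "BMW", "Chevrolet", "Honda", "Ford", "Toyota"],
    "Mileage":   [45000, 32000, 78000, 52000, 95000, 28000, 61000, 41000],
    "Year":      [2018, 2019, 2015, 2017, 2013, 2020, 2016, 2018],
    "Condition": ["Good", "Excellent", "Fair", "Good", "Fair", "Excellent", "Good", "Good"],
    "PriceCat":  ["Mid", "High", "Low", "High", "Low", "High", "Mid", "Mid"],
})

# Trees need numeric inputs, so one-hot encode Make and Condition;
# the continuous Mileage and Year columns pass through untouched.
X = pd.get_dummies(cars[["Make", "Mileage", "Year", "Condition"]])
y = cars["PriceCat"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```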
Strategies for incomplete data in decision trees
Missing values are ubiquitous in real data: sensor failures, survey non-responses, data entry errors, privacy redactions. Simple approaches like dropping samples or imputing mean values lose information. Advanced decision tree algorithms (particularly C4.5) have built-in methods to handle missing values during both training and prediction.
When calculating information gain for an attribute with missing values, how do we account for samples where that attribute is unknown? We can't test them on that attribute.
When we split on an attribute, to which child node should we send the samples whose value for that attribute is missing? We don't know which branch they should take.
When evaluating an attribute with missing values, C4.5 calculates information gain using only samples with known values, then scales down the gain proportionally to the fraction of known values.
Adjusted_Gain(D, a) = ρ × Gain(D̃, a)
Where: ρ = |D̃| / |D| (fraction of samples with known value for attribute a)
D̃ = subset of D where attribute a is not missing
Dataset D: 100 samples. Attribute "Blood Pressure" is missing for 20 samples.
D̃ = 80 samples with known Blood Pressure
ρ = 80/100 = 0.8
Gain(D̃, BloodPressure) = 0.35 (calculated on 80 samples)
Adjusted_Gain(D, BloodPressure) = 0.8 × 0.35 = 0.28
The gain is penalized because 20% of samples can't be used. Attributes with fewer missing values are naturally favored.
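A minimal sketch of the adjusted-gain calculation for a categorical attribute, assuming a pandas DataFrame in which missing entries are None/NaN; the function and column names are just for illustration:

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy of a label column."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def adjusted_gain(df, attribute, target):
    """C4.5-style gain: compute information gain on the rows where the
    attribute is known, then scale it by rho = |known| / |all|."""
    known = df[df[attribute].notna()]
    rho = len(known) / len(df)
    weighted = sum(len(sub) / len(known) * entropy(sub[target])
                   for _, sub in known.groupby(attribute))
    return rho * (entropy(known[target]) - weighted)

# Toy data: 2 of 10 blood-pressure readings are missing, so rho = 0.8
toy = pd.DataFrame({
    "bp":    ["high", "high", "low", None, "low", "high", None, "low", "low", "high"],
    "label": ["sick", "sick", "ok",  "ok", "ok",  "sick", "sick", "ok", "ok",  "sick"],
})
print(adjusted_gain(toy, "bp", "label"))  # 0.8 x 1.0 = 0.8 on this toy split
```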
When splitting on an attribute, samples with missing values for that attribute are distributed to all child nodes with fractional weights proportional to the number of known-value samples going to each child.
Example: Split on "Exercise Level" (Low/Moderate/High). 60 known samples split: 20 to Low, 25 to Moderate, 15 to High.
Weight for Low child: 20/60 = 0.333
Weight for Moderate child: 25/60 = 0.417
Weight for High child: 15/60 = 0.250
10 samples have missing Exercise Level. Each of these 10 samples is sent to all three children: to Low with weight 0.333, to Moderate with weight 0.417, and to High with weight 0.250.
Each missing-value sample contributes fractionally to all children. This maintains information—the sample isn't discarded, but its influence is spread according to the probability distribution learned from known-value samples.
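A sketch of this fractional distribution, assuming each row carries a weight column (all 1.0 before any split), which is how C4.5 tracks partial membership; the function and column names are illustrative:

```python
import pandas as pd

def split_with_missing(df, attribute, weight_col="weight"):
    """Send rows with a known attribute value to their branch at full weight;
    copy rows with a missing value into every branch, scaling their weight by
    that branch's share of the known-value weight (C4.5 style)."""
    known = df[df[attribute].notna()]
    missing = df[df[attribute].isna()]
    shares = known.groupby(attribute)[weight_col].sum()
    shares = shares / shares.sum()
    children = {}
    for value, share in shares.items():
        ghosts = missing.copy()
        ghosts[weight_col] = ghosts[weight_col] * share   # fractional membership
        children[value] = pd.concat([known[known[attribute] == value], ghosts])
    return children

# Toy data mirroring the example: 20/25/15 known Low/Moderate/High, 10 missing
exercise = ["Low"] * 20 + ["Moderate"] * 25 + ["High"] * 15 + [None] * 10
df = pd.DataFrame({"exercise": exercise, "weight": 1.0})
for value, child in split_with_missing(df, "exercise").items():
    print(value, round(child["weight"].sum(), 2))
# total child weights: High 17.5, Low 23.33, Moderate 29.17
```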
CART uses surrogate splits—backup attributes that correlate highly with the primary split attribute. During prediction, if the primary attribute is missing, use a surrogate.
Primary split: "Blood Pressure ≤ 130"
But for some patients, BP is missing. The algorithm finds correlated attributes.
Surrogate split 1: "BMI ≤ 27" (90% agreement with BP split)
If BP is missing but BMI is known, use BMI split instead.
Surrogate split 2: "Age ≤ 45" (75% agreement)
If both BP and BMI are missing but Age is known, use Age.
Benefit: Handles missing values during prediction without needing all features. Particularly useful in production where data collection might be incomplete.
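The heart of surrogate selection is measuring how often a candidate split routes samples to the same side as the primary split; a CART implementation ranks candidates by this agreement and falls back through them in order. A minimal sketch with made-up systolic blood pressure and BMI readings (the thresholds mirror the example above):

```python
import numpy as np

def agreement(primary_vals, primary_thr, cand_vals, cand_thr):
    """Fraction of samples, among those where both features are known, that the
    candidate split sends to the same side as the primary split."""
    known = ~np.isnan(primary_vals) & ~np.isnan(cand_vals)
    same = (primary_vals[known] <= primary_thr) == (cand_vals[known] <= cand_thr)
    return same.mean()

# Illustrative systolic BP and BMI readings; NaN marks missing values
bp  = np.array([140, np.nan, 150, 120, 138, 115, np.nan, 118, 128, 145])
bmi = np.array([28.5, 22.1, 31.2, 24.3, 29.8, 21.5, np.nan, 23.7, 27.5, 26.0])
print(agreement(bp, 130, bmi, 27))  # how well "BMI <= 27" mimics "BP <= 130" (0.75 here)
```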
Predicting health risk from patient data with missing values (marked as "?"):
| ID | Age | Blood Pressure | Glucose | BMI | Smoking | Diagnosis |
|---|---|---|---|---|---|---|
| 1 | 45 | 140/90 | 180 | 28.5 | Yes | At Risk |
| 2 | 32 | ? | 95 | 22.1 | No | Healthy |
| 3 | 58 | 150/95 | ? | 31.2 | Yes | At Risk |
| 4 | 41 | 120/80 | 88 | 24.3 | No | Healthy |
| 5 | ? | 138/88 | 165 | 29.8 | Yes | At Risk |
| 6 | 29 | 115/75 | 82 | 21.5 | No | Healthy |
| 7 | 52 | ? | 192 | ? | Yes | At Risk |
| 8 | 36 | 118/78 | 91 | 23.7 | No | Healthy |
New patient: Age=50, BP=?, Glucose=155, BMI=28, Smoking=Yes. With Blood Pressure missing at prediction time, CART would fall back to a surrogate split (e.g., on BMI), while C4.5 would send the patient down every Blood Pressure branch with fractional weights and combine the branch predictions.
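For this new patient, one concrete option is recent scikit-learn (version 1.3 or later, which added native NaN handling to DecisionTreeClassifier; it routes missing values to a learned preferred child rather than using CART surrogates or C4.5 fractional weights). A minimal sketch, with blood pressure reduced to its systolic value and Smoking encoded as 0/1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Feature order: Age, Systolic BP, Glucose, BMI, Smoking (1 = Yes); NaN marks "?"
X = np.array([
    [45, 140, 180, 28.5, 1],
    [32, np.nan, 95, 22.1, 0],
    [58, 150, np.nan, 31.2, 1],
    [41, 120, 88, 24.3, 0],
    [np.nan, 138, 165, 29.8, 1],
    [29, 115, 82, 21.5, 0],
    [52, np.nan, 192, np.nan, 1],
    [36, 118, 91, 23.7, 0],
])
y = ["At Risk", "Healthy", "At Risk", "Healthy",
     "At Risk", "Healthy", "At Risk", "Healthy"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
new_patient = [[50, np.nan, 155, 28.0, 1]]   # blood pressure unknown at prediction time
print(clf.predict(new_patient))
```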
| Algorithm | Continuous Attributes | Missing Values | Year & Typical Use |
|---|---|---|---|
| ID3 | Manual discretization required | Not supported (drop samples) | 1986, educational use |
| C4.5 | Dynamic thresholds, reusable | Adjusted gain + probabilistic distribution | 1993, research standard |
| CART | Dynamic thresholds, binary splits | Surrogate splits | 1984, industry default |
Q: Do I need to discretize continuous features before training a decision tree in scikit-learn?
A: No! Scikit-learn's implementation (based on CART) automatically handles continuous features using dynamic threshold selection. Binning beforehand loses information and typically hurts performance. Only bin if you have a specific reason (e.g., domain knowledge suggests natural categories, or you need fewer split points for interpretability).
Q: What if my algorithm or library doesn't support missing values?
A: If using basic ID3 or a simple implementation, options include: (1) Drop samples with missing values (simple but loses data), (2) Impute the mean/median for continuous features and the mode for categorical ones (adds artificial certainty), (3) Create a "missing" category for categorical features (treats missingness as information), (4) Multiple imputation (run multiple trees with different imputations and average). For production, use C4.5, CART, or modern libraries with built-in missing value support.
Q: How many candidate thresholds must be evaluated for a continuous attribute?
A: In the worst case, for n samples there are (n-1) possible split points (one between each pair of adjacent sorted values). However, there is an optimization: if consecutive samples share the same class label, you don't need to check splits between them. Real datasets often have far fewer useful candidates. For very large datasets, some implementations sample a subset of candidate splits for speed.
Q: Can the same continuous attribute be split on more than once in a tree?
A: Yes! This is a key advantage over categorical attributes. For example, the root might split on "age ≤ 30", creating young vs old groups. Later, the "old" subtree might split again on "age ≤ 60" to distinguish middle-aged from elderly. This allows modeling complex non-monotonic relationships that categorical features can't capture in a single split.
Q: How should I handle high-cardinality categorical features like zip codes?
A: High-cardinality categoricals can be problematic (multi-value bias in ID3/C4.5, or many binary splits in CART). Solutions: (1) Group into meaningful categories (zip codes → regions), (2) Target encoding (replace each category with the target mean for that category, converting the feature to a continuous one), (3) Use CART with binary splits (better than multi-way), (4) Feature engineering (extract useful information like "urban vs rural" from the zip code). Avoid letting the tree memorize individual rare categories.
Continuous features handled via discretization (bins) or dynamic thresholds (better)
C4.5 and CART automatically find optimal split thresholds during tree construction
Continuous attributes can be reused at different nodes with different thresholds
Missing values: C4.5 uses adjusted gain and probabilistic distribution
CART uses surrogate splits as backup attributes for missing primary features
Built-in missing value handling avoids data loss and false certainty from imputation
Modern libraries handle continuous attributes automatically, and recent scikit-learn releases (1.3+) add native missing-value support to decision trees; avoid unnecessary manual preprocessing
Real-world messy data is the norm; these techniques make trees practical