
Multivariate Decision Trees

Learn advanced decision trees that use linear combinations of features for oblique splits, creating more expressive and accurate decision boundaries.

Beyond Axis-Parallel Splits

Advanced Topic

Traditional decision trees (univariate/axis-parallel) test one feature at a time at each node (e.g., "income > 50k?" or "age ≤ 30?"). This creates decision boundaries parallel to the feature axes.

Multivariate decision trees (also called oblique decision trees) test linear combinations of multiple features at each node (e.g., "2×income + 3×spending > 250?"), creating diagonal (oblique) decision boundaries that can better fit complex data patterns.

Univariate (Axis-Parallel)

Each node tests a single feature. Splits are perpendicular to feature axes, creating rectangular regions.

Example splits:
• income ≤ 50000
• age > 30
• spending ≤ 2000

Limitations: May need many splits to approximate diagonal boundaries; less expressive.

Multivariate (Oblique)

Each node tests a weighted combination of features. Splits can be diagonal, creating more flexible regions.

Example splits:
• 0.6×income + 0.4×spending ≤ 50
• 2×age - 1×loyalty > 60
• income - 2×spending ≤ 10

Advantages: More compact trees; better for correlated features; can model complex boundaries directly.

Key Intuition

Imagine classifying customers based on income and spending. An axis-parallel tree might create many rectangular regions with rules like "if income > 50k AND spending > 30k...". A multivariate tree could capture the same pattern more elegantly with a single oblique split: "if 0.5×income + 0.5×spending > 40k, then premium customer". This is especially powerful when features are naturally correlated.

Mathematical Formulation

Linear discriminants at tree nodes

In a multivariate decision tree, each internal node performs a test of the form:

Multivariate Split Condition

w₁x₁ + w₂x₂ + ... + w_dx_d ≤ t

or, more compactly: wᵀx ≤ t

Where:

  • x = [x₁, x₂, ..., x_d] is the feature vector (d features)
  • w = [w₁, w₂, ..., w_d] is the weight (coefficient) vector
  • t is the threshold
  • Samples satisfying the condition go to the left child; all others go to the right child (as sketched below)
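As a minimal sketch, the node test is just a dot product followed by a threshold comparison; the weights and threshold below are hypothetical values for an (income, spending) pair, not parameters fit to any data:

    import numpy as np

    # Hypothetical node parameters: weights for (income, spending) and a threshold.
    w = np.array([0.6, 0.4])
    t = 50.0

    def route(x, w, t):
        """Send a sample to the left child if w @ x <= t, otherwise to the right child."""
        return "left" if np.dot(w, x) <= t else "right"

    print(route(np.array([45.0, 30.0]), w, t))   # 0.6*45 + 0.4*30 = 39 -> left
    print(route(np.array([85.0, 70.0]), w, t))   # 0.6*85 + 0.4*70 = 79 -> right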

Geometric Interpretation

Univariate Split

Test: x₁ ≤ t (e.g., income ≤ 50k)

In 2D feature space (income, spending), this creates a vertical line at income=50k. Decision boundary is perpendicular to the x-axis.

Weights: w = [1, 0]

Equation: 1×income + 0×spending = 50

Multivariate Split

Test: 0.6×income + 0.4×spending ≤ 50

Creates a diagonal line through the 2D space. Decision boundary has non-zero slope, capturing correlations between features.

Weights: w = [0.6, 0.4]

Rearranged: spending = -1.5×income + 125
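As a quick numeric check (a sketch reusing the same weights and threshold), points on the rearranged line spending = -1.5×income + 125 land exactly on the split boundary:

    import numpy as np

    w, t = np.array([0.6, 0.4]), 50.0
    # Points on spending = -1.5*income + 125 should satisfy 0.6*income + 0.4*spending = 50.
    for income in (0.0, 50.0, 100.0):
        spending = -1.5 * income + 125.0
        print(income, spending, np.isclose(w @ np.array([income, spending]), t))   # True each time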

Example: Customer Segmentation

Separating Premium vs Budget customers based on income (in $1000s) and monthly spending (in $100s):

Axis-Parallel Approach (Traditional)

Split 1: income ≤ 60 → majority Budget
Split 2: income > 60 AND spending ≤ 50 → Standard
Split 3: income > 60 AND spending > 50 → Premium

Needs 3 splits to separate classes; creates staircase boundary

Multivariate Approach (Oblique)

Split 1: 0.7×income + 0.3×spending ≤ 55 → Budget
Otherwise → Premium/Standard (one more split)

Single oblique split captures the diagonal trend; more compact tree

How to Find Optimal Weights?

The challenge in multivariate trees is determining the weight vector w and the threshold t. Unlike univariate trees, where each feature is tested independently, here you must search a continuous space of possible linear combinations. Several methods exist:

Method 1: Linear Discriminant Analysis (LDA)

Use LDA to find the direction that maximizes class separability. LDA finds weights that maximize the ratio of between-class variance to within-class variance.

Process:

  1. At each node, apply LDA to find optimal linear discriminant
  2. Use discriminant direction as weight vector w
  3. Find threshold t that best separates classes along this direction
  4. This is the split for the current node
Advantage: Fast, has closed-form solution
Limitation: Assumes Gaussian distributions, works best for binary classification
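A minimal sketch of this method, assuming scikit-learn's LinearDiscriminantAnalysis is available; gini, best_threshold, and lda_split are illustrative helper names (not library functions), and best_threshold is reused by the later sketches:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def gini(labels):
        """Gini impurity of a label array."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_threshold(z, y):
        """Scan midpoints of the sorted 1-D projections for the lowest weighted Gini."""
        order = np.argsort(z)
        z_s, y_s = z[order], y[order]
        best_t, best_score = None, np.inf
        for i in range(1, len(z_s)):
            if z_s[i] == z_s[i - 1]:
                continue                      # no boundary between identical projections
            t = (z_s[i - 1] + z_s[i]) / 2
            score = (i * gini(y_s[:i]) + (len(y_s) - i) * gini(y_s[i:])) / len(y_s)
            if score < best_score:
                best_t, best_score = t, score
        return best_t, best_score

    def lda_split(X, y):
        """Use the leading LDA direction as the weight vector, then pick a threshold."""
        lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
        w = lda.scalings_[:, 0]               # direction maximizing class separability
        t, score = best_threshold(X @ w, y)
        return w, t, score                    # node test: w @ x <= t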

Method 2: Linear Support Vector Machine (SVM)

Train a linear SVM at each node to find the hyperplane that best separates classes with maximum margin.

Process:

  1. At each node, solve SVM optimization: find w, t maximizing margin
  2. Use SVM's weight vector and bias as split parameters
  3. Split samples according to which side of the hyperplane they fall
Advantage: Robust, handles non-linearly separable data (soft margin)
Limitation: Computationally expensive, needs optimization solver at each node
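A similar sketch using scikit-learn's LinearSVC (a soft-margin linear SVM), assuming binary labels at the node; svm_split is an illustrative name, and the only extra step is converting the SVM decision function into the wᵀx ≤ t split form:

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_split(X, y):
        """Fit a soft-margin linear SVM at a node; its hyperplane becomes the oblique split."""
        svm = LinearSVC(C=1.0).fit(X, y)   # binary labels assumed at this node
        w = svm.coef_[0]                   # weight vector of the separating hyperplane
        t = -svm.intercept_[0]             # w @ x + b <= 0  is the same as  w @ x <= -b
        return w, t                        # node test: w @ x <= t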

Method 3: Simulated Annealing / Gradient Descent

Use optimization algorithms to search for weights that minimize impurity (Gini or entropy).

Process:

  1. Initialize random weights w
  2. For different values of w and t, calculate Gini/entropy after split
  3. Use optimization (SA, genetic algorithms, or gradient-based) to find w, t minimizing impurity
  4. Accept the best split found
Advantage: Flexible, directly optimizes tree criterion
Limitation: Very slow, no guarantee of global optimum, sensitive to initialization
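A simulated-annealing-style sketch that perturbs the weight vector and scans thresholds with the best_threshold helper from the LDA sketch above; annealed_split, the cooling schedule, and the iteration count are illustrative choices, not a standard algorithm from any library:

    import numpy as np

    def annealed_split(X, y, n_iter=2000, temp=1.0, cooling=0.995, rng=None):
        """Search weight vectors by simulated annealing; thresholds come from best_threshold."""
        rng = np.random.default_rng(rng)
        d = X.shape[1]
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        t, score = best_threshold(X @ w, y)            # helper from the LDA sketch above
        best = (w, t, score)
        for _ in range(n_iter):
            cand = w + rng.normal(scale=0.1, size=d)   # perturb the current direction
            cand /= np.linalg.norm(cand)
            cand_t, cand_score = best_threshold(X @ cand, y)
            # always accept improvements; accept worse moves with a temperature-dependent probability
            if cand_score < score or rng.random() < np.exp((score - cand_score) / temp):
                w, t, score = cand, cand_t, cand_score
                if score < best[2]:
                    best = (w, t, score)
            temp *= cooling
        return best                                    # (weights, threshold, weighted Gini)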

Method 4: Random Linear Combinations

Generate many random weight vectors, evaluate each, and keep the best (used in some ensemble methods).

Process:

  1. Sample N random weight vectors (e.g., N=100)
  2. For each, find best threshold by sorting samples along that direction
  3. Calculate information gain or Gini reduction for each candidate
  4. Select the linear combination with best impurity reduction
Advantage: Simple, parallelizable, adds diversity in ensembles
Limitation: Not systematic, may miss optimal splits, requires many samples
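A sketch of this strategy, again reusing the best_threshold helper from the LDA sketch above; random_combination_split and the candidate count are illustrative:

    import numpy as np

    def random_combination_split(X, y, n_candidates=100, rng=None):
        """Evaluate random directions; keep the one whose best threshold minimizes Gini."""
        rng = np.random.default_rng(rng)
        d = X.shape[1]
        best_w, best_t, best_score = None, None, np.inf
        for _ in range(n_candidates):
            w = rng.normal(size=d)
            w /= np.linalg.norm(w)                  # only the direction matters; scale folds into t
            t, score = best_threshold(X @ w, y)     # helper from the LDA sketch above
            if score < best_score:
                best_w, best_t, best_score = w, t, score
        return best_w, best_t, best_score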

Practical Consideration

Computational Cost: Finding the optimal multivariate split is NP-hard and much more expensive than finding a univariate split. For d features, univariate trees evaluate O(d) candidate features per node, while multivariate trees must search the continuous space of all possible linear combinations, a vastly larger search space. This is why multivariate trees are less common in practice, even though they can achieve better accuracy with fewer nodes.

Example: Customer Segmentation

An e-commerce company wants to segment customers into Budget, Standard, and Premium tiers based on their behavior:

ID | Income ($k) | Spending ($k) | Age | Loyalty (yrs) | Segment
1 | 45 | 30 | 28 | 2 | Budget
2 | 85 | 70 | 35 | 5 | Premium
3 | 55 | 40 | 42 | 3 | Standard
4 | 95 | 80 | 38 | 6 | Premium
5 | 35 | 25 | 25 | 1 | Budget
6 | 70 | 55 | 45 | 4 | Standard
7 | 50 | 45 | 31 | 3 | Standard
8 | 40 | 28 | 29 | 2 | Budget
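For experimentation, the table can be loaded as NumPy arrays and fed to the split-search sketches above; this snippet is only a convenience, and the commented call shows one way to try it on the income/spending columns:

    import numpy as np

    # The eight customers from the table above: income ($k), spending ($k), age, loyalty (yrs).
    X = np.array([
        [45, 30, 28, 2],
        [85, 70, 35, 5],
        [55, 40, 42, 3],
        [95, 80, 38, 6],
        [35, 25, 25, 1],
        [70, 55, 45, 4],
        [50, 45, 31, 3],
        [40, 28, 29, 2],
    ], dtype=float)
    y = np.array(["Budget", "Premium", "Standard", "Premium",
                  "Budget", "Standard", "Standard", "Budget"])

    # Example: search for an oblique root split on the income/spending columns.
    # w, t, score = random_combination_split(X[:, :2], y)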

Univariate Tree (Axis-Parallel)

Root:

income ≤ 60?

Left → mostly Budget, Right → Standard/Premium mixed

Right child:

spending ≤ 50?

Left → Standard, Right → Premium

Statistics:

  • Tree depth: 3
  • Total nodes: 7
  • Accuracy: 87%

Multivariate Tree (Oblique)

Root:

0.6×income + 0.4×spending ≤ 48?

Left → Budget, Right → Standard/Premium

Right child:

0.7×spending + 0.3×loyalty ≤ 60?

Left → Standard, Right → Premium

Statistics:

  • Tree depth: 2 ✓
  • Total nodes: 5 ✓
  • Accuracy: 92% ✓

Why Multivariate Works Better Here

  • Correlated features: Income and spending are naturally correlated—higher earners tend to spend more. A single oblique split captures this relationship directly, while axis-parallel splits need multiple cuts to approximate the diagonal boundary.
  • Compact representation: 5 nodes vs 7 nodes—the multivariate tree is simpler and more interpretable despite using linear combinations.
  • Better generalization: By modeling the true structure (diagonal separability), the multivariate tree achieves 92% accuracy vs 87% for axis-parallel.

Univariate vs Multivariate: Trade-offs

Aspect | Univariate (Axis-Parallel) | Multivariate (Oblique)
Decision Boundary | Axis-parallel (perpendicular to feature axes) | Oblique (any angle, more flexible)
Tree Size | Larger (more splits needed) | Smaller (more expressive splits)
Training Time | Fast (O(d × n log n) per node) | Slow (optimization at each node)
Interpretability | Very high (simple rules) | Moderate (linear combinations)
Accuracy | Good for independent features | Better for correlated features
Implementation | Simple, widely available | Complex, limited library support
Best For | General use, high-dimensional data | Correlated features, low dimensions

When to Use Each Approach

Use Univariate When:

  • Interpretability is critical (business rules, regulations)
  • Features are largely independent
  • Training time is limited (large datasets)
  • Using standard libraries (scikit-learn, XGBoost)
  • High-dimensional data (>20-30 features)

Use Multivariate When:

  • Features are highly correlated (e.g., height and weight)
  • Want smaller, more compact trees
  • Low-to-moderate dimensions (2-10 features)
  • Accuracy is paramount, computational cost acceptable
  • Classes are diagonally separable in feature space

Frequently Asked Questions

Q: Are there any Python libraries that implement multivariate decision trees?

A: Standard scikit-learn only supports univariate trees. However, you can find implementations in specialized libraries: obliquetree (Python package), OC1 (Oblique Classifier 1, classic algorithm), and some research implementations on GitHub. Alternatively, you can implement LDA-based splits yourself by wrapping scikit-learn's DecisionTreeClassifier with custom splitting logic. For most applications, standard univariate trees or ensemble methods (Random Forest, XGBoost) perform well enough.

Q: If multivariate trees are better, why aren't they the default?

A: Three main reasons: (1) Computational cost—finding optimal linear combinations is much slower than testing single features, especially with many features. (2) Interpretability loss—rules like "0.73×income + 0.41×age - 0.19×spending > 42" are harder to explain than "income > 50k". (3) Diminishing returns with ensembles—modern ensemble methods (Random Forests, Gradient Boosting) combine many univariate trees to implicitly approximate oblique boundaries, achieving similar or better accuracy without the complexity.

Q: Can you combine univariate and multivariate splits in the same tree?

A: Yes! A hybrid approach can evaluate both univariate and multivariate candidates at each node, choosing whichever provides better impurity reduction. This balances expressiveness with computational cost—use oblique splits only where they significantly help. Some implementations test a few random linear combinations alongside all single-feature splits, keeping the best overall. This adds moderate computational overhead while capturing some benefits of multivariate trees.
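A sketch of such a hybrid candidate search, reusing best_threshold and random_combination_split from the earlier sketches; hybrid_split and its defaults are illustrative, not from any library:

    import numpy as np

    def hybrid_split(X, y, n_oblique=10, rng=None):
        """Pick the best of all single-feature splits and a few random oblique candidates."""
        d = X.shape[1]
        best = (None, None, np.inf)                   # (weights, threshold, weighted Gini)
        for j in range(d):                            # univariate candidates: one non-zero weight
            w = np.zeros(d)
            w[j] = 1.0
            t, score = best_threshold(X[:, j], y)     # helper from the LDA sketch above
            if score < best[2]:
                best = (w, t, score)
        w, t, score = random_combination_split(X, y, n_candidates=n_oblique, rng=rng)
        if score < best[2]:
            best = (w, t, score)
        return best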

Q: Do ensemble methods like Random Forests use multivariate splits?

A: Standard Random Forests use univariate trees. However, there are research variants called Oblique Random Forests that use multivariate splits at each node. These can outperform standard RF on certain datasets but are computationally expensive and rarely used in practice. Interestingly, ensembles of univariate trees can approximate oblique boundaries in aggregate: different trees split on different features, and the ensemble vote effectively creates non-axis-parallel boundaries at the meta-level.

Q: Should I normalize features before building multivariate trees?

A: Yes, strongly recommended! Since weights combine features, different scales affect the linear combination. If income (0-200k) and age (18-100) are on different scales, the weight on income will naturally be smaller to compensate. Normalizing (z-score or min-max) to similar ranges ensures weights reflect true feature importance, not just their scales. Methods like LDA and SVM handle this internally, but for custom implementations or gradient-based optimization, normalize first.
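For a custom implementation, a z-score step before the split search could look like this (assuming the X array built from the example table above):

    from sklearn.preprocessing import StandardScaler

    # Standardize each feature to zero mean and unit variance before searching for
    # oblique splits, so the learned weights are comparable across features.
    X_scaled = StandardScaler().fit_transform(X)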

Key Takeaways

Multivariate trees use linear combinations: w₁x₁ + w₂x₂ + ... ≤ t

Create oblique (diagonal) decision boundaries, more flexible than axis-parallel

More compact trees (fewer nodes) for same accuracy when features correlated

Finding weights is computationally expensive: LDA, SVM, or optimization algorithms

Trade-off: Better accuracy vs longer training time and lower interpretability

Best for low-dimensional data with correlated features

Univariate trees remain default due to speed, interpretability, ensemble success

Limited library support; mostly research/specialized implementations