Combination Strategies

Learn how to combine individual learners effectively: from simple voting and averaging to advanced learning-based methods like Stacking

How to Combine Learners?

Core Concept

Once you have T individual learners h₁, h₂, ..., hₜ, the next critical step is deciding how to combine their predictions. The combination strategy significantly affects ensemble performance. There are three main approaches:

Averaging

For regression tasks. Simple or weighted average of predictions.

Voting

For classification tasks. Majority vote or weighted vote.

Stacking

Learning-based. Train a meta-learner to combine predictions.

Averaging Methods (Regression)

For regression tasks, we combine continuous predictions using averaging:

1. Simple Averaging

The simplest method: take the arithmetic mean of all predictions.

H(x) = (1/T) Σₜ hₜ(x)

All learners contribute equally to the final prediction

Example:

If 5 trees predict house prices: [$280k, $290k, $285k, $275k, $290k], then H(x) = ($280k + $290k + $285k + $275k + $290k) / 5 = $284k
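A minimal sketch of simple averaging in Python, reusing the house-price numbers from the example above (NumPy is assumed to be available):

```python
import numpy as np

# Predictions from T = 5 individual regressors (house prices in $1000s)
predictions = np.array([280, 290, 285, 275, 290])

# Simple averaging: H(x) = (1/T) * sum_t h_t(x)
H_x = predictions.mean()
print(H_x)  # 284.0
```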

2. Weighted Averaging

Give more weight to better-performing learners:

H(x) = Σₜ wₜ hₜ(x)

Where wₜ ≥ 0 and Σₜ wₜ = 1

Weights are typically based on individual learner performance (e.g., validation error)

Example:

If 3 models predict with weights [0.4, 0.3, 0.3] and predictions [$280k, $290k, $275k]:

H(x) = 0.4 × $280k + 0.3 × $290k + 0.3 × $275k = $281.5k

Note: Weighted averaging is not always better than simple averaging. If weight estimates are noisy, simple averaging can be more robust.
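A matching sketch for weighted averaging, reusing the illustrative weights and predictions from the example above:

```python
import numpy as np

predictions = np.array([280, 290, 275])   # h_t(x) in $1000s
weights = np.array([0.4, 0.3, 0.3])       # w_t >= 0 and sum to 1

# Weighted averaging: H(x) = sum_t w_t * h_t(x)
H_x = np.dot(weights, predictions)
print(H_x)  # 281.5
```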

When to Use Weighted vs Simple Averaging?

Use Simple Averaging when:

  • Learners have similar performance
  • Weight estimation is unreliable
  • You want simplicity and robustness

Use Weighted Averaging when:

  • Learners have clearly different performance
  • You have reliable performance estimates
  • Some learners specialize in certain cases

Voting Methods (Classification)

For classification tasks, we combine discrete predictions using voting:

1. Relative Majority Voting (Plurality)

Output the class that receives the most votes. This is the most common voting method.

H(x) = argmax_c Σₜ [hₜ(x) = c]

Where [hₜ(x) = c] is 1 if hₜ predicts class c, 0 otherwise

Example:

5 models predict: [Positive, Positive, Negative, Positive, Negative]

Votes: Positive = 3, Negative = 2 → H(x) = Positive
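A minimal sketch of relative majority (plurality) voting, assuming class predictions are given as strings:

```python
from collections import Counter

predictions = ["Positive", "Positive", "Negative", "Positive", "Negative"]

# Plurality vote: the class with the most votes wins
# (ties are broken arbitrarily in this simple sketch)
votes = Counter(predictions)
H_x = votes.most_common(1)[0][0]
print(H_x)  # Positive
```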

2. Absolute Majority Voting

Output a class only if it receives more than half the votes. Otherwise, reject (no prediction).

H(x) = c if Σₜ [hₜ(x) = c] > T/2, else REJECT

More conservative: only predicts when there's strong consensus

Example:

With 5 models: [Positive, Positive, Negative, Positive, Negative]

Positive = 3 votes (60% > 50%) → H(x) = Positive

If instead 4 models voted [Positive, Positive, Negative, Negative]:

Positive = 2, Negative = 2 (neither exceeds 50%) → H(x) = REJECT
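A sketch of absolute majority voting with rejection; the "REJECT" string is just a sentinel used for illustration:

```python
from collections import Counter

def absolute_majority_vote(predictions):
    """Return the class with more than half the votes, or 'REJECT' if none has one."""
    votes = Counter(predictions)
    top_class, top_count = votes.most_common(1)[0]
    if top_count > len(predictions) / 2:
        return top_class
    return "REJECT"

# 3 of 5 votes (60%) exceed the 50% threshold -> Positive
print(absolute_majority_vote(["Positive", "Positive", "Negative", "Positive", "Negative"]))
# 2 vs 2 votes -> neither exceeds 50% -> REJECT
print(absolute_majority_vote(["Positive", "Positive", "Negative", "Negative"]))
```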

3. Weighted Voting

Each learner's vote is weighted by its performance. Better learners have more influence.

H(x) = argmax_c Σₜ wₜ [hₜ(x) = c]

Where wₜ is the weight of learner hₜ (typically based on accuracy or error rate)

Example:

3 models with weights [0.4, 0.3, 0.3] predict [Positive, Positive, Negative]:

Weighted votes: Positive = 0.4 + 0.3 = 0.7, Negative = 0.3

H(x) = Positive (0.7 > 0.3)
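A sketch of weighted voting with the weights from the example above:

```python
from collections import defaultdict

predictions = ["Positive", "Positive", "Negative"]
weights = [0.4, 0.3, 0.3]

# Accumulate each learner's weight under the class it predicts
scores = defaultdict(float)
for pred, w in zip(predictions, weights):
    scores[pred] += w

# Output the class with the largest weighted score
H_x = max(scores, key=scores.get)
print(H_x)  # Positive (weighted scores: Positive = 0.7, Negative = 0.3)
```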

Stacking (Learning-Based Combination)

Stacking (also called Stacked Generalization) uses a meta-learner to learn how best to combine the predictions of the first-level learners:

Stacking Algorithm

  1. Train first-level learners: Train T base learners h₁, h₂, ..., hₜ on training data D
  2. Generate new dataset: For each sample (xᵢ, yᵢ) in D, create a new feature vector:
    • Features: [h₁(xᵢ), h₂(xᵢ), ..., hₜ(xᵢ)] (the predictions of the T learners)
    • Label: yᵢ (the original label)
  3. Train meta-learner: Train a second-level learner (the meta-learner) on the new dataset
  4. Final prediction: For a new sample x, first collect the predictions of the T learners, then use the meta-learner to combine them
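A minimal sketch of these four steps, using scikit-learn regressors as hypothetical base learners on toy data. For simplicity the meta-features here are generated on the training data itself; see the FAQ below for the cross-validated variant that avoids leakage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Toy training data (X: features, y: targets) -- purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Step 1: train T first-level learners on D
base_learners = [LinearRegression(),
                 DecisionTreeRegressor(max_depth=4),
                 KNeighborsRegressor(n_neighbors=5)]
for h in base_learners:
    h.fit(X, y)

# Step 2: build the new dataset -- one column of predictions per learner
Z = np.column_stack([h.predict(X) for h in base_learners])

# Step 3: train the meta-learner on (Z, y)
meta = LinearRegression()
meta.fit(Z, y)

# Step 4: for a new sample, get base predictions first, then combine with the meta-learner
x_new = rng.normal(size=(1, 5))
z_new = np.column_stack([h.predict(x_new) for h in base_learners])
print(meta.predict(z_new))
```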

Advantages

  • Learns optimal combination: Meta-learner discovers best way to combine predictions
  • Handles non-linear combinations: Can learn complex relationships between learners
  • Often outperforms voting/averaging: More flexible than fixed combination rules

Challenges

  • Overfitting risk: Meta-learner may overfit to first-level predictions
  • More complex: Requires training two levels of learners
  • Data splitting: Need to use cross-validation to avoid data leakage

Common Meta-Learners

  • Linear Regression (MLR): Simple, interpretable, often works well
  • Logistic Regression: For classification tasks
  • Neural Networks: Can learn complex non-linear combinations
  • Decision Trees: Interpretable meta-learner

Example: Medical Diagnosis Ensemble Voting

A medical diagnosis system uses 3 specialized models with weighted voting:

ID | Age | BMI  | Glucose | Blood Pressure | Symptoms | Diagnosis
1  | 45  | 28.5 | 95      | 130            | Mild     | Negative
2  | 62  | 32.1 | 145     | 155            | Moderate | Positive
3  | 38  | 24.8 | 88      | 120            | None     | Negative
4  | 55  | 29.7 | 132     | 142            | Moderate | Positive
5  | 41  | 26.3 | 102     | 128            | Mild     | Negative

Ensemble Setup:

  • Model 1 (Lab-based): Weight 0.4, specializes in glucose and blood pressure
  • Model 2 (Symptom-based): Weight 0.3, focuses on symptom severity
  • Model 3 (Risk-factor): Weight 0.3, analyzes BMI and demographics

Voting Example (Patient ID 2):

Model 1 (Lab-based): Positive (weight: 0.4)

Model 2 (Symptom-based): Positive (weight: 0.3)

Model 3 (Risk-factor): Positive (weight: 0.3)

Weighted vote result: Positive = 0.4 + 0.3 + 0.3 = 1.0 (unanimous)

All models agree, so the ensemble has very high confidence

Example: Stock Price Prediction with Stacking

A financial system uses Stacking to combine predictions from 3 different models:

ID | Volume    | Volatility | Earnings | Price
1  | 1,500,000 | 0.15       | $2.5     | $125.5
2  | 2,300,000 | 0.22       | $1.8     | $98.3
3  | 1,800,000 | 0.18       | $3.2     | $145.8
4  | 2,100,000 | 0.25       | $1.5     | $87.2
5  | 1,650,000 | 0.19       | $2.8     | $132.4

Stacking Setup:

  • First-level learners: Linear Regression, Random Forest, Neural Network
  • Meta-learner: Linear Regression (MLR)
  • Process: Train first-level learners, generate predictions, train meta-learner on predictions

Example Prediction (Stock ID 1):

Linear Regression prediction: $124.20

Random Forest prediction: $126.80

Neural Network prediction: $125.10

Meta-learner input: [124.20, 126.80, 125.10]

Meta-learner output (final prediction): $125.77

Here the meta-learner has learned combination weights of 0.2, 0.5, and 0.3, so the final prediction is 0.2 × $124.20 + 0.5 × $126.80 + 0.3 × $125.10 = $125.77 (in general the fitted linear model may also include an intercept).
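A hedged sketch of this setup using scikit-learn's StackingRegressor. The tiny feature matrix mirrors the table above and is purely illustrative, so the printed value will not match the walkthrough numbers.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Toy feature matrix: [volume, volatility, earnings]; target: price
X = np.array([
    [1_500_000, 0.15, 2.5],
    [2_300_000, 0.22, 1.8],
    [1_800_000, 0.18, 3.2],
    [2_100_000, 0.25, 1.5],
    [1_650_000, 0.19, 2.8],
])
y = np.array([125.5, 98.3, 145.8, 87.2, 132.4])

# First-level learners: Linear Regression, Random Forest, Neural Network
estimators = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("nn", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
]

# Meta-learner: Linear Regression, trained on cross-validated base predictions
stack = StackingRegressor(estimators=estimators,
                          final_estimator=LinearRegression(),
                          cv=2)
stack.fit(X, y)
print(stack.predict(X[:1]))  # combined prediction for stock ID 1
```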

Strategy Selection Guidelines

When to Use Each Strategy

Use Averaging/Voting when:

  • Learners are similar in performance
  • You want simplicity and interpretability
  • Computational efficiency is important

Use Stacking when:

  • Learners have complementary strengths
  • You have sufficient data for meta-learner training
  • Maximum performance is the priority

Frequently Asked Questions

Q: Is weighted averaging always better than simple averaging?

A: No. If weight estimates are unreliable or noisy, simple averaging can be more robust. Weighted averaging helps when you have confident, accurate performance estimates and clear differences between learners.

Q: When should I use absolute majority voting vs relative majority voting?

A: Use absolute majority when you need high confidence and can afford to reject uncertain cases (e.g., medical diagnosis). Use relative majority when you always need a prediction (e.g., recommendation systems).

Q: How do I prevent overfitting in Stacking?

A: Use cross-validation to generate the first-level predictions: for each fold, train the first-level learners on the remaining folds and predict on the held-out fold. This way the meta-learner is trained on "unseen" predictions, preventing data leakage.
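A sketch of this scheme using scikit-learn's cross_val_predict, which returns out-of-fold predictions for every training sample (the data and learner choices are illustrative):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data -- purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.1, size=200)

base_learners = [LinearRegression(), DecisionTreeRegressor(max_depth=4)]

# Out-of-fold predictions: each sample is predicted by a model that never saw it
Z = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in base_learners])

# The meta-learner trains on these "unseen" predictions -> no leakage
meta = LinearRegression().fit(Z, y)

# At prediction time, base learners are refit on all of D before being used
for h in base_learners:
    h.fit(X, y)
x_new = rng.normal(size=(1, 4))
print(meta.predict(np.column_stack([h.predict(x_new) for h in base_learners])))
```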

Q: Can I combine different combination strategies?

A: Yes. For example, you can use Stacking where the meta-learner itself uses weighted voting, or apply different strategies to different subsets of learners. Such mixed approaches (sometimes loosely called "blending") can further improve performance.