
Generative Methods

Master generative model-based semi-supervised learning. Learn how Gaussian Mixture Models leverage both labeled and unlabeled samples through the EM algorithm to improve classification performance.

Module 2 of 6
Intermediate to Advanced
100-130 min

Core Idea

Generative methods assume that all samples (both labeled D_l and unlabeled D_u) are generated by a generative model (e.g., a Gaussian Mixture Model or Naive Bayes). Each class corresponds to a "generative component" of the model. By maximizing the joint likelihood of labeled and unlabeled samples, we estimate the model parameters and perform classification.

Key Insight

Instead of only using labeled samples to estimate parameters, we use both labeled and unlabeled samples to maximize the joint likelihood. This allows the model to better capture the true data distribution, especially when labeled samples are limited.

Advantages

  • Solid theoretical foundation
  • Strong interpretability
  • Can leverage unlabeled data
  • Probabilistic framework

Limitations

  • Sensitive to model assumptions
  • May degrade if assumptions fail
  • Requires matching distribution
  • Computationally intensive

Gaussian Mixture Model (GMM)

The most common generative model for semi-supervised learning. We assume samples are generated from a mixture of N Gaussian components, where each component corresponds to one class.

Data Generation Assumption

A sample x is generated from a mixture of N Gaussian components:

p(x) = \sum_{i=1}^{N} \alpha_i \cdot p(x|\mu_i, \Sigma_i)

Where:

  • \alpha_i = weight of the i-th Gaussian component (mixing coefficient), satisfying \sum_{i=1}^N \alpha_i = 1
  • p(x|\mu_i, \Sigma_i) = probability density of the i-th Gaussian component
  • \mu_i = mean vector of the i-th component
  • \Sigma_i = covariance matrix of the i-th component
  • Each Gaussian component corresponds to one class (e.g., component 1 = "cat", component 2 = "dog")

Example: Two-Class Problem

For a binary classification problem (e.g., spam vs ham emails):

  • Component 1 (α₁, μ₁, Σ₁): Generates "Spam" emails
  • Component 2 (α₂, μ₂, Σ₂): Generates "Ham" emails
  • \alpha_1 + \alpha_2 = 1 (weights sum to 1); the mixture density is sketched in code below
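
The following is a minimal numerical sketch of this mixture density for a two-component case. The 2-D feature space and all parameter values are illustrative assumptions, not taken from any dataset in this module.

```python
# A minimal sketch of the two-component mixture density p(x) described above.
# The 2-D feature space and parameter values are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

alphas = np.array([0.4, 0.6])                          # mixing coefficients (sum to 1)
means = [np.array([1.0, 2.0]), np.array([4.0, 5.0])]   # component means
covs = [np.eye(2), 2.0 * np.eye(2)]                    # component covariances

def mixture_density(x, alphas, means, covs):
    """p(x) = sum_i alpha_i * N(x | mu_i, Sigma_i)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
               for a, m, S in zip(alphas, means, covs))

print(mixture_density(np.array([2.0, 3.0]), alphas, means, covs))
```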

Parameter Estimation: Maximum Likelihood + EM Algorithm

Our goal is to maximize the joint log-likelihood of both labeled and unlabeled samples. We use the EM algorithm to iteratively solve for model parameters.

Objective Function

\max_{\{\alpha_i, \mu_i, \Sigma_i\}} \sum_{(x_j, y_j) \in D_l} \ln p(x_j, y_j) + \sum_{x_j \in D_u} \ln p(x_j)

We maximize the sum of log-likelihoods: labeled samples contribute \ln p(x_j, y_j) (known class), while unlabeled samples contribute \ln p(x_j) (unknown class).
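
As a rough sketch, the objective can be evaluated as below, assuming each component i corresponds to class i (as stated earlier). The variable names X_l, y_l, X_u and the parameter lists are assumptions for illustration.

```python
# Sketch of the joint log-likelihood objective, assuming component i <-> class i.
# X_l: (l, d) labeled features, y_l: (l,) integer labels, X_u: (u, d) unlabeled features.
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(X_l, y_l, X_u, alphas, means, covs):
    # Labeled part: ln p(x_j, y_j) = ln[alpha_{y_j} * p(x_j | mu_{y_j}, Sigma_{y_j})]
    ll = sum(np.log(alphas[y] * multivariate_normal.pdf(x, mean=means[y], cov=covs[y]))
             for x, y in zip(X_l, y_l))
    # Unlabeled part: ln p(x_j), marginalized over all N components
    for x in X_u:
        p_x = sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
                  for a, m, S in zip(alphas, means, covs))
        ll += np.log(p_x)
    return ll
```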

E-Step

Expectation Step

Given current parameters, compute the posterior probability that each unlabeled sample belongs to each Gaussian component (soft assignment):

\gamma_{ji} = \frac{\alpha_i \cdot p(x_j|\mu_i, \Sigma_i)}{\sum_{k=1}^N \alpha_k \cdot p(x_j|\mu_k, \Sigma_k)}

Where:

  • \gamma_{ji} = probability that unlabeled sample x_j \in D_u belongs to component i
  • This is the "soft label" for unlabeled samples
  • \sum_{i=1}^N \gamma_{ji} = 1 (probabilities sum to 1); see the sketch after this list
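
A minimal E-step sketch, using the same assumed variable names as above (X_u is a (u, d) array of unlabeled features; alphas, means, covs hold the current parameters):

```python
# E-step sketch: responsibilities gamma[j, i] = gamma_{ji} for the unlabeled samples.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X_u, alphas, means, covs):
    """Return a (u, N) matrix of responsibilities whose rows sum to 1."""
    weighted = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X_u, mean=means[i], cov=covs[i])
        for i in range(len(alphas))
    ])                                                       # alpha_i * p(x_j | mu_i, Sigma_i)
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize over components
```
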
M-Step

Maximization Step

Update parameters using both labeled samples (hard assignment) and unlabeled samples (soft assignment from E-step):

Mean Vector \mu_i:

\mu_i = \frac{1}{\sum_{x_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{x_j \in D_u} \gamma_{ji} x_j + \sum_{(x_j, y_j) \in D_l, y_j = i} x_j \right)

Where l_i = the number of labeled samples in class i. The update combines the weighted unlabeled samples (soft contribution) with the labeled samples of class i (hard contribution).

Mixing Coefficient \alpha_i:

\alpha_i = \frac{\sum_{x_j \in D_u} \gamma_{ji} + l_i}{u + l}

The proportion of samples assigned to component i (with unlabeled samples weighted by \gamma_{ji}), where l and u are the total numbers of labeled and unlabeled samples.

Covariance Matrix \Sigma_i:

\Sigma_i = \frac{1}{\sum_{x_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{x_j \in D_u} \gamma_{ji} (x_j - \mu_i)(x_j - \mu_i)^T + \sum_{(x_j, y_j) \in D_l, y_j = i} (x_j - \mu_i)(x_j - \mu_i)^T \right)

Weighted covariance combining soft assignments from unlabeled samples and hard assignments from labeled samples.
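
The three updates can be implemented together. The following is a sketch under the same assumed variable names, where gamma is the responsibility matrix from the E-step sketch above.

```python
# M-step sketch: update (alpha_i, mu_i, Sigma_i) from hard labeled assignments
# and soft unlabeled assignments gamma (shape (u, N)).
import numpy as np

def m_step(X_l, y_l, X_u, gamma):
    N = gamma.shape[1]                  # number of components / classes
    l, u = len(X_l), len(X_u)
    alphas, means, covs = np.empty(N), [], []
    for i in range(N):
        X_li = X_l[y_l == i]            # labeled samples of class i (l_i of them)
        g_i = gamma[:, i]               # soft weights of unlabeled samples for component i
        denom = g_i.sum() + len(X_li)   # sum_j gamma_{ji} + l_i
        # Mean: weighted unlabeled contribution + labeled contribution
        mu_i = (g_i @ X_u + X_li.sum(axis=0)) / denom
        # Covariance: weighted outer products around the new mean
        diff_u, diff_l = X_u - mu_i, X_li - mu_i
        Sigma_i = (diff_u.T @ (g_i[:, None] * diff_u) + diff_l.T @ diff_l) / denom
        alphas[i] = denom / (u + l)
        means.append(mu_i)
        covs.append(Sigma_i)
    return alphas, means, covs
```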

Algorithm Flow

  1. Initialize parameters (αᵢ, μᵢ, Σᵢ) using labeled samples only
  2. E-step: Compute γⱼᵢ for all unlabeled samples
  3. M-step: Update parameters using both labeled and unlabeled samples
  4. Repeat steps 2-3 until convergence (parameters stabilize); see the combined sketch below
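
A combined sketch of this loop, reusing the e_step and m_step sketches above. Checking convergence on the means and adding a small ridge to the initial covariances are implementation choices, not part of the original text.

```python
# Full EM loop sketch for the semi-supervised GMM, reusing e_step and m_step above.
import numpy as np

def fit_semi_supervised_gmm(X_l, y_l, X_u, n_components, max_iter=100, tol=1e-4):
    # 1. Initialize each component from its labeled samples only
    alphas = np.full(n_components, 1.0 / n_components)
    means = [X_l[y_l == i].mean(axis=0) for i in range(n_components)]
    covs = [np.cov(X_l[y_l == i].T) + 1e-6 * np.eye(X_l.shape[1])  # small ridge for stability
            for i in range(n_components)]
    prev = None
    for _ in range(max_iter):
        gamma = e_step(X_u, alphas, means, covs)            # 2. E-step
        alphas, means, covs = m_step(X_l, y_l, X_u, gamma)  # 3. M-step
        # 4. Stop once the means stabilize
        flat = np.concatenate([m.ravel() for m in means])
        if prev is not None and np.max(np.abs(flat - prev)) < tol:
            break
        prev = flat
    return alphas, means, covs
```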

Classification Rule

For a new sample x, predict its class by maximizing the posterior probability:

f(x) = \underset{j \in \mathcal{Y}}{\arg\max} \sum_{i=1}^N p(y=j|\Theta=i, x) \cdot p(\Theta=i|x)

Where:

  • \Theta = i means the sample was generated by component i
  • p(\Theta=i|x) = posterior probability that x came from component i (computed the same way as γⱼᵢ in the E-step)
  • p(y=j|\Theta=i, x) = probability that component i corresponds to class j
  • For labeled samples this correspondence is known; for unlabeled samples it is estimated from the model (see the sketch below)
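
Because each Gaussian component corresponds to exactly one class in this setup, p(y=j|\Theta=i, x) is 1 when j = i and 0 otherwise, so the rule reduces to picking the component with the largest posterior. A sketch using the earlier variable names:

```python
# Classification sketch: assign each sample to the component (= class) with the
# highest posterior p(Theta = i | x), computed exactly like gamma in the E-step.
import numpy as np
from scipy.stats import multivariate_normal

def predict(X, alphas, means, covs):
    weighted = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
        for i in range(len(alphas))
    ])
    posterior = weighted / weighted.sum(axis=1, keepdims=True)   # p(Theta = i | x)
    return posterior.argmax(axis=1)                              # predicted class index
```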

Customer Segmentation Example

Apply GMM-based semi-supervised learning to segment 200 e-commerce customers into "Budget" and "Affluent" segments. Only 50 customers are labeled, while 150 remain unlabeled.

Customer Dataset: Mixed Labeled and Unlabeled

| ID | Age | Annual Income | Annual Spending | Label | Type |
| --- | --- | --- | --- | --- | --- |
| 1 | 28 | $45,000 | $3,200 | Budget | labeled |
| 2 | 45 | $85,000 | $8,500 | Affluent | labeled |
| 3 | 22 | $28,000 | $1,200 | Budget | labeled |
| 4 | 52 | $120,000 | $12,000 | Affluent | labeled |
| 5 | 35 | $65,000 | $4,800 | Budget | labeled |
| 6 | 31 | $58,000 | $4,200 | Unknown | unlabeled |
| 7 | 48 | $95,000 | $9,800 | Unknown | unlabeled |
| 8 | 29 | $42,000 | $2,800 | Unknown | unlabeled |
| 9 | 55 | $110,000 | $11,500 | Unknown | unlabeled |
| 10 | 26 | $35,000 | $2,100 | Unknown | unlabeled |

Dataset: 200 customers total (50 labeled: 25 Budget, 25 Affluent; 150 unlabeled). Features: Age, Annual Income, Annual Spending.

GMM Training Process

Initialization:

Use 50 labeled samples to initialize 2 Gaussian components:

  • Component 1 (Budget): μ₁ = mean of the 25 Budget customers, Σ₁ = their covariance
  • Component 2 (Affluent): μ₂ = mean of the 25 Affluent customers, Σ₂ = their covariance
  • α₁ = α₂ = 0.5 (equal weights initially); see the sketch below
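
As a small sketch, the initialization step might look like the following. For brevity it uses only the five labeled rows shown in the table above (the full dataset has 25 labeled customers per class), with 0 = Budget and 1 = Affluent as assumed class indices.

```python
# Initialization sketch using only labeled customers (Age, Annual Income, Annual Spending).
# Only the five labeled rows from the table above are used here for illustration.
import numpy as np

X_l = np.array([[28,  45_000,  3_200],
                [45,  85_000,  8_500],
                [22,  28_000,  1_200],
                [52, 120_000, 12_000],
                [35,  65_000,  4_800]], dtype=float)
y_l = np.array([0, 1, 0, 1, 0])                          # 0 = Budget, 1 = Affluent

means = [X_l[y_l == i].mean(axis=0) for i in range(2)]   # mu_1, mu_2
covs = [np.cov(X_l[y_l == i].T) for i in range(2)]       # Sigma_1, Sigma_2
alphas = np.array([0.5, 0.5])                            # equal initial weights
```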

E-Step (Iteration 1):

Compute soft assignments for 150 unlabeled customers:

  • Customer 6: γ₆₁ = 0.72 (72% Budget), γ₆₂ = 0.28 (28% Affluent)
  • Customer 7: γ₇₁ = 0.15 (15% Budget), γ₇₂ = 0.85 (85% Affluent)
  • Customer 8: γ₈₁ = 0.88 (88% Budget), γ₈₂ = 0.12 (12% Affluent)

M-Step (Iteration 1):

Update parameters using weighted contributions:

  • μ₁ = weighted average (25 hard Budget + 150 soft Budget contributions)
  • μ₂ = weighted average (25 hard Affluent + 150 soft Affluent contributions)
  • α₁, α₂, Σ₁, Σ₂ updated similarly

Convergence (After 8 iterations):

Parameters stabilize. Final model achieves 92% accuracy on test set (vs 78% using labeled samples only).

Advantages and Limitations

Advantages

  • Solid theoretical foundation: Well-established probabilistic framework
  • Strong interpretability: Each component has clear meaning (class)
  • Can leverage unlabeled data: Improves parameter estimation
  • Probabilistic outputs: Provides confidence scores, not just hard labels

Limitations

  • Sensitive to model assumptions: Requires the data to match the assumed distribution
  • May degrade performance: If assumptions fail, unlabeled data can hurt
  • Requires matching distribution: GMM assumes Gaussian; may not fit all data
  • Computationally intensive: EM algorithm can be slow for large datasets

Common Extensions

  • Mixture of Experts: Multiple specialized models instead of single GMM
  • Naive Bayes: For discrete/categorical data instead of continuous
  • Variational EM: Faster approximation for large-scale problems