
Generative Methods

Master generative model-based semi-supervised learning. Learn how Gaussian Mixture Models leverage both labeled and unlabeled samples through the EM algorithm to improve classification performance.

Module 2 of 6
Intermediate to Advanced
100-130 min

Core Idea

Generative methods assume that all samples (both labeled D_l and unlabeled D_u) are generated by a generative model (e.g., a Gaussian Mixture Model or Naive Bayes). Each class corresponds to a "generative component" of the model. By maximizing the joint likelihood of labeled and unlabeled samples, we estimate the model parameters and perform classification.

Key Insight

Instead of only using labeled samples to estimate parameters, we use both labeled and unlabeled samples to maximize the joint likelihood. This allows the model to better capture the true data distribution, especially when labeled samples are limited.

Advantages

  • Solid theoretical foundation
  • Strong interpretability
  • Can leverage unlabeled data
  • Probabilistic framework

Limitations

  • Sensitive to model assumptions
  • May degrade if assumptions fail
  • Requires matching distribution
  • Computationally intensive

Gaussian Mixture Model (GMM)

The most common generative model for semi-supervised learning. We assume samples are generated from a mixture of N Gaussian components, where each component corresponds to one class.

Data Generation Assumption

A sample x is generated from a mixture of N Gaussian components:

p(x) = \sum_{i=1}^{N} \alpha_i \cdot p(x|\mu_i, \Sigma_i)

Where:

  • \alpha_i = weight of the i-th Gaussian component (mixing coefficient), satisfying \sum_{i=1}^N \alpha_i = 1
  • p(x|\mu_i, \Sigma_i) = probability density of the i-th Gaussian component
  • \mu_i = mean vector of the i-th component
  • \Sigma_i = covariance matrix of the i-th component
  • Each Gaussian component corresponds to one class (e.g., component 1 = "cat", component 2 = "dog")

Example: Two-Class Problem

For a binary classification problem (e.g., spam vs ham emails):

  • Component 1 (α₁, μ₁, Σ₁): Generates "Spam" emails
  • Component 2 (α₂, μ₂, Σ₂): Generates "Ham" emails
  • \alpha_1 + \alpha_2 = 1 (weights sum to 1); the mixture density is sketched in code below
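
The following is a minimal numerical sketch of this mixture density for a two-component case. The 2-D feature space and all parameter values are illustrative assumptions, not taken from any dataset in this module.

```python
# A minimal sketch of the two-component mixture density p(x) described above.
# The 2-D feature space and parameter values are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

alphas = np.array([0.4, 0.6])                          # mixing coefficients (sum to 1)
means = [np.array([1.0, 2.0]), np.array([4.0, 5.0])]   # component means
covs = [np.eye(2), 2.0 * np.eye(2)]                    # component covariances

def mixture_density(x, alphas, means, covs):
    """p(x) = sum_i alpha_i * N(x | mu_i, Sigma_i)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
               for a, m, S in zip(alphas, means, covs))

print(mixture_density(np.array([2.0, 3.0]), alphas, means, covs))
```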

Parameter Estimation: Maximum Likelihood + EM Algorithm

Our goal is to maximize the joint log-likelihood of both labeled and unlabeled samples. We use the EM algorithm to iteratively solve for model parameters.

Objective Function

\max_{\{\alpha_i, \mu_i, \Sigma_i\}} \sum_{(x_j, y_j) \in D_l} \ln p(x_j, y_j) + \sum_{x_j \in D_u} \ln p(x_j)

We maximize the sum of log-likelihoods: labeled samples contribute \ln p(x_j, y_j) (known class), while unlabeled samples contribute \ln p(x_j) (unknown class).
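
As a rough sketch, the objective can be evaluated as below, assuming each component i corresponds to class i (as stated earlier). The variable names X_l, y_l, X_u and the parameter lists are assumptions for illustration.

```python
# Sketch of the joint log-likelihood objective, assuming component i <-> class i.
# X_l: (l, d) labeled features, y_l: (l,) integer labels, X_u: (u, d) unlabeled features.
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(X_l, y_l, X_u, alphas, means, covs):
    # Labeled part: ln p(x_j, y_j) = ln[alpha_{y_j} * p(x_j | mu_{y_j}, Sigma_{y_j})]
    ll = sum(np.log(alphas[y] * multivariate_normal.pdf(x, mean=means[y], cov=covs[y]))
             for x, y in zip(X_l, y_l))
    # Unlabeled part: ln p(x_j), marginalized over all N components
    for x in X_u:
        p_x = sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
                  for a, m, S in zip(alphas, means, covs))
        ll += np.log(p_x)
    return ll
```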

E-Step

Expectation Step

Given current parameters, compute the posterior probability that each unlabeled sample belongs to each Gaussian component (soft assignment):

\gamma_{ji} = \frac{\alpha_i \cdot p(x_j|\mu_i, \Sigma_i)}{\sum_{k=1}^N \alpha_k \cdot p(x_j|\mu_k, \Sigma_k)}

Where:

  • \gamma_{ji} = probability that unlabeled sample x_j \in D_u belongs to component i
  • This is the "soft label" for unlabeled samples
  • \sum_{i=1}^N \gamma_{ji} = 1 (probabilities sum to 1); see the sketch after this list
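
A minimal E-step sketch, using the same assumed variable names as above (X_u is a (u, d) array of unlabeled features; alphas, means, covs hold the current parameters):

```python
# E-step sketch: responsibilities gamma[j, i] = gamma_{ji} for the unlabeled samples.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X_u, alphas, means, covs):
    """Return a (u, N) matrix of responsibilities whose rows sum to 1."""
    weighted = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X_u, mean=means[i], cov=covs[i])
        for i in range(len(alphas))
    ])                                                       # alpha_i * p(x_j | mu_i, Sigma_i)
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize over components
```
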
M-Step

Maximization Step

Update parameters using both labeled samples (hard assignment) and unlabeled samples (soft assignment from E-step):

Mean Vector \mu_i:

\mu_i = \frac{1}{\sum_{x_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{x_j \in D_u} \gamma_{ji} x_j + \sum_{(x_j, y_j) \in D_l, y_j = i} x_j \right)

Where l_i = the number of labeled samples in class i. The update combines the weighted unlabeled samples (soft contribution) with the labeled samples of class i (hard contribution).

Mixing Coefficient \alpha_i:

\alpha_i = \frac{\sum_{x_j \in D_u} \gamma_{ji} + l_i}{u + l}

The proportion of samples assigned to component i (with unlabeled samples weighted by \gamma_{ji}), where l and u are the total numbers of labeled and unlabeled samples.

Covariance Matrix \Sigma_i:

\Sigma_i = \frac{1}{\sum_{x_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{x_j \in D_u} \gamma_{ji} (x_j - \mu_i)(x_j - \mu_i)^T + \sum_{(x_j, y_j) \in D_l, y_j = i} (x_j - \mu_i)(x_j - \mu_i)^T \right)

Weighted covariance combining soft assignments from unlabeled samples and hard assignments from labeled samples.
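
The three updates can be implemented together. The following is a sketch under the same assumed variable names, where gamma is the responsibility matrix from the E-step sketch above.

```python
# M-step sketch: update (alpha_i, mu_i, Sigma_i) from hard labeled assignments
# and soft unlabeled assignments gamma (shape (u, N)).
import numpy as np

def m_step(X_l, y_l, X_u, gamma):
    N = gamma.shape[1]                  # number of components / classes
    l, u = len(X_l), len(X_u)
    alphas, means, covs = np.empty(N), [], []
    for i in range(N):
        X_li = X_l[y_l == i]            # labeled samples of class i (l_i of them)
        g_i = gamma[:, i]               # soft weights of unlabeled samples for component i
        denom = g_i.sum() + len(X_li)   # sum_j gamma_{ji} + l_i
        # Mean: weighted unlabeled contribution + labeled contribution
        mu_i = (g_i @ X_u + X_li.sum(axis=0)) / denom
        # Covariance: weighted outer products around the new mean
        diff_u, diff_l = X_u - mu_i, X_li - mu_i
        Sigma_i = (diff_u.T @ (g_i[:, None] * diff_u) + diff_l.T @ diff_l) / denom
        alphas[i] = denom / (u + l)
        means.append(mu_i)
        covs.append(Sigma_i)
    return alphas, means, covs
```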

Algorithm Flow

  1. Initialize parameters (αᵢ, μᵢ, Σᵢ) using labeled samples only
  2. E-step: Compute γⱼᵢ for all unlabeled samples
  3. M-step: Update parameters using both labeled and unlabeled samples
  4. Repeat steps 2-3 until convergence (parameters stabilize); see the combined sketch below
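
A combined sketch of this loop, reusing the e_step and m_step sketches above. Checking convergence on the means and adding a small ridge to the initial covariances are implementation choices, not part of the original text.

```python
# Full EM loop sketch for the semi-supervised GMM, reusing e_step and m_step above.
import numpy as np

def fit_semi_supervised_gmm(X_l, y_l, X_u, n_components, max_iter=100, tol=1e-4):
    # 1. Initialize each component from its labeled samples only
    alphas = np.full(n_components, 1.0 / n_components)
    means = [X_l[y_l == i].mean(axis=0) for i in range(n_components)]
    covs = [np.cov(X_l[y_l == i].T) + 1e-6 * np.eye(X_l.shape[1])  # small ridge for stability
            for i in range(n_components)]
    prev = None
    for _ in range(max_iter):
        gamma = e_step(X_u, alphas, means, covs)            # 2. E-step
        alphas, means, covs = m_step(X_l, y_l, X_u, gamma)  # 3. M-step
        # 4. Stop once the means stabilize
        flat = np.concatenate([m.ravel() for m in means])
        if prev is not None and np.max(np.abs(flat - prev)) < tol:
            break
        prev = flat
    return alphas, means, covs
```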

Classification Rule

For a new sample x, predict its class by maximizing the posterior probability:

f(x) = \underset{j \in \mathcal{Y}}{\arg\max} \sum_{i=1}^N p(y=j|\Theta=i, x) \cdot p(\Theta=i|x)

Where:

  • \Theta = i means the sample was generated by component i
  • p(\Theta=i|x) = posterior probability that x came from component i (computed the same way as γⱼᵢ in the E-step)
  • p(y=j|\Theta=i, x) = probability that component i corresponds to class j
  • For labeled samples this correspondence is known; for unlabeled samples it is estimated from the model (see the sketch below)
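
Because each Gaussian component corresponds to exactly one class in this setup, p(y=j|\Theta=i, x) is 1 when j = i and 0 otherwise, so the rule reduces to picking the component with the largest posterior. A sketch using the earlier variable names:

```python
# Classification sketch: assign each sample to the component (= class) with the
# highest posterior p(Theta = i | x), computed exactly like gamma in the E-step.
import numpy as np
from scipy.stats import multivariate_normal

def predict(X, alphas, means, covs):
    weighted = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
        for i in range(len(alphas))
    ])
    posterior = weighted / weighted.sum(axis=1, keepdims=True)   # p(Theta = i | x)
    return posterior.argmax(axis=1)                              # predicted class index
```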

Customer Segmentation Example

Apply GMM-based semi-supervised learning to segment 200 e-commerce customers into "Budget" and "Affluent" segments. Only 50 customers are labeled, while 150 remain unlabeled.

Customer Dataset: Mixed Labeled and Unlabeled

| ID | Age | Annual Income | Annual Spending | Label | Type |
| --- | --- | --- | --- | --- | --- |
| 1 | 28 | $45,000 | $3,200 | Budget | labeled |
| 2 | 45 | $85,000 | $8,500 | Affluent | labeled |
| 3 | 22 | $28,000 | $1,200 | Budget | labeled |
| 4 | 52 | $120,000 | $12,000 | Affluent | labeled |
| 5 | 35 | $65,000 | $4,800 | Budget | labeled |
| 6 | 31 | $58,000 | $4,200 | Unknown | unlabeled |
| 7 | 48 | $95,000 | $9,800 | Unknown | unlabeled |
| 8 | 29 | $42,000 | $2,800 | Unknown | unlabeled |
| 9 | 55 | $110,000 | $11,500 | Unknown | unlabeled |
| 10 | 26 | $35,000 | $2,100 | Unknown | unlabeled |

Dataset: 200 customers total (50 labeled: 25 Budget, 25 Affluent; 150 unlabeled). Features: Age, Annual Income, Annual Spending.

GMM Training Process

Initialization:

Use 50 labeled samples to initialize 2 Gaussian components:

  • Component 1 (Budget): μ₁ = mean of the 25 Budget customers, Σ₁ = their covariance
  • Component 2 (Affluent): μ₂ = mean of the 25 Affluent customers, Σ₂ = their covariance
  • α₁ = α₂ = 0.5 (equal weights initially); see the sketch below
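
As a small sketch, the initialization step might look like the following. For brevity it uses only the five labeled rows shown in the table above (the full dataset has 25 labeled customers per class), with 0 = Budget and 1 = Affluent as assumed class indices.

```python
# Initialization sketch using only labeled customers (Age, Annual Income, Annual Spending).
# Only the five labeled rows from the table above are used here for illustration.
import numpy as np

X_l = np.array([[28,  45_000,  3_200],
                [45,  85_000,  8_500],
                [22,  28_000,  1_200],
                [52, 120_000, 12_000],
                [35,  65_000,  4_800]], dtype=float)
y_l = np.array([0, 1, 0, 1, 0])                          # 0 = Budget, 1 = Affluent

means = [X_l[y_l == i].mean(axis=0) for i in range(2)]   # mu_1, mu_2
covs = [np.cov(X_l[y_l == i].T) for i in range(2)]       # Sigma_1, Sigma_2
alphas = np.array([0.5, 0.5])                            # equal initial weights
```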

E-Step (Iteration 1):

Compute soft assignments for 150 unlabeled customers:

  • Customer 6: γ₆₁ = 0.72 (72% Budget), γ₆₂ = 0.28 (28% Affluent)
  • Customer 7: γ₇₁ = 0.15 (15% Budget), γ₇₂ = 0.85 (85% Affluent)
  • Customer 8: γ₈₁ = 0.88 (88% Budget), γ₈₂ = 0.12 (12% Affluent)

M-Step (Iteration 1):

Update parameters using weighted contributions:

  • μ₁ = weighted average (25 hard Budget + 150 soft Budget contributions)
  • μ₂ = weighted average (25 hard Affluent + 150 soft Affluent contributions)
  • α₁, α₂, Σ₁, Σ₂ updated similarly

Convergence (After 8 iterations):

Parameters stabilize. Final model achieves 92% accuracy on test set (vs 78% using labeled samples only).

Advantages and Limitations

Advantages

  • Solid theoretical foundation: Well-established probabilistic framework
  • Strong interpretability: Each component has clear meaning (class)
  • Can leverage unlabeled data: Improves parameter estimation
  • Probabilistic outputs: Provides confidence scores, not just hard labels

Limitations

  • Sensitive to model assumptions: Requires the data to match the assumed distribution
  • May degrade performance: If assumptions fail, unlabeled data can hurt
  • Requires matching distribution: GMM assumes Gaussian; may not fit all data
  • Computationally intensive: EM algorithm can be slow for large datasets

Common Extensions

  • Mixture of Experts: Multiple specialized models instead of single GMM
  • Naive Bayes: For discrete/categorical data instead of continuous
  • Variational EM: Faster approximation for large-scale problems