Master generative model-based semi-supervised learning. Learn how Gaussian Mixture Models leverage both labeled and unlabeled samples through the EM algorithm to improve classification performance.
Generative methods assume that all samples (both labeled and unlabeled) are generated by the same underlying generative model (e.g., Gaussian Mixture Model, Naive Bayes). Each class corresponds to a "generative component" of the model. By maximizing the joint likelihood of labeled and unlabeled samples, we estimate the model parameters and obtain a classifier.
Instead of only using labeled samples to estimate parameters, we use both labeled and unlabeled samples to maximize the joint likelihood. This allows the model to better capture the true data distribution, especially when labeled samples are limited.
The most common generative model for semi-supervised learning. We assume samples are generated from a mixture of N Gaussian components, where each component corresponds to one class.
A sample $\boldsymbol{x}$ is generated from a mixture of $N$ Gaussian components:

$$p(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

Where:

- $\alpha_i$ is the mixing coefficient of component $i$, with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$
- $p(\boldsymbol{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the Gaussian density of component $i$, with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$
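As a quick illustration, here is a minimal Python sketch of evaluating this mixture density. The two-component parameter values are illustrative, not taken from the text:

```python
# Minimal sketch: p(x) = sum_i alpha_i * N(x | mu_i, Sigma_i).
# Parameter values below are illustrative placeholders.
import numpy as np
from scipy.stats import multivariate_normal

alphas = np.array([0.5, 0.5])                       # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
sigmas = [np.eye(2), np.eye(2)]                     # component covariances

def mixture_density(x):
    """Mixture density: weighted sum of the component Gaussians at x."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=s)
               for a, m, s in zip(alphas, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```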
For a binary classification problem (e.g., spam vs. ham emails), $N = 2$: one component generates spam and the other generates ham, so $p(\boldsymbol{x}) = \alpha_1 \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \alpha_2 \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$.
Our goal is to maximize the joint log-likelihood of both labeled and unlabeled samples. We use the EM algorithm to iteratively solve for model parameters.
We maximize the sum of log-likelihoods over the labeled set $D_l$ and the unlabeled set $D_u$:

$$LL(D_l \cup D_u) = \sum_{(\boldsymbol{x}_j, y_j) \in D_l} \ln\!\left( \alpha_{y_j} \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_{y_j}, \boldsymbol{\Sigma}_{y_j}) \right) + \sum_{\boldsymbol{x}_j \in D_u} \ln\!\left( \sum_{i=1}^{N} \alpha_i \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \right)$$

Labeled samples contribute the first term (the class $y_j$ is known, so only that component counts), while unlabeled samples contribute the second term (the class is unknown, so we marginalize over all components).
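A minimal sketch of this objective in Python, assuming hypothetical numpy arrays `X_l`, `y_l` (labeled samples and their integer class labels) and `X_u` (unlabeled samples), with parameters stored as lists per component:

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_log_likelihood(X_l, y_l, X_u, alphas, mus, sigmas):
    """Joint log-likelihood of labeled and unlabeled samples under the GMM."""
    N = len(alphas)
    # Labeled term: class is known, so only that component contributes.
    ll = sum(np.log(alphas[y] * multivariate_normal.pdf(x, mus[y], sigmas[y]))
             for x, y in zip(X_l, y_l))
    # Unlabeled term: class is unknown, so marginalize over all components.
    ll += sum(np.log(sum(alphas[i] * multivariate_normal.pdf(x, mus[i], sigmas[i])
                         for i in range(N)))
              for x in X_u)
    return ll
```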
Given the current parameters, compute the posterior probability that each unlabeled sample $\boldsymbol{x}_j$ belongs to each Gaussian component (soft assignment):

$$\gamma_{ji} = \frac{\alpha_i \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{k=1}^{N} \alpha_k \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$$

Where:

- $\gamma_{ji}$ is the posterior probability (responsibility) that unlabeled sample $\boldsymbol{x}_j$ was generated by component $i$; for each sample, the responsibilities sum to 1 over the components
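A sketch of the E-step under the same assumed parameter representation (`alphas`, `mus`, `sigmas` as in the earlier sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X_u, alphas, mus, sigmas):
    """Return gamma[j, i] = P(component i | unlabeled sample x_j)."""
    N = len(alphas)
    # Unnormalized: alpha_i * p(x_j | mu_i, Sigma_i) for every sample/component pair.
    dens = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X_u, mus[i], sigmas[i])
        for i in range(N)
    ])                                              # shape (num_unlabeled, N)
    return dens / dens.sum(axis=1, keepdims=True)   # normalize each row
```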
Update parameters using both labeled samples (hard assignment) and unlabeled samples (soft assignment from E-step):
$$\boldsymbol{\mu}_i = \frac{1}{\sum_{\boldsymbol{x}_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{\boldsymbol{x}_j \in D_u} \gamma_{ji}\, \boldsymbol{x}_j + \sum_{(\boldsymbol{x}_j, y_j) \in D_l,\, y_j = i} \boldsymbol{x}_j \right)$$

Where $l_i$ = number of labeled samples in class $i$. The mean combines weighted unlabeled samples (soft contribution) and labeled samples of class $i$ (hard contribution).
$$\alpha_i = \frac{1}{m} \left( \sum_{\boldsymbol{x}_j \in D_u} \gamma_{ji} + l_i \right)$$

Where $m$ = total number of samples. This is the proportion of samples (weighted for unlabeled) belonging to component $i$.
$$\boldsymbol{\Sigma}_i = \frac{1}{\sum_{\boldsymbol{x}_j \in D_u} \gamma_{ji} + l_i} \left( \sum_{\boldsymbol{x}_j \in D_u} \gamma_{ji} (\boldsymbol{x}_j - \boldsymbol{\mu}_i)(\boldsymbol{x}_j - \boldsymbol{\mu}_i)^{\mathsf{T}} + \sum_{(\boldsymbol{x}_j, y_j) \in D_l,\, y_j = i} (\boldsymbol{x}_j - \boldsymbol{\mu}_i)(\boldsymbol{x}_j - \boldsymbol{\mu}_i)^{\mathsf{T}} \right)$$

A weighted covariance combining soft assignments from unlabeled samples and hard assignments from labeled samples.
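Putting the three updates together, here is a sketch of the M-step. It reuses the assumed names from the E-step sketch; `gamma` is the E-step output and `y_l` is a numpy array of integer class labels:

```python
import numpy as np

def m_step(X_l, y_l, X_u, gamma, N):
    """Update (alphas, mus, sigmas) from soft (unlabeled) and hard (labeled) assignments."""
    m = len(X_l) + len(X_u)
    alphas, mus, sigmas = [], [], []
    for i in range(N):
        X_li = X_l[y_l == i]              # labeled samples of class i (hard assignment)
        w = gamma[:, i]                   # soft weights of unlabeled samples for component i
        eff = w.sum() + len(X_li)         # effective sample count: sum of gammas + l_i
        mu = (X_u.T @ w + X_li.sum(axis=0)) / eff
        diff_u, diff_l = X_u - mu, X_li - mu
        sigma = ((diff_u * w[:, None]).T @ diff_u + diff_l.T @ diff_l) / eff
        alphas.append(eff / m)
        mus.append(mu)
        sigmas.append(sigma)
    return np.array(alphas), mus, sigmas
```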
For a new sample $\boldsymbol{x}$, predict its class by maximizing the posterior probability:

$$f(\boldsymbol{x}) = \underset{j \in \{1, \dots, N\}}{\arg\max} \; \frac{\alpha_j \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{i=1}^{N} \alpha_i \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}$$

Where the denominator is the same for every class, so in practice we simply pick the component with the largest $\alpha_j \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$.
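A sketch of this prediction rule; since the normalizing denominator is shared across classes, it just scores each component:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, alphas, mus, sigmas):
    """Assign x to the component with the largest alpha_j * p(x | mu_j, Sigma_j)."""
    scores = [a * multivariate_normal.pdf(x, m, s)
              for a, m, s in zip(alphas, mus, sigmas)]
    return int(np.argmax(scores))  # normalizing constant is identical for all classes
```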
Apply GMM-based semi-supervised learning to segment 200 e-commerce customers into "Budget" and "Affluent" segments. Only 50 customers are labeled, while 150 remain unlabeled.
| ID | Age | Annual Income | Annual Spending | Label | Type |
|---|---|---|---|---|---|
| 1 | 28 | $45,000 | $3,200 | Budget | labeled |
| 2 | 45 | $85,000 | $8,500 | Affluent | labeled |
| 3 | 22 | $28,000 | $1,200 | Budget | labeled |
| 4 | 52 | $120,000 | $12,000 | Affluent | labeled |
| 5 | 35 | $65,000 | $4,800 | Budget | labeled |
| 6 | 31 | $58,000 | $4,200 | Unknown | unlabeled |
| 7 | 48 | $95,000 | $9,800 | Unknown | unlabeled |
| 8 | 29 | $42,000 | $2,800 | Unknown | unlabeled |
| 9 | 55 | $110,000 | $11,500 | Unknown | unlabeled |
| 10 | 26 | $35,000 | $2,100 | Unknown | unlabeled |
Dataset: 200 customers total (50 labeled: 25 Budget, 25 Affluent; 150 unlabeled). Features: Age, Annual Income, Annual Spending.
Use the 50 labeled samples to initialize the 2 Gaussian components, estimating each component's mean vector, covariance matrix, and mixing coefficient from the labeled customers of that class.
Compute soft assignments for the 150 unlabeled customers using the E-step formula above.
Update the parameters using the M-step, combining weighted contributions from unlabeled customers (soft) and labeled customers (hard).
Repeat the E- and M-steps until the parameters stabilize. The final model achieves 92% accuracy on the test set (vs. 78% using labeled samples only).
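To make the workflow concrete, here is a self-contained sketch of the full loop on synthetic data shaped like the case study. The customer records and accuracy figures above are not reproduced; the data below is randomly generated, and the two features stand in for income (in $1,000s) and spending (in $1,000s):

```python
# End-to-end sketch: GMM-based semi-supervised EM on synthetic "customer" data.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic "Budget" (class 0) and "Affluent" (class 1) customers, 2 features each.
X0 = rng.multivariate_normal([40, 3], [[60, 5], [5, 1]], size=100)
X1 = rng.multivariate_normal([95, 10], [[150, 10], [10, 4]], size=100)
X = np.vstack([X0, X1])
y_true = np.array([0] * 100 + [1] * 100)

labeled = rng.choice(200, size=50, replace=False)   # 50 labeled, 150 unlabeled
mask = np.zeros(200, dtype=bool)
mask[labeled] = True
X_l, y_l, X_u = X[mask], y_true[mask], X[~mask]

# Step 1: initialize each component from its labeled samples.
N = 2
alphas = np.array([(y_l == i).mean() for i in range(N)])
mus = [X_l[y_l == i].mean(axis=0) for i in range(N)]
sigmas = [np.cov(X_l[y_l == i].T) for i in range(N)]

for _ in range(50):                                 # EM iterations
    # E-step: soft assignments for unlabeled customers.
    dens = np.column_stack([alphas[i] * multivariate_normal.pdf(X_u, mus[i], sigmas[i])
                            for i in range(N)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted updates combining soft (unlabeled) and hard (labeled) assignments.
    m = len(X_l) + len(X_u)
    for i in range(N):
        X_li = X_l[y_l == i]
        w = gamma[:, i]
        eff = w.sum() + len(X_li)
        mus[i] = (X_u.T @ w + X_li.sum(axis=0)) / eff
        d_u, d_l = X_u - mus[i], X_li - mus[i]
        sigmas[i] = ((d_u * w[:, None]).T @ d_u + d_l.T @ d_l) / eff
        alphas[i] = eff / m

# Predict on all samples and report accuracy on this synthetic set.
scores = np.column_stack([alphas[i] * multivariate_normal.pdf(X, mus[i], sigmas[i])
                          for i in range(N)])
print("accuracy:", (scores.argmax(axis=1) == y_true).mean())
```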