Understand the foundation of semi-supervised learning: sample definitions, core assumptions, and learning paradigms that enable learning from limited labeled data.
Semi-supervised learning is a learning paradigm that leverages both labeled and unlabeled samples to improve learning performance. Unlike supervised learning (which requires all samples to be labeled) or unsupervised learning (which uses no labels), semi-supervised learning addresses the common real-world scenario where labeled data is scarce but unlabeled data is abundant.
The goal is to exploit unlabeled samples automatically, without any external interaction, to improve the learner's predictive performance on the samples to be classified. The core aim is to compensate for the scarcity of labeled samples.
In semi-supervised learning, samples are divided into two categories based on whether their class labels are known.
Labeled samples: samples where both the input and the corresponding class label are known.
Example:
"Image 1 is a cat" - we know both the image (input) and its label (cat).
Unlabeled samples: samples where only the input is known; the class label is unknown.
Example:
"Image 2" - we know the image (input) but not its category (label unknown).
Consider a spam email detection system with 200 emails. Only 50 emails have been manually labeled (as "Spam" or "Ham"), while 150 emails remain unlabeled. This is a typical semi-supervised learning scenario.
| ID | Sender | Subject | Words | Links | Label | Type |
|---|---|---|---|---|---|---|
| 1 | noreply@bank.com | Verify your account | 45 | 2 | Spam | labeled |
| 2 | john@company.com | Meeting tomorrow | 120 | 0 | Ham | labeled |
| 3 | newsletter@store.com | Summer sale | 80 | 3 | Ham | labeled |
| 4 | support@service.com | Your order shipped | 95 | 1 | Ham | labeled |
| 5 | promo@website.com | Win $1000 now | 30 | 5 | Spam | labeled |
| 6 | team@startup.com | Weekly update | 150 | 0 | Unknown | unlabeled |
| 7 | deals@retail.com | Limited time offer | 60 | 4 | Unknown | unlabeled |
| 8 | client@business.com | Project proposal | 200 | 1 | Unknown | unlabeled |
| 9 | alert@security.com | Suspicious activity | 70 | 2 | Unknown | unlabeled |
| 10 | info@conference.com | Registration open | 110 | 1 | Unknown | unlabeled |
Dataset contains 200 emails total: 50 labeled (25 Spam, 25 Ham) and 150 unlabeled. Features include sender domain, subject length, word count, and number of links.
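As a concrete illustration, here is a minimal sketch of how the first ten rows of this dataset could be represented in code, using scikit-learn's convention of marking unlabeled samples with -1. The numeric features are taken from the Words and Links columns above; the label encoding (1 = Spam, 0 = Ham) is an illustrative choice, not part of the original dataset.

```python
import numpy as np

# Features per email: [word_count, num_links], from the table above
X = np.array([
    [45, 2],   # 1: Spam
    [120, 0],  # 2: Ham
    [80, 3],   # 3: Ham
    [95, 1],   # 4: Ham
    [30, 5],   # 5: Spam
    [150, 0],  # 6: unlabeled
    [60, 4],   # 7: unlabeled
    [200, 1],  # 8: unlabeled
    [70, 2],   # 9: unlabeled
    [110, 1],  # 10: unlabeled
])

# scikit-learn's semi-supervised API marks unknown labels with -1
y = np.array([1, 0, 0, 0, 1, -1, -1, -1, -1, -1])  # 1 = Spam, 0 = Ham

labeled = y != -1
print(f"{labeled.sum()} labeled, {(~labeled).sum()} unlabeled samples")
```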
The effectiveness of semi-supervised learning relies on assumptions about the data distribution. All semi-supervised methods are designed around two fundamental assumptions: the cluster assumption and the manifold assumption.
Cluster assumption: the data exhibits a cluster structure, and samples within the same cluster are highly likely to belong to the same class.
Intuition:
"Similar samples (in the same cluster) should have the same label." For example, if multiple similar images form a cluster, they are likely all dogs or all cats.
Mathematical Formulation:
If samples $x_i$ and $x_j$ are in the same cluster, then $y_i = y_j$ with high probability.
Methods Based on This Assumption:
Generative methods (e.g., Gaussian mixture models), semi-supervised SVMs (e.g., TSVM), and self-training all rely on this assumption; semi-supervised SVMs in particular seek decision boundaries that pass through low-density regions between clusters (see the sketch below).
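Below is a small sketch of self-training on synthetic data where the cluster assumption holds by construction. The SVC base learner, the 0.9 confidence threshold, and the 5-labels-per-class split are arbitrary demonstration choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Two well-separated blobs: the cluster assumption holds by construction
X, y_true = make_blobs(n_samples=200, centers=2, random_state=42)

# Mark everything unlabeled (-1), then reveal 5 labels per class
y = np.full(len(y_true), -1)
for cls in (0, 1):
    y[np.where(y_true == cls)[0][:5]] = cls

# Self-training: iteratively pseudo-label unlabeled points the current
# model is confident about, then refit on the enlarged labeled set
model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
model.fit(X, y)
print("accuracy on all samples:", model.score(X, y_true))
```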
Manifold assumption: the data lies on a low-dimensional manifold embedded in a high-dimensional space, and neighboring samples on the manifold should have similar output values.
Intuition:
"Nearby samples (on the manifold) should have similar labels." For example, images with similar pixel patterns are more likely to have the same category.
Mathematical Formulation:
If $d(x_i, x_j)$ is small (the samples are close on the manifold), then $|f(x_i) - f(x_j)|$ should also be small (similar predictions).
Methods Based on This Assumption:
Graph-based methods such as label propagation and label spreading, as well as manifold regularization, which penalize differences in predictions between neighboring samples on a similarity graph (see the sketch below).
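The following sketch illustrates graph-based label spreading on the classic two-moons dataset, where each class lies on a one-dimensional manifold in 2-D space. The knn kernel, n_neighbors=7, and 3-labels-per-class split are illustrative, untuned choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two interleaving half-moons: each class lies on a 1-D manifold in 2-D space
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

# Mark everything unlabeled (-1), then reveal 3 labels per class
y = np.full(len(y_true), -1)
for cls in (0, 1):
    y[np.where(y_true == cls)[0][:3]] = cls

# Spread labels along a k-nearest-neighbor graph that follows the manifold
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
print("accuracy on all samples:", model.score(X, y_true))
```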
If these assumptions are violated (e.g., the data has no cluster structure or does not lie on a low-dimensional manifold), unlabeled samples may actually degrade performance compared to using labeled samples alone. Always validate the assumptions before applying semi-supervised methods.
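A practical safeguard is to compare against a supervised-only baseline trained on the labeled subset alone. The helper below is a hypothetical sketch of that comparison (compare_baselines is not a library function, and the logistic-regression learners are arbitrary placeholders); it reuses the -1 labeling convention from above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

def compare_baselines(X, y, X_test, y_test):
    """Compare supervised-only vs. semi-supervised fits (y uses -1 = unlabeled)."""
    labeled = y != -1

    # Baseline: train on the labeled subset only
    supervised = LogisticRegression().fit(X[labeled], y[labeled])

    # Semi-supervised: train on labeled + unlabeled samples together
    semi = SelfTrainingClassifier(LogisticRegression()).fit(X, y)

    # If the semi-supervised score is lower, the assumptions likely fail here
    return supervised.score(X_test, y_test), semi.score(X_test, y_test)
```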
Semi-supervised learning differs from related learning paradigms. Understanding these distinctions helps choose the right approach for your problem.
Active learning: requires actively querying external sources (e.g., human experts) for the true labels of selected unlabeled samples.
Key Assumption:
Labeling cost is high but querying is feasible
Typical Application:
Medical data annotation (requires expert confirmation)
Pure (inductive) semi-supervised learning: the unlabeled samples $D_u$ are not the data to be predicted (open-world setting).
Key Assumption:
New data will continuously arrive
Typical Application:
General classification tasks (e.g., spam email detection)
Transductive learning: the unlabeled samples $D_u$ are exactly the data to be predicted (closed-world setting).
Key Assumption:
Only need to predict known unlabeled samples
Typical Application:
Fixed dataset classification (e.g., specific batch of documents)
| Paradigm | External Query | Unlabeled Data | Setting |
|---|---|---|---|
| Active Learning | Required | Query pool | Interactive |
| Pure Semi-Supervised | Not required | Training data (not test) | Open-world |
| Transductive Learning | Not required | Test data (to be predicted) | Closed-world |
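The transductive setting can be made concrete with a graph-based model: after fitting, scikit-learn's transduction_ attribute holds the inferred label for every training sample, including the formerly unlabeled ones, so no separate test set is involved. The toy data below is purely illustrative.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy 1-D feature: Du (the -1 entries) is exactly the data we must classify
X = np.array([[1.0], [1.2], [4.8], [5.0], [1.1], [4.9]])
y = np.array([0, 0, 1, 1, -1, -1])  # the last two samples are unlabeled

model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, y)

# transduction_ holds the inferred label for every training sample,
# including the formerly unlabeled ones
print(model.transduction_)  # expected: [0 0 1 1 0 1]
```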
Q: When should I use semi-supervised learning?
A: Use semi-supervised learning when you have limited labeled data but abundant unlabeled data, and when the core assumptions (cluster or manifold) are likely to hold. Common scenarios include email spam detection, document classification, and image recognition where labeling is expensive.
Q: What is the difference between the cluster assumption and the manifold assumption?
A: The cluster assumption focuses on discrete groups (clusters) where samples in the same cluster share labels. The manifold assumption focuses on continuous structure where nearby samples on a low-dimensional manifold have similar labels. Many methods combine both assumptions.
Q: Can unlabeled data hurt performance?
A: Yes! If the core assumptions are violated, unlabeled samples can degrade performance. Always validate assumptions and compare semi-supervised results against supervised-only baselines.
Q: How many labeled samples do I need?
A: There's no fixed rule, but typically you need enough labeled samples to train a reasonable initial model (e.g., 50-500 samples depending on problem complexity). More labeled data helps, but semi-supervised learning is most valuable when labeled data is scarce.