
Semi-Supervised Learning Core Fundamentals

Understand the foundation of semi-supervised learning: sample definitions, core assumptions, and learning paradigms that enable learning from limited labeled data.

Module 1 of 6
Intermediate Level
90-120 min

What is Semi-Supervised Learning?

Semi-supervised learning is a learning paradigm that leverages both labeled and unlabeled samples to improve learning performance. Unlike supervised learning (which requires all samples to be labeled) or unsupervised learning (which uses no labels), semi-supervised learning addresses the common real-world scenario where labeled data is scarce but unlabeled data is abundant.

Learning Objective

Without relying on external interaction, automatically utilize the unlabeled samples $D_u$ to improve the learner's prediction performance on the samples to be classified. The core goal is to compensate for the scarcity of labeled samples.

Sample Definitions and Classification

In semi-supervised learning, samples are divided into two categories based on whether their class labels are known.

Labeled Sample Set $D_l$

$D_l = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$

Samples where both the input $x$ and the corresponding class label $y$ are known.

Example:

"Image 1 is a cat" - we know both the image (input) and its label (cat).

Unlabeled Sample Set $D_u$

$D_u = \{x_{l+1}, x_{l+2}, \ldots, x_{l+u}\}$

Samples where only the input $x$ is known, but the class label $y$ is unknown.

Example:

"Image 2" - we know the image (input) but not its category (label unknown).

Notation Summary

  • $l$ = number of labeled samples
  • $u$ = number of unlabeled samples
  • $m = l + u$ = total number of samples
  • Typically, $u \gg l$ (unlabeled samples far outnumber labeled samples)
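
To make the notation concrete, the following NumPy sketch splits a dataset into $D_l$ and $D_u$. The feature values, the toy labels, and the 50/150 split are illustrative assumptions chosen to match the email example below; they are not part of the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dataset: m samples with 2 features each (values are made up).
m = 200
X = rng.normal(size=(m, 2))

# Assume only the first l samples have been labeled; the rest are unlabeled.
l = 50                       # number of labeled samples
u = m - l                    # number of unlabeled samples; typically u >> l

X_l = X[:l]                  # inputs of the labeled set D_l
y_l = rng.integers(0, 2, l)  # their known class labels (toy labels here)
X_u = X[l:]                  # inputs of the unlabeled set D_u (labels unknown)

print(l, u, l + u)           # 50 150 200, i.e. m = l + u
```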

Email Classification Example

Consider a spam email detection system with 200 emails. Only 50 emails have been manually labeled (as "Spam" or "Ham"), while 150 emails remain unlabeled. This is a typical semi-supervised learning scenario.

Email Dataset: Labeled vs Unlabeled Samples

ID  Sender                Subject              Words  Links  Label    Type
1   noreply@bank.com      Verify your account     45      2  Spam     labeled
2   john@company.com      Meeting tomorrow       120      0  Ham      labeled
3   newsletter@store.com  Summer sale             80      3  Ham      labeled
4   support@service.com   Your order shipped      95      1  Ham      labeled
5   promo@website.com     Win $1000 now           30      5  Spam     labeled
6   team@startup.com      Weekly update          150      0  Unknown  unlabeled
7   deals@retail.com      Limited time offer      60      4  Unknown  unlabeled
8   client@business.com   Project proposal       200      1  Unknown  unlabeled
9   alert@security.com    Suspicious activity     70      2  Unknown  unlabeled
10  info@conference.com   Registration open      110      1  Unknown  unlabeled

Dataset contains 200 emails total: 50 labeled (25 Spam, 25 Ham) and 150 unlabeled. Features include sender domain, subject length, word count, and number of links.

Labeled Set $D_l$

  • 50 emails with known labels
  • Used for initial model training
  • Examples: IDs 1-5 shown above
  • Provides ground truth for learning

Unlabeled Set $D_u$

  • 150 emails without labels
  • Used to improve model performance
  • Examples: IDs 6-10 shown above
  • Leveraged through core assumptions
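
As a sketch of how the unlabeled emails can actually be leveraged, here is a minimal self-training loop: fit a classifier on $D_l$, adopt its most confident predictions on $D_u$ as pseudo-labels, and refit. The two features, the logistic-regression model, and the 0.9 confidence threshold are illustrative assumptions rather than a method prescribed by this module.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in features per email: [word count, number of links]
X_l = np.array([[45, 2], [120, 0], [80, 3], [95, 1], [30, 5]], dtype=float)
y_l = np.array([1, 0, 0, 0, 1])           # 1 = Spam, 0 = Ham (labels from the table)
X_u = np.array([[150, 0], [60, 4], [200, 1], [70, 2], [110, 1]], dtype=float)

model = LogisticRegression()
for _ in range(3):                         # a few self-training rounds
    model.fit(X_l, y_l)                    # train on the current labeled set
    if len(X_u) == 0:
        break
    proba = model.predict_proba(X_u)       # model confidence on unlabeled emails
    confident = proba.max(axis=1) >= 0.9   # keep only high-confidence pseudo-labels
    if not confident.any():
        break
    X_l = np.vstack([X_l, X_u[confident]])
    y_l = np.concatenate([y_l, proba[confident].argmax(axis=1)])
    X_u = X_u[~confident]                  # pseudo-labeled emails leave D_u

print(model.predict([[40.0, 6.0]]))        # classify a new email (toy feature values)
```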

Two Core Assumptions

The effectiveness of semi-supervised learning relies on key assumptions about the data distribution. Semi-supervised methods are designed around two fundamental assumptions:

1. Cluster Assumption

Data exhibits a "cluster structure" where samples within the same cluster are highly likely to belong to the same class.

Intuition:

"Similar samples (in the same cluster) should have the same label." For example, if multiple similar images form a cluster, they are likely all dogs or all cats.

Mathematical Formulation:

If samples $x_i$ and $x_j$ are in the same cluster, then $P(y_i = y_j \mid x_i, x_j \text{ in the same cluster}) \approx 1$

Methods Based on This Assumption:

  • Semi-Supervised SVM (TSVM) - places the decision boundary in low-density regions
  • Constrained clustering - enforces same-label constraints within clusters
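
As an illustration of the cluster assumption in action, the sketch below uses a simple "cluster-then-label" strategy: cluster all samples (labeled and unlabeled together), then give every sample in a cluster the majority label of the labeled samples that fall into it. The toy two-cluster data, the use of KMeans, and the majority-vote rule are illustrative assumptions, not a specific algorithm defined in this module.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: two well-separated groups, so the cluster assumption clearly holds.
X_l = np.vstack([rng.normal(0, 0.3, (3, 2)), rng.normal(3, 0.3, (3, 2))])
y_l = np.array([0, 0, 0, 1, 1, 1])
X_u = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

X_all = np.vstack([X_l, X_u])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_all)

# Each cluster takes the majority label among its labeled members.
y_u = np.empty(len(X_u), dtype=int)
for c in range(2):
    labels_in_c = y_l[clusters[:len(X_l)] == c]
    y_u[clusters[len(X_l):] == c] = np.bincount(labels_in_c).argmax()

print(y_u)  # unlabeled samples inherit their cluster's majority label
```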

2. Manifold Assumption

Data lies on a low-dimensional manifold embedded in high-dimensional space. Neighboring samples on the manifold should have similar output values.

Intuition:

"Nearby samples (on the manifold) should have similar labels." For example, images with similar pixel patterns are more likely to have the same category.

Mathematical Formulation:

If $\|x_i - x_j\|$ is small (the samples are close on the manifold), then $|f(x_i) - f(x_j)|$ should also be small (similar predictions).

Methods Based on This Assumption:

  • Graph-based learning - propagates labels along graph edges
  • Manifold regularization - enforces smoothness on the manifold
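
To make the graph/manifold idea concrete, here is a minimal label-propagation sketch: build an affinity matrix from pairwise distances, then repeatedly average each sample's label distribution with its neighbors' while clamping the labeled samples. The 1-D toy data, the RBF affinity with bandwidth 1, and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: two labeled points at the ends, unlabeled points in between.
X = np.concatenate([[0.0, 10.0], rng.uniform(0, 10, 30)])
y = np.full(len(X), -1)                # -1 marks "unlabeled"
y[0], y[1] = 0, 1                      # the two labeled samples

# RBF affinity: nearby samples (on the manifold) get large edge weights.
d = np.abs(X[:, None] - X[None, :])
W = np.exp(-(d ** 2) / 2.0)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)   # row-normalized propagation matrix

# Label distributions: one-hot for labeled samples, uniform for unlabeled.
F = np.full((len(X), 2), 0.5)
F[y >= 0] = np.eye(2)[y[y >= 0]]

for _ in range(100):                   # propagate labels along graph edges
    F = P @ F
    F[y >= 0] = np.eye(2)[y[y >= 0]]   # clamp the known labels each round

pred = F.argmax(axis=1)
print(pred)  # samples near x=0 tend to get label 0; samples near x=10 get label 1
```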

Important Note

If these assumptions are violated (e.g., data doesn't have cluster structure or doesn't lie on a manifold), unlabeled samples may actually degrade performance compared to using only labeled samples. Always validate assumptions before applying semi-supervised methods.

Learning Paradigms Comparison

Semi-supervised learning differs from related learning paradigms. Understanding these distinctions helps choose the right approach for your problem.

Active Learning

Requires active querying of unlabeled samples' true labels from external sources (e.g., human experts)

Key Assumption:

Labeling cost is high but querying is feasible

Typical Application:

Medical data annotation (requires expert confirmation)

Pure Semi-Supervised Learning

Unlabeled samples $D_u$ are not the data to be predicted (open-world setting)

Key Assumption:

New data will continuously arrive

Typical Application:

General classification tasks (e.g., spam email detection)

Transductive Learning

Unlabeled samples $D_u$ are exactly the data to be predicted (closed-world setting)

Key Assumption:

Only need to predict known unlabeled samples

Typical Application:

Fixed dataset classification (e.g., specific batch of documents)

Comparison Table

Paradigm               External Query  Unlabeled Data               Setting
Active Learning        Required        Query pool                   Interactive
Pure Semi-Supervised   Not required    Training data (not test)     Open-world
Transductive Learning  Not required    Test data (to be predicted)  Closed-world
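
The following sketch illustrates the practical difference between the last two rows, using scikit-learn's LabelSpreading as a stand-in semi-supervised learner; the toy data and parameters are assumptions for illustration. Transductive use reads off the learner's labels for $D_u$ itself, while pure semi-supervised (inductive) use applies the trained model to genuinely new samples.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Toy data: 2 labeled samples, 48 unlabeled (marked -1, scikit-learn's convention).
X = np.vstack([rng.normal(0, 0.5, (25, 2)), rng.normal(4, 0.5, (25, 2))])
y = np.full(50, -1)
y[0], y[25] = 0, 1

model = LabelSpreading(kernel="rbf", gamma=1.0).fit(X, y)

# Transductive setting: the predictions for the unlabeled samples D_u themselves.
print(model.transduction_[y == -1][:5])

# Pure semi-supervised (inductive) setting: predict genuinely new, unseen samples.
X_new = np.array([[0.2, -0.1], [3.8, 4.1]])
print(model.predict(X_new))
```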

Frequently Asked Questions

Q: When should I use semi-supervised learning?

A: Use semi-supervised learning when you have limited labeled data but abundant unlabeled data, and when the core assumptions (cluster or manifold) are likely to hold. Common scenarios include email spam detection, document classification, and image recognition where labeling is expensive.

Q: What's the difference between cluster assumption and manifold assumption?

A: The cluster assumption focuses on discrete groups (clusters) where samples in the same cluster share labels. The manifold assumption focuses on continuous structure where nearby samples on a low-dimensional manifold have similar labels. Many methods combine both assumptions.

Q: Can unlabeled data hurt performance?

A: Yes! If the core assumptions are violated, unlabeled samples can degrade performance. Always validate assumptions and compare semi-supervised results against supervised-only baselines.

Q: How much labeled data do I need?

A: There's no fixed rule, but typically you need enough labeled samples to train a reasonable initial model (e.g., 50-500 samples depending on problem complexity). More labeled data helps, but semi-supervised learning is most valuable when labeled data is scarce.