
Semi-Supervised Learning Core Fundamentals

Understand the foundation of semi-supervised learning: sample definitions, core assumptions, and learning paradigms that enable learning from limited labeled data.

Module 1 of 6
Intermediate Level
90-120 min

What is Semi-Supervised Learning?

Semi-supervised learning is a learning paradigm that leverages both labeled and unlabeled samples to improve learning performance. Unlike supervised learning (which requires all samples to be labeled) or unsupervised learning (which uses no labels), semi-supervised learning addresses the common real-world scenario where labeled data is scarce but unlabeled data is abundant.

Learning Objective

Without relying on external interaction, automatically utilize the unlabeled samples $D_u$ to improve the learner's prediction performance on the samples to be classified. The core goal is to compensate for the scarcity of labeled samples.

Sample Definitions and Classification

In semi-supervised learning, samples are divided into two categories based on whether their class labels are known.

Labeled Sample Set $D_l$

$D_l = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$

Samples where both the input $x$ and the corresponding class label $y$ are known.

Example:

"Image 1 is a cat" - we know both the image (input) and its label (cat).

Unlabeled Sample Set $D_u$

$D_u = \{x_{l+1}, x_{l+2}, \ldots, x_{l+u}\}$

Samples where only the input $x$ is known, but the class label $y$ is unknown.

Example:

"Image 2" - we know the image (input) but not its category (label unknown).

Notation Summary

  • $l$ = number of labeled samples
  • $u$ = number of unlabeled samples
  • $m = l + u$ = total number of samples
  • Typically, $u \gg l$ (unlabeled samples far outnumber labeled samples)
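
To make the notation concrete, the following NumPy sketch splits a dataset into $D_l$ and $D_u$. The feature values, the toy labels, and the 50/150 split are illustrative assumptions chosen to match the email example below; they are not part of the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dataset: m samples with 2 features each (values are made up).
m = 200
X = rng.normal(size=(m, 2))

# Assume only the first l samples have been labeled; the rest are unlabeled.
l = 50                       # number of labeled samples
u = m - l                    # number of unlabeled samples; typically u >> l

X_l = X[:l]                  # inputs of the labeled set D_l
y_l = rng.integers(0, 2, l)  # their known class labels (toy labels here)
X_u = X[l:]                  # inputs of the unlabeled set D_u (labels unknown)

print(l, u, l + u)           # 50 150 200, i.e. m = l + u
```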

Email Classification Example

Consider a spam email detection system with 200 emails. Only 50 emails have been manually labeled (as "Spam" or "Ham"), while 150 emails remain unlabeled. This is a typical semi-supervised learning scenario.

Email Dataset: Labeled vs Unlabeled Samples

ID  Sender                Subject              Words  Links  Label    Type
1   noreply@bank.com      Verify your account     45      2  Spam     labeled
2   john@company.com      Meeting tomorrow       120      0  Ham      labeled
3   newsletter@store.com  Summer sale             80      3  Ham      labeled
4   support@service.com   Your order shipped      95      1  Ham      labeled
5   promo@website.com     Win $1000 now           30      5  Spam     labeled
6   team@startup.com      Weekly update          150      0  Unknown  unlabeled
7   deals@retail.com      Limited time offer      60      4  Unknown  unlabeled
8   client@business.com   Project proposal       200      1  Unknown  unlabeled
9   alert@security.com    Suspicious activity     70      2  Unknown  unlabeled
10  info@conference.com   Registration open      110      1  Unknown  unlabeled

Dataset contains 200 emails total: 50 labeled (25 Spam, 25 Ham) and 150 unlabeled. Features include sender domain, subject length, word count, and number of links.

Labeled Set $D_l$

  • 50 emails with known labels
  • Used for initial model training
  • Examples: IDs 1-5 shown above
  • Provides ground truth for learning

Unlabeled Set $D_u$

  • 150 emails without labels
  • Used to improve model performance
  • Examples: IDs 6-10 shown above
  • Leveraged through core assumptions
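
As a sketch of how the unlabeled emails can actually be leveraged, here is a minimal self-training loop: fit a classifier on $D_l$, adopt its most confident predictions on $D_u$ as pseudo-labels, and refit. The two features, the logistic-regression model, and the 0.9 confidence threshold are illustrative assumptions rather than a method prescribed by this module.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in features per email: [word count, number of links]
X_l = np.array([[45, 2], [120, 0], [80, 3], [95, 1], [30, 5]], dtype=float)
y_l = np.array([1, 0, 0, 0, 1])           # 1 = Spam, 0 = Ham (labels from the table)
X_u = np.array([[150, 0], [60, 4], [200, 1], [70, 2], [110, 1]], dtype=float)

model = LogisticRegression()
for _ in range(3):                         # a few self-training rounds
    model.fit(X_l, y_l)                    # train on the current labeled set
    if len(X_u) == 0:
        break
    proba = model.predict_proba(X_u)       # model confidence on unlabeled emails
    confident = proba.max(axis=1) >= 0.9   # keep only high-confidence pseudo-labels
    if not confident.any():
        break
    X_l = np.vstack([X_l, X_u[confident]])
    y_l = np.concatenate([y_l, proba[confident].argmax(axis=1)])
    X_u = X_u[~confident]                  # pseudo-labeled emails leave D_u

print(model.predict([[40.0, 6.0]]))        # classify a new email (toy feature values)
```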

Two Core Assumptions

The effectiveness of semi-supervised learning relies on key assumptions about the data distribution. Semi-supervised methods are designed around two fundamental assumptions:

1. Cluster Assumption

Data exhibits a "cluster structure" where samples within the same cluster are highly likely to belong to the same class.

Intuition:

"Similar samples (in the same cluster) should have the same label." For example, if multiple similar images form a cluster, they are likely all dogs or all cats.

Mathematical Formulation:

If samples $x_i$ and $x_j$ are in the same cluster, then $P(y_i = y_j \mid x_i, x_j \text{ in the same cluster}) \approx 1$

Methods Based on This Assumption:

  • Semi-Supervised SVM (TSVM) - places the decision boundary in low-density regions
  • Constrained clustering - enforces same-label constraints within clusters
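
As an illustration of the cluster assumption in action, the sketch below uses a simple "cluster-then-label" strategy: cluster all samples (labeled and unlabeled together), then give every sample in a cluster the majority label of the labeled samples that fall into it. The toy two-cluster data, the use of KMeans, and the majority-vote rule are illustrative assumptions, not a specific algorithm defined in this module.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: two well-separated groups, so the cluster assumption clearly holds.
X_l = np.vstack([rng.normal(0, 0.3, (3, 2)), rng.normal(3, 0.3, (3, 2))])
y_l = np.array([0, 0, 0, 1, 1, 1])
X_u = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

X_all = np.vstack([X_l, X_u])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_all)

# Each cluster takes the majority label among its labeled members.
y_u = np.empty(len(X_u), dtype=int)
for c in range(2):
    labels_in_c = y_l[clusters[:len(X_l)] == c]
    y_u[clusters[len(X_l):] == c] = np.bincount(labels_in_c).argmax()

print(y_u)  # unlabeled samples inherit their cluster's majority label
```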

2. Manifold Assumption

Data lies on a low-dimensional manifold embedded in high-dimensional space. Neighboring samples on the manifold should have similar output values.

Intuition:

"Nearby samples (on the manifold) should have similar labels." For example, images with similar pixel patterns are more likely to have the same category.

Mathematical Formulation:

If $\|x_i - x_j\|$ is small (the samples are close on the manifold), then $|f(x_i) - f(x_j)|$ should also be small (similar predictions).

Methods Based on This Assumption:

  • Graph-based learning - propagates labels along graph edges
  • Manifold regularization - enforces smoothness on the manifold
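
To make the graph/manifold idea concrete, here is a minimal label-propagation sketch: build an affinity matrix from pairwise distances, then repeatedly average each sample's label distribution with its neighbors' while clamping the labeled samples. The 1-D toy data, the RBF affinity with bandwidth 1, and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: two labeled points at the ends, unlabeled points in between.
X = np.concatenate([[0.0, 10.0], rng.uniform(0, 10, 30)])
y = np.full(len(X), -1)                # -1 marks "unlabeled"
y[0], y[1] = 0, 1                      # the two labeled samples

# RBF affinity: nearby samples (on the manifold) get large edge weights.
d = np.abs(X[:, None] - X[None, :])
W = np.exp(-(d ** 2) / 2.0)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)   # row-normalized propagation matrix

# Label distributions: one-hot for labeled samples, uniform for unlabeled.
F = np.full((len(X), 2), 0.5)
F[y >= 0] = np.eye(2)[y[y >= 0]]

for _ in range(100):                   # propagate labels along graph edges
    F = P @ F
    F[y >= 0] = np.eye(2)[y[y >= 0]]   # clamp the known labels each round

pred = F.argmax(axis=1)
print(pred)  # samples near x=0 tend to get label 0; samples near x=10 get label 1
```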

Important Note

If these assumptions are violated (e.g., data doesn't have cluster structure or doesn't lie on a manifold), unlabeled samples may actually degrade performance compared to using only labeled samples. Always validate assumptions before applying semi-supervised methods.

Learning Paradigms Comparison

Semi-supervised learning differs from related learning paradigms. Understanding these distinctions helps choose the right approach for your problem.

Active Learning

Requires active querying of unlabeled samples' true labels from external sources (e.g., human experts)

Key Assumption:

Labeling cost is high but querying is feasible

Typical Application:

Medical data annotation (requires expert confirmation)

Pure Semi-Supervised Learning

Unlabeled samples $D_u$ are not the data to be predicted (open-world setting)

Key Assumption:

New data will continuously arrive

Typical Application:

General classification tasks (e.g., spam email detection)

Transductive Learning

Unlabeled samples $D_u$ are exactly the data to be predicted (closed-world setting)

Key Assumption:

Only need to predict known unlabeled samples

Typical Application:

Fixed dataset classification (e.g., specific batch of documents)

Comparison Table

Paradigm               External Query  Unlabeled Data               Setting
Active Learning        Required        Query pool                   Interactive
Pure Semi-Supervised   Not required    Training data (not test)     Open-world
Transductive Learning  Not required    Test data (to be predicted)  Closed-world
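
The following sketch illustrates the practical difference between the last two rows, using scikit-learn's LabelSpreading as a stand-in semi-supervised learner; the toy data and parameters are assumptions for illustration. Transductive use reads off the learner's labels for $D_u$ itself, while pure semi-supervised (inductive) use applies the trained model to genuinely new samples.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Toy data: 2 labeled samples, 48 unlabeled (marked -1, scikit-learn's convention).
X = np.vstack([rng.normal(0, 0.5, (25, 2)), rng.normal(4, 0.5, (25, 2))])
y = np.full(50, -1)
y[0], y[25] = 0, 1

model = LabelSpreading(kernel="rbf", gamma=1.0).fit(X, y)

# Transductive setting: the predictions for the unlabeled samples D_u themselves.
print(model.transduction_[y == -1][:5])

# Pure semi-supervised (inductive) setting: predict genuinely new, unseen samples.
X_new = np.array([[0.2, -0.1], [3.8, 4.1]])
print(model.predict(X_new))
```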

Frequently Asked Questions

Q: When should I use semi-supervised learning?

A: Use semi-supervised learning when you have limited labeled data but abundant unlabeled data, and when the core assumptions (cluster or manifold) are likely to hold. Common scenarios include email spam detection, document classification, and image recognition where labeling is expensive.

Q: What's the difference between cluster assumption and manifold assumption?

A: The cluster assumption focuses on discrete groups (clusters) where samples in the same cluster share labels. The manifold assumption focuses on continuous structure where nearby samples on a low-dimensional manifold have similar labels. Many methods combine both assumptions.

Q: Can unlabeled data hurt performance?

A: Yes! If the core assumptions are violated, unlabeled samples can degrade performance. Always validate assumptions and compare semi-supervised results against supervised-only baselines.

Q: How much labeled data do I need?

A: There's no fixed rule, but typically you need enough labeled samples to train a reasonable initial model (e.g., 50-500 samples depending on problem complexity). More labeled data helps, but semi-supervised learning is most valuable when labeled data is scarce.