
Semi-Supervised SVM (TSVM)

Master Transductive SVM for semi-supervised learning. Learn how to leverage unlabeled samples to optimize the decision boundary through the low-density separation principle.

Module 3 of 6
Intermediate to Advanced
100-130 min

Core Idea: Low-Density Separation

Based on the cluster assumption, the optimal decision boundary should pass through low-density regions (sparse areas between clusters) rather than high-density regions. A traditional SVM uses only labeled samples; TSVM additionally uses unlabeled samples to optimize the hyperplane's position.

Key Principle

The decision boundary should not only separate labeled samples correctly but also pass through regions where unlabeled samples are sparse. This aligns with the cluster assumption: low-density regions are likely boundaries between clusters (classes).

Advantages

  • High classification accuracy
  • Strong robustness
  • Leverages unlabeled data
  • Clear geometric intuition

Limitations

  • Computationally complex
  • Difficult for multi-class
  • Sensitive to initialization
  • Requires tuning $C_l$, $C_u$

TSVM (Transductive SVM) Algorithm

TSVM assigns pseudo-labels $\hat{y}$ to unlabeled samples and optimizes the SVM objective function over both labeled and pseudo-labeled samples.

Optimization Objective

$$\min_{w,b,\hat{y},\xi} \; \frac{1}{2}\|w\|_2^2 + C_l \sum_{i=1}^{l} \xi_i + C_u \sum_{i=l+1}^{m} \xi_i$$

Subject to:

$$y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m$$

For the unlabeled samples ($i = l+1, \ldots, m$), $y_i$ denotes the pseudo-label $\hat{y}_i$.

Where:

  • $\|w\|_2^2$ = margin maximization term (traditional SVM)
  • $C_l$ = penalty parameter for labeled samples (controls misclassification cost)
  • $C_u$ = penalty parameter for unlabeled samples (controls pseudo-label confidence)
  • $\xi_i$ = slack variables (allow some misclassification)
  • $\hat{y}$ = pseudo-labels for unlabeled samples (values in $\{+1, -1\}$)
  • Typically, start with $C_u < C_l$ (less confidence in pseudo-labels initially)
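
To make the objective concrete, here is a minimal Python sketch (the array names `X_l`, `y_l`, `X_u`, `y_hat` are this sketch's own, not from any library) that evaluates the objective, writing the optimal slacks in their equivalent hinge-loss form:

```python
import numpy as np

def tsvm_objective(w, b, X_l, y_l, X_u, y_hat, C_l, C_u):
    """Evaluate the TSVM objective with slacks in hinge-loss form.

    At the optimum, xi_i = max(0, 1 - y_i * (w . x_i + b)), so the slack
    sums can be computed directly from the current hyperplane (w, b).
    """
    margin_term = 0.5 * np.dot(w, w)                     # (1/2) ||w||_2^2
    xi_l = np.maximum(0.0, 1.0 - y_l * (X_l @ w + b))    # slacks, labeled samples
    xi_u = np.maximum(0.0, 1.0 - y_hat * (X_u @ w + b))  # slacks, pseudo-labeled
    return margin_term + C_l * xi_l.sum() + C_u * xi_u.sum()
```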
Step 1

Initialization

Train a standard SVM using only the labeled samples $D_l$. Use this model to predict pseudo-labels $\hat{y}$ for the unlabeled samples $D_u$.

Example: If the SVM predicts $f(x_j) > 0$, assign $\hat{y}_j = +1$; otherwise $\hat{y}_j = -1$.
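
A minimal initialization sketch with scikit-learn (assuming NumPy arrays `X_l`, `y_l` for the labeled set and `X_u` for the unlabeled set; these names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Step 1: train a standard SVM on the labeled samples only.
base_svm = SVC(kernel="linear", C=1.0)
base_svm.fit(X_l, y_l)

# decision_function plays the role of f(x): its sign is the pseudo-label.
y_hat = np.where(base_svm.decision_function(X_u) > 0, +1, -1)
```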

Step 2

Iterative Optimization

Set initial $C_u < C_l$ (e.g., $C_u = 0.1 \cdot C_l$). Solve the optimization problem to obtain the hyperplane $(w, b)$.

Note: A lower $C_u$ means less penalty for pseudo-label errors, allowing the algorithm to explore different pseudo-label assignments.
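
One way to solve this weighted problem in scikit-learn is to exploit the fact that `SVC.fit` scales the penalty `C` by each sample's `sample_weight`, which emulates separate $C_l$ and $C_u$ penalties; a sketch continuing from Step 1:

```python
C_l = 1.0
C_u = 0.1 * C_l  # low initial confidence in pseudo-labels

# Stack labeled and pseudo-labeled samples; the effective per-sample
# penalty becomes C * weight, i.e. C_l for labeled and C_u for unlabeled.
X_all = np.vstack([X_l, X_u])
y_all = np.concatenate([y_l, y_hat])
weights = np.concatenate([np.full(len(X_l), C_l),
                          np.full(len(X_u), C_u)])

model = SVC(kernel="linear", C=1.0)
model.fit(X_all, y_all, sample_weight=weights)
```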

Step 3

Check Margin Conflicts

Check if pseudo-labels are consistent with the margin:

  • If a sample with $\hat{y}_j = +1$ has margin $< 0$ (wrong side), it is a conflict
  • If a sample with $\hat{y}_j = -1$ has margin $> 0$ (wrong side), it is a conflict
  • Swap the pseudo-labels of conflicting samples (see the sketch below)
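
A sketch of the conflict check and swap (note: Joachims' original TSVM swaps pairs of conflicting samples with opposite pseudo-labels; the simpler per-sample flip below follows the description above):

```python
def swap_conflicts(y_hat, f_u):
    """Flip pseudo-labels that lie on the wrong side of the boundary.

    f_u[j] is the decision value f(x_j) of unlabeled sample j; a conflict
    means y_hat[j] and f_u[j] disagree in sign.
    """
    conflicts = y_hat * f_u < 0      # wrong side under current pseudo-label
    y_hat = y_hat.copy()
    y_hat[conflicts] *= -1           # swap +1 <-> -1
    return y_hat, int(conflicts.sum())
```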
Step 4

Gradually Increase C_u

Increase $C_u$ (e.g., $C_u = \min\{2C_u, C_l\}$) and repeat steps 2-3 until $C_u = C_l$.

Rationale: Gradually increasing confidence in pseudo-labels allows the algorithm to refine the decision boundary progressively.
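
Putting steps 2-4 together, the outer annealing loop might look like this sketch (reusing `X_l`, `X_u`, `X_all`, `y_l`, `y_hat`, `C_l`, and `swap_conflicts` from the sketches above):

```python
C_u = 0.1 * C_l
for _ in range(200):  # guard against non-termination in this sketch
    # Step 2: re-fit the SVM with the current C_l / C_u weighting.
    weights = np.concatenate([np.full(len(X_l), C_l),
                              np.full(len(X_u), C_u)])
    model = SVC(kernel="linear", C=1.0)
    model.fit(X_all, np.concatenate([y_l, y_hat]), sample_weight=weights)

    # Step 3: resolve margin conflicts under the new hyperplane.
    y_hat, n_swapped = swap_conflicts(y_hat, model.decision_function(X_u))
    if n_swapped > 0:
        continue                     # re-solve with the swapped labels

    # Step 4: once consistent, raise confidence in the pseudo-labels.
    if C_u >= C_l:
        break                        # C_u has reached C_l: finished (step 5)
    C_u = min(2 * C_u, C_l)
```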

Step 5

Output

The final pseudo-labels $\hat{y}$ for the unlabeled samples are the prediction results (transductive learning: predicting the known unlabeled samples).

Handling Class Imbalance

If pseudo-labels result in severe class imbalance (e.g., 90% positive, 10% negative), it can bias the SVM training. Solution: split $C_u$ into $C_u^+$ and $C_u^-$.

Balanced Cost Adjustment

$$C_u^+ = \frac{u_-}{u_+} \, C_u^-$$

Where:

  • $u_+$ = number of pseudo-labeled positive samples
  • $u_-$ = number of pseudo-labeled negative samples
  • This ensures a balanced penalty for both classes

Example

If after pseudo-labeling we have 90 positive and 10 negative samples:

  • $u_+ = 90$, $u_- = 10$
  • Set $C_u^- = 0.1$ (base value)
  • Then $C_u^+ = \frac{10}{90} \cdot 0.1 \approx 0.011$
  • This reduces the penalty for the majority class, balancing the optimization
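
In code, the adjustment is a couple of lines (a sketch; `y_hat` is the pseudo-label vector from the steps above):

```python
u_pos = int((y_hat == +1).sum())      # u_+: pseudo-labeled positives
u_neg = int((y_hat == -1).sum())      # u_-: pseudo-labeled negatives

C_u_neg = 0.1                         # base value, as in the example
C_u_pos = (u_neg / u_pos) * C_u_neg   # e.g., (10 / 90) * 0.1 ≈ 0.011
```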

Medical Diagnosis Example

Apply TSVM to diagnose a disease using 200 patient records. Only 50 patients have confirmed diagnoses (labeled), while 150 patients lack diagnosis (unlabeled). Features include age, fever temperature, cough presence, and fatigue level.

Patient Dataset: Limited Labels

| ID | Age | Fever (°C) | Cough | Fatigue | Label    | Type      |
|----|-----|------------|-------|---------|----------|-----------|
| 1  | 45  | 38.5       | Yes   | Yes     | Positive | labeled   |
| 2  | 32  | 36.8       | No    | No      | Negative | labeled   |
| 3  | 58  | 37.2       | No    | Yes     | Negative | labeled   |
| 4  | 29  | 38.8       | Yes   | Yes     | Positive | labeled   |
| 5  | 41  | 36.9       | No    | No      | Negative | labeled   |
| 6  | 52  | 37.5       | Yes   | No      | Unknown  | unlabeled |
| 7  | 35  | 38.2       | Yes   | Yes     | Unknown  | unlabeled |
| 8  | 48  | 36.7       | No    | No      | Unknown  | unlabeled |
| 9  | 27  | 37.8       | No    | Yes     | Unknown  | unlabeled |
| 10 | 61  | 38.9       | Yes   | Yes     | Unknown  | unlabeled |

Dataset: 200 patients (50 labeled: 25 Positive, 25 Negative; 150 unlabeled). Binary classification: Positive (disease present) vs Negative (disease absent).
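
As a sketch of how the table's first rows might be encoded for the SVM (Yes/No mapped to 1/0; illustrative arrays, not the full 200-patient dataset):

```python
import numpy as np

# Feature columns: age, fever (°C), cough (1 = Yes), fatigue (1 = Yes)
X_l = np.array([[45, 38.5, 1, 1],    # patient 1
                [32, 36.8, 0, 0],    # patient 2
                [58, 37.2, 0, 1],    # patient 3
                [29, 38.8, 1, 1],    # patient 4
                [41, 36.9, 0, 0]])   # patient 5
y_l = np.array([+1, -1, -1, +1, -1]) # Positive = +1, Negative = -1

X_u = np.array([[52, 37.5, 1, 0],    # patient 6
                [35, 38.2, 1, 1],    # patient 7
                [48, 36.7, 0, 0],    # patient 8
                [27, 37.8, 0, 1],    # patient 9
                [61, 38.9, 1, 1]])   # patient 10 (labels unknown)
```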

TSVM Training Process

Initialization:

Train standard SVM on 50 labeled patients. Predict pseudo-labels for 150 unlabeled patients.

  • Initial accuracy on labeled data: 84%
  • Pseudo-labels: 78 positive, 72 negative (relatively balanced)

Iteration 1 ($C_u = 0.1 \cdot C_l$):

Optimize with low confidence in pseudo-labels. Check margin conflicts and swap 12 conflicting pseudo-labels.

Iterations 2-4 ($C_u$ gradually increases):

Continue optimization with increasing confidence. Decision boundary shifts to low-density regions.

Final Result ($C_u = C_l$):

Final model achieves 91% accuracy on test set (vs 84% using labeled samples only). Decision boundary passes through low-density region between positive and negative clusters.

Advantages and Limitations

Advantages

  • High classification accuracy: Leverages unlabeled data effectively
  • Strong robustness: Margin maximization provides good generalization
  • Clear geometric intuition: Low-density separation is easy to understand
  • Well-established theory: Based on solid SVM foundation

Limitations

  • Computationally complex: Requires solving optimization with pseudo-labels
  • Difficult for multi-class: Primarily designed for binary classification
  • Sensitive to initialization: Poor initial pseudo-labels can lead to local optima
  • Requires parameter tuning: $C_l$ and $C_u$ need careful selection