Master Transductive SVM (TSVM) for semi-supervised learning. Learn how to leverage unlabeled samples to optimize the decision boundary through the low-density separation principle.
Based on the cluster assumption, the optimal decision boundary should pass through low-density regions (sparse areas between clusters) rather than high-density regions. Traditional SVM only uses labeled samples; TSVM additionally uses unlabeled samples to optimize the hyperplane position.
The decision boundary should not only separate labeled samples correctly but also pass through regions where unlabeled samples are sparse. This aligns with the cluster assumption: low-density regions are likely boundaries between clusters (classes).
TSVM assigns pseudo-labels to the unlabeled samples and optimizes the SVM objective over both labeled and pseudo-labeled samples:

$$\min_{\boldsymbol{w},\,b,\,\hat{\boldsymbol{y}},\,\boldsymbol{\xi}}\ \frac{1}{2}\|\boldsymbol{w}\|^2 + C_l\sum_{i=1}^{l}\xi_i + C_u\sum_{j=l+1}^{m}\xi_j$$

Subject to:

$$y_i(\boldsymbol{w}^\top\boldsymbol{x}_i + b) \ge 1 - \xi_i,\quad i = 1,\dots,l$$

$$\hat{y}_j(\boldsymbol{w}^\top\boldsymbol{x}_j + b) \ge 1 - \xi_j,\quad j = l+1,\dots,m$$

$$\xi_i \ge 0,\quad \xi_j \ge 0$$

Where:

- $(\boldsymbol{x}_1, y_1),\dots,(\boldsymbol{x}_l, y_l)$ are the labeled samples and $\boldsymbol{x}_{l+1},\dots,\boldsymbol{x}_m$ are the unlabeled samples
- $\hat{y}_j \in \{-1, +1\}$ is the pseudo-label assigned to unlabeled sample $\boldsymbol{x}_j$
- $\xi_i, \xi_j$ are slack variables
- $C_l$ and $C_u$ are the penalty parameters for labeled and pseudo-labeled samples, respectively
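As a concrete check of the formulation, here is a minimal NumPy sketch (not from the original text; names and data are illustrative) that evaluates this objective for a fixed hyperplane and pseudo-label assignment, using hinge slacks $\xi = \max(0, 1 - y\,f(\boldsymbol{x}))$:

```python
import numpy as np

def tsvm_objective(w, b, X_l, y_l, X_u, y_u_pseudo, C_l=1.0, C_u=0.01):
    """Evaluate the TSVM objective for a fixed hyperplane (w, b) and pseudo-labels."""
    xi_l = np.maximum(0.0, 1.0 - y_l * (X_l @ w + b))         # slacks of labeled samples
    xi_u = np.maximum(0.0, 1.0 - y_u_pseudo * (X_u @ w + b))  # slacks of pseudo-labeled samples
    return 0.5 * np.dot(w, w) + C_l * xi_l.sum() + C_u * xi_u.sum()

# Tiny illustrative call with random data
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(5, 2)), np.array([1, -1, 1, -1, 1])
X_u, y_u = rng.normal(size=(8, 2)), rng.choice([-1, 1], size=8)
print(tsvm_objective(np.array([0.5, -0.3]), 0.1, X_l, y_l, X_u, y_u))
```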
Train a standard SVM using only the labeled samples $D_l = \{(\boldsymbol{x}_1, y_1),\dots,(\boldsymbol{x}_l, y_l)\}$. Use this model to predict pseudo-labels $\hat{y}_j$ for the unlabeled samples $D_u = \{\boldsymbol{x}_{l+1},\dots,\boldsymbol{x}_m\}$.
Example: If the SVM predicts $f(\boldsymbol{x}_j) = \boldsymbol{w}^\top\boldsymbol{x}_j + b > 0$, assign $\hat{y}_j = +1$; otherwise assign $\hat{y}_j = -1$.
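A minimal scikit-learn sketch of this initialization step (the toy data, variable names, and linear kernel are assumptions, not from the original text):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 50 labeled samples (labels in {-1, +1}) and 150 unlabeled samples
rng = np.random.default_rng(0)
X_l = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
y_l = np.array([-1] * 25 + [1] * 25)
X_u = rng.normal(0, 2, (150, 2))

# Step 1: fit a standard SVM on the labeled data only
svm_init = SVC(kernel="linear", C=1.0)
svm_init.fit(X_l, y_l)

# Pseudo-labels from the sign of the decision function f(x) = w.x + b
y_u_pseudo = np.where(svm_init.decision_function(X_u) > 0, 1, -1)
```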
Set the initial $C_u$ to a value much smaller than $C_l$ (e.g., $C_u = 10^{-3}\,C_l$). Solve the optimization problem above with the current pseudo-labels to obtain the hyperplane $(\boldsymbol{w}, b)$.
Note: A lower $C_u$ means a smaller penalty for pseudo-label errors, allowing the algorithm to explore different pseudo-label assignments.
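One way to realize this weighted problem with scikit-learn (a sketch under the assumption that per-sample weights are an acceptable stand-in for the two penalty terms) is to fit on the union of labeled and pseudo-labeled samples, with `sample_weight` set to $C_l$ for labeled points and $C_u$ for pseudo-labeled points:

```python
import numpy as np
from sklearn.svm import SVC

# Toy setup: 50 labeled and 150 unlabeled samples, pseudo-labeled by an initial SVM
rng = np.random.default_rng(0)
X_l = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
y_l = np.array([-1] * 25 + [1] * 25)
X_u = rng.normal(0, 2, (150, 2))
y_u_pseudo = SVC(kernel="linear").fit(X_l, y_l).predict(X_u)

C_l, C_u = 1.0, 1e-3          # start with low confidence in the pseudo-labels

# Step 2: re-fit on labeled + pseudo-labeled data; SVC multiplies C by sample_weight,
# so these weights play the roles of C_l and C_u in the objective.
X_all = np.vstack([X_l, X_u])
y_all = np.concatenate([y_l, y_u_pseudo])
weights = np.concatenate([np.full(len(X_l), C_l), np.full(len(X_u), C_u)])

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_all, y_all, sample_weight=weights)
```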
Check whether the pseudo-labels are consistent with the margin: if there exist two unlabeled samples with opposite pseudo-labels ($\hat{y}_i \hat{y}_j < 0$) whose slacks satisfy $\xi_i > 0$, $\xi_j > 0$, and $\xi_i + \xi_j > 2$, the pair is likely mislabeled, so swap the two pseudo-labels and re-solve the optimization.
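A small helper illustrating this check (a sketch that assumes slacks are computed as hinge losses of the current decision function; the function name is hypothetical):

```python
import numpy as np

def find_conflicting_pair(y_pseudo, decision_values):
    """Return indices (i, j) of a pseudo-label pair violating the consistency condition:
    opposite labels, both slacks positive, and xi_i + xi_j > 2. Returns None otherwise."""
    xi = np.maximum(0.0, 1.0 - y_pseudo * decision_values)   # hinge slacks
    for i in range(len(y_pseudo)):
        for j in range(i + 1, len(y_pseudo)):
            if y_pseudo[i] * y_pseudo[j] < 0 and xi[i] > 0 and xi[j] > 0 and xi[i] + xi[j] > 2:
                return i, j
    return None

# Samples 2 (pseudo-label +1 but f(x) = -0.9) and 3 (pseudo-label -1 but f(x) = +0.6) conflict
print(find_conflicting_pair(np.array([1, -1, 1, -1]),
                            np.array([1.2, -1.1, -0.9, 0.6])))   # -> (2, 3)
```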
Increase $C_u$ (e.g., $C_u \leftarrow \min(2C_u, C_l)$) and repeat steps 2-3 until $C_u = C_l$.
Rationale: Gradually increasing confidence in pseudo-labels allows the algorithm to refine the decision boundary progressively.
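Putting the steps together, here is a compact end-to-end sketch of the iterative procedure (a simplified illustration rather than a faithful TSVM solver; the function name, doubling schedule, and toy data are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_fit(X_l, y_l, X_u, C_l=1.0, C_u_init=1e-3, kernel="linear"):
    """Simplified TSVM loop: pseudo-label, re-fit with growing C_u, swap conflicting pairs."""
    # Step 1: initial SVM on labeled data, then pseudo-label the unlabeled data
    svm = SVC(kernel=kernel, C=1.0).fit(X_l, y_l)
    y_u = np.where(svm.decision_function(X_u) > 0, 1, -1)

    X_all = np.vstack([X_l, X_u])
    C_u = C_u_init
    while True:
        # Step 2: re-fit on labeled + pseudo-labeled data with weights C_l / C_u
        weights = np.concatenate([np.full(len(X_l), C_l), np.full(len(X_u), C_u)])
        svm = SVC(kernel=kernel, C=1.0).fit(
            X_all, np.concatenate([y_l, y_u]), sample_weight=weights)

        # Step 3: swap pseudo-label pairs that conflict with the margin, re-fitting each time
        swapped = True
        while swapped:
            swapped = False
            xi = np.maximum(0.0, 1.0 - y_u * svm.decision_function(X_u))
            for i in range(len(y_u)):
                for j in range(i + 1, len(y_u)):
                    if y_u[i] * y_u[j] < 0 and xi[i] > 0 and xi[j] > 0 and xi[i] + xi[j] > 2:
                        y_u[i], y_u[j] = y_u[j], y_u[i]
                        svm = SVC(kernel=kernel, C=1.0).fit(
                            X_all, np.concatenate([y_l, y_u]), sample_weight=weights)
                        swapped = True
                        break
                if swapped:
                    break

        # Step 4: raise confidence in the pseudo-labels until C_u reaches C_l
        if C_u >= C_l:
            return svm, y_u
        C_u = min(2 * C_u, C_l)

# Usage on separable toy data
rng = np.random.default_rng(0)
X_l = np.vstack([rng.normal(-2, 0.8, (25, 2)), rng.normal(2, 0.8, (25, 2))])
y_l = np.array([-1] * 25 + [1] * 25)
X_u = np.vstack([rng.normal(-2, 0.8, (75, 2)), rng.normal(2, 0.8, (75, 2))])
model, final_pseudo_labels = tsvm_fit(X_l, y_l, X_u)
```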
The final pseudo-labels of the unlabeled samples are the prediction results (transductive learning: the goal is to label the given unlabeled samples themselves, rather than to generalize to unseen data).
If the pseudo-labels result in severe class imbalance (e.g., 90% positive, 10% negative), SVM training can become biased toward the majority class. Solution: split $C_u$ into $C_u^+$ and $C_u^-$, separate penalty parameters for the pseudo-positive and pseudo-negative samples, with

$$\frac{C_u^+}{C_u^-} = \frac{u_-}{u_+}$$

Where:

- $u_+$ and $u_-$ are the numbers of unlabeled samples currently pseudo-labeled positive and negative, respectively
- the class with more pseudo-labels receives the smaller per-sample penalty, so the total influence of the two classes is balanced ($C_u^+ u_+ = C_u^- u_-$)

Example: If after pseudo-labeling we have 90 positive and 10 negative samples, then $C_u^+ / C_u^- = 10/90 = 1/9$: each pseudo-positive sample is penalized with one ninth of the weight of a pseudo-negative sample, so the two classes contribute equally to the objective ($90\,C_u^+ = 10\,C_u^-$).
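A short sketch of this balancing rule (variable names assumed; only the ratio $C_u^+/C_u^- = u_-/u_+$ is prescribed, so keeping $C_u^- = C_u$ here is one convention among several):

```python
import numpy as np

def balanced_unlabeled_weights(y_u_pseudo, C_u):
    """Per-sample penalties for unlabeled data with C_u_plus / C_u_minus = u_minus / u_plus."""
    u_plus = int(np.sum(y_u_pseudo == 1))
    u_minus = int(np.sum(y_u_pseudo == -1))
    C_u_plus = C_u * u_minus / max(u_plus, 1)   # the larger pseudo-class gets the relatively smaller weight
    C_u_minus = C_u
    return np.where(y_u_pseudo == 1, C_u_plus, C_u_minus)

# 90 pseudo-positive, 10 pseudo-negative: positives weighted 10/90 = 1/9 of negatives
y_u = np.array([1] * 90 + [-1] * 10)
weights = balanced_unlabeled_weights(y_u, C_u=1.0)
print(weights[0], weights[-1])   # ~0.111 and 1.0
```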
Apply TSVM to diagnose a disease using 200 patient records. Only 50 patients have confirmed diagnoses (labeled), while the remaining 150 lack a diagnosis (unlabeled). Features include age, fever temperature, cough presence, and fatigue level.
| ID | Age | Fever (°C) | Cough | Fatigue | Label | Type |
|---|---|---|---|---|---|---|
| 1 | 45 | 38.5 | Yes | Yes | Positive | labeled |
| 2 | 32 | 36.8 | No | No | Negative | labeled |
| 3 | 58 | 37.2 | No | Yes | Negative | labeled |
| 4 | 29 | 38.8 | Yes | Yes | Positive | labeled |
| 5 | 41 | 36.9 | No | No | Negative | labeled |
| 6 | 52 | 37.5 | Yes | No | Unknown | unlabeled |
| 7 | 35 | 38.2 | Yes | Yes | Unknown | unlabeled |
| 8 | 48 | 36.7 | No | No | Unknown | unlabeled |
| 9 | 27 | 37.8 | No | Yes | Unknown | unlabeled |
| 10 | 61 | 38.9 | Yes | Yes | Unknown | unlabeled |
Dataset: 200 patients (50 labeled: 25 Positive, 25 Negative; 150 unlabeled). Binary classification: Positive (disease present) vs Negative (disease absent).
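For illustration, the ten table rows above could be encoded as a feature matrix as follows (the Yes/No-to-1/0 encoding, column order, and label encoding are assumptions, not specified in the original):

```python
import numpy as np

# Columns: age, fever (°C), cough (1 = Yes, 0 = No), fatigue (1 = Yes, 0 = No)
X_labeled = np.array([
    [45, 38.5, 1, 1],   # patient 1, Positive
    [32, 36.8, 0, 0],   # patient 2, Negative
    [58, 37.2, 0, 1],   # patient 3, Negative
    [29, 38.8, 1, 1],   # patient 4, Positive
    [41, 36.9, 0, 0],   # patient 5, Negative
])
y_labeled = np.array([1, -1, -1, 1, -1])    # Positive -> +1, Negative -> -1

X_unlabeled = np.array([
    [52, 37.5, 1, 0],   # patient 6
    [35, 38.2, 1, 1],   # patient 7
    [48, 36.7, 0, 0],   # patient 8
    [27, 37.8, 0, 1],   # patient 9
    [61, 38.9, 1, 1],   # patient 10
])
# The full study uses 50 labeled and 150 unlabeled patients; features with different
# scales (age vs. temperature) should be standardized before SVM training.
```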
Train a standard SVM on the 50 labeled patients. Predict pseudo-labels for the 150 unlabeled patients.
Optimize with a small $C_u$ (low confidence in the pseudo-labels). Check for margin conflicts and swap 12 conflicting pseudo-labels.
Continue the optimization while gradually increasing $C_u$. The decision boundary shifts toward low-density regions.
The final model achieves 91% accuracy on the test set (vs. 84% using labeled samples only). The decision boundary passes through the low-density region between the positive and negative clusters.