Master constrained clustering with must-link and cannot-link constraints. Learn how to incorporate domain knowledge into clustering algorithms to improve cluster quality and interpretability.
Semi-supervised clustering incorporates supervisory information (constraints or labeled samples) into the clustering process. Unlike traditional unsupervised clustering, it uses domain knowledge to guide cluster formation, resulting in more meaningful and interpretable clusters.
Specifies that two samples must be assigned to the same cluster.
Examples: two customers from the same household (as in the case study below), or two samples known to share the same class label.
Specifies that two samples cannot be assigned to the same cluster.
Examples: two customers known to come from different households, or two samples known to have different class labels.
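In code, both constraint types are commonly represented as sets of sample-index pairs, as in the sketch below (variable names are illustrative; the pairs are the household constraints from the case study later in this section):

```python
# Must-link set M: each index pair must end up in the same cluster.
# Cannot-link set C: each index pair must end up in different clusters.
must_link = {(4, 25), (1, 6), (4, 9)}   # same household
cannot_link = {(8, 2)}                  # different households
```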
Extends traditional k-means by adding constraint checking during the cluster-assignment step: each sample is assigned to the nearest cluster center for which the assignment violates no must-link or cannot-link constraint.
Randomly select k samples from dataset D as initial cluster centers μ₁, μ₂, …, μₖ.
For each sample xᵢ: compute its distance to every cluster center and assign xᵢ to the nearest cluster whose assignment violates no must-link or cannot-link constraint (if no cluster is feasible, the constraints cannot all be satisfied).
For each cluster j, update its center to the mean of the samples currently assigned to it: μⱼ = (1/|Cⱼ|) Σ x, summing over all samples x in cluster Cⱼ.
Repeat steps 2-3 until the cluster centers no longer change (convergence), then output the final cluster assignment.
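Putting the four steps together, here is a minimal NumPy sketch of the procedure. The function names (`violates_constraints`, `constrained_kmeans`), the nearest-feasible assignment order, and the error raised when a sample has no feasible cluster are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def violates_constraints(i, cluster, labels, must_link, cannot_link):
    """True if assigning sample i to `cluster` would break any constraint.
    `labels` holds the current partial assignment; -1 means unassigned."""
    for a, b in must_link:
        if a == i and labels[b] not in (-1, cluster):
            return True
        if b == i and labels[a] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        if (a == i and labels[b] == cluster) or (b == i and labels[a] == cluster):
            return True
    return False

def constrained_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random init
    for _ in range(max_iter):
        labels = np.full(len(X), -1)
        for i, x in enumerate(X):                               # step 2: assignment
            # Try clusters from nearest to farthest; take the first feasible one.
            for j in np.argsort(np.linalg.norm(centers - x, axis=1)):
                if not violates_constraints(i, j, labels, must_link, cannot_link):
                    labels[i] = j
                    break
            if labels[i] == -1:
                raise ValueError(f"no feasible cluster for sample {i}")
        new_centers = centers.copy()
        for j in range(k):                                      # step 3: update centers
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):                   # step 4: convergence
            return labels, new_centers
        centers = new_centers
    return labels, centers
```

With the customer data below, this would be called as `constrained_kmeans(X, 3, must_link, cannot_link)`, where `X` is the 200×3 matrix of age, income, and spending values; feature scaling would normally be applied first, since income otherwise dominates the Euclidean distance.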
Uses labeled samples (seeds) to initialize the clusters: each initial center is the mean of the seeds labeled for that cluster. Seeds stay fixed in their assigned clusters, and only the unlabeled samples are reassigned during iteration. This provides stronger guidance than pairwise constraints alone.
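A corresponding sketch of the seeded variant, under the assumption that every cluster has at least one labeled seed; the function name `seeded_kmeans` and the `seeds` dictionary layout are illustrative:

```python
import numpy as np

def seeded_kmeans(X, k, seeds, max_iter=100):
    """seeds: dict mapping sample index -> known cluster label in 0..k-1.
    Assumes every cluster has at least one seed."""
    X = np.asarray(X, dtype=float)
    # Initialize each center as the mean of the seeds labeled for that cluster.
    centers = np.array([X[[i for i, c in seeds.items() if c == j]].mean(axis=0)
                        for j in range(k)])
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        for i, x in enumerate(X):
            if i in seeds:
                labels[i] = seeds[i]          # seed labels stay fixed
            else:
                labels[i] = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```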
Apply constrained k-means to group 200 customers into 3 segments. Domain knowledge provides constraints: customers from the same household (must-link) and customers from different households (cannot-link).
| ID | Age | Income | Spending | Constraints |
|---|---|---|---|---|
| 1 | 28 | $45,000 | $3,200 | None |
| 2 | 45 | $85,000 | $8,500 | None |
| 3 | 22 | $28,000 | $1,200 | None |
| 4 | 52 | $120,000 | $12,000 | must-link: 25 |
| 5 | 35 | $65,000 | $4,800 | None |
| 6 | 31 | $58,000 | $4,200 | must-link: 1 |
| 7 | 48 | $95,000 | $9,800 | None |
| 8 | 29 | $42,000 | $2,800 | cannot-link: 2 |
| 9 | 55 | $110,000 | $11,500 | must-link: 4 |
| 10 | 26 | $35,000 | $2,100 | None |
Dataset: 200 customers. Constraints: M = {(4,25), (1,6), (4,9)} (same household), C = {(8,2)} (different households). Goal: 3 clusters (Budget, Moderate, Affluent).
Randomly select 3 customers as initial centers: μ₁, μ₂, μ₃.
Customer 4 is closest to μ₂ but has a must-link with customer 25. Check: if customer 25 has already been assigned to a cluster other than cluster 2, exclude cluster 2 and assign customer 4 to the nearest feasible cluster. Customer 8 is closest to μ₁ but has a cannot-link with customer 2. Check: if customer 2 is already in cluster 1, exclude cluster 1 and assign customer 8 to the next nearest feasible cluster.
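The two exclusion checks can be expressed directly in code. The partial assignment below is hypothetical (customers 25 and 2 already placed in the first cluster, written as index 0), chosen only to reproduce the two cases just described:

```python
import numpy as np

must_link = {(4, 25), (1, 6), (4, 9)}
cannot_link = {(8, 2)}

# Hypothetical partial assignment (0-indexed clusters): customers 25 and 2
# have already been placed in cluster 0; -1 marks "not yet assigned".
labels = np.full(200, -1)
labels[25] = 0
labels[2] = 0

def feasible(i, cluster, labels, must_link, cannot_link):
    """True if assigning sample i to `cluster` violates no constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] not in (-1, cluster):
            return False
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] == cluster:
            return False
    return True

print(feasible(4, 1, labels, must_link, cannot_link))  # False: must-link partner 25 sits in cluster 0
print(feasible(4, 0, labels, must_link, cannot_link))  # True:  customer 4 joins customer 25
print(feasible(8, 0, labels, must_link, cannot_link))  # False: cannot-link partner 2 is in cluster 0
```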
Continue clustering with constraint checking. Update centers after each iteration.
Clusters: C₁ (Budget: 65 customers), C₂ (Moderate: 72 customers), C₃ (Affluent: 63 customers). All constraints satisfied: customers 4, 25, 9 in same cluster; customers 1, 6 in same cluster; customer 8 separated from customer 2. Cluster quality improved compared to unconstrained k-means.
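As a final sanity check, the claim that all constraints are satisfied can be verified programmatically; `constraints_satisfied` below is an illustrative helper, and `labels` stands for the assignment array returned by the earlier sketch:

```python
def constraints_satisfied(labels, must_link, cannot_link):
    """Verify a completed assignment against both constraint sets."""
    return (all(labels[a] == labels[b] for a, b in must_link) and
            all(labels[a] != labels[b] for a, b in cannot_link))

# e.g. constraints_satisfied(labels, {(4, 25), (1, 6), (4, 9)}, {(8, 2)})
# should return True for the final assignment described above.
```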