
Semi-Supervised Clustering

Master constrained clustering with must-link and cannot-link constraints. Learn how to incorporate domain knowledge into clustering algorithms to improve cluster quality and interpretability.

Module 6 of 6
Intermediate to Advanced
100-130 min

What is Semi-Supervised Clustering?

Semi-supervised clustering incorporates supervisory information (constraints or labeled samples) into the clustering process. Unlike traditional unsupervised clustering, it uses domain knowledge to guide cluster formation, resulting in more meaningful and interpretable clusters.

Types of Supervisory Information

  • Must-Link (M): the sample pair $(x_i, x_j)$ must belong to the same cluster
  • Cannot-Link (C): the sample pair $(x_i, x_j)$ must belong to different clusters
  • Seeded samples: A small set of labeled samples (seeds) with known cluster assignments

Advantages

  • Incorporates domain knowledge
  • More interpretable clusters
  • Better cluster quality
  • Handles constraints naturally

Challenges

  • Requires constraint information
  • Constraint conflicts possible
  • More complex algorithms
  • Need to validate constraints

Constraint Types

Must-Link Constraint (M)

Specifies that two samples must be assigned to the same cluster.

Examples:

  • Two photos from the same user must be in the same cluster
  • Documents from the same author must be grouped together
  • Products from the same category must be clustered together

$M = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ must be in the same cluster}\}$

Cannot-Link Constraint (C)

Specifies that two samples cannot be assigned to the same cluster.

Examples:

  • Photos from different users cannot be in the same cluster
  • Documents from different authors must be separated
  • Products from different categories cannot be grouped together

$C = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ cannot be in the same cluster}\}$
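
To make the two constraint types concrete, here is a minimal sketch of how M and C can be stored as sets of index pairs and checked against a partial assignment. The pair format (zero-based sample indices) and the `violates` helper are illustrative assumptions, not a standard API.

```python
must_link = {(3, 8), (0, 5)}   # pairs that must share a cluster
cannot_link = {(7, 1)}         # pairs that must be separated

def violates(i, cluster, assignment, must_link, cannot_link):
    """Return True if placing sample i into `cluster` would break a
    constraint, given `assignment`: a dict of already-placed samples."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] != cluster:
                return True  # must-link partner sits in another cluster
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] == cluster:
                return True  # cannot-link partner sits in this cluster
    return False
```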

Constrained k-Means Algorithm

Constrained k-means extends traditional k-means by adding a constraint check during cluster assignment: each sample is assigned to the nearest cluster center for which the assignment satisfies all must-link and cannot-link constraints.

Step 1

Initialization

Randomly select k samples from dataset D as initial cluster centers $\{\mu_1, \mu_2, \ldots, \mu_k\}$.

Step 2

Iterative Clustering

For each sample $x_i$:

  a) Compute the distance $d_{ij} = \|x_i - \mu_j\|_2$ to every cluster center
  b) Find the nearest cluster $r = \arg\min_{j \in K} d_{ij}$, where $K$ is the set of clusters not yet excluded
  c) Constraint check:
    • If assigning $x_i$ to $C_r$ violates a must-link constraint (e.g., $x_i$ must link with $x_j$ but $x_j \notin C_r$), exclude $r$ and try the next nearest cluster
    • If assigning $x_i$ to $C_r$ violates a cannot-link constraint (e.g., $x_i$ cannot link with $x_j$ but $x_j \in C_r$), exclude $r$ and try the next nearest cluster
    • If every cluster violates some constraint, report a constraint conflict (error)
  d) Assign $x_i$ to cluster $C_r$

Step 3

Update Cluster Centers

For each cluster j, update center:

$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$

Step 4

Termination

Repeat steps 2-3 until the cluster centers no longer change (convergence). Output the final cluster assignment $\{C_1, C_2, \ldots, C_k\}$.
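
Putting steps 1-4 together, the following sketch implements the loop with NumPy. The function name `cop_kmeans` and the conflict handling (raising an error when no feasible cluster remains, per step 2c) are choices made for this illustration, not a library API.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    """Constrained k-means: each sample is assigned to the nearest
    center whose cluster does not break a must-link/cannot-link pair."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # Step 1

    def violates(i, c, assign):
        """True if placing sample i in cluster c breaks a constraint."""
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if other in assign and assign[other] != c:
                    return True
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if other in assign and assign[other] == c:
                    return True
        return False

    for _ in range(max_iter):
        assign = {}
        for i in range(len(X)):                             # Step 2
            # Try clusters nearest-first, skipping infeasible ones (2a-2c).
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            for c in order:
                if not violates(i, int(c), assign):
                    assign[i] = int(c)                      # Step 2d
                    break
            else:
                raise ValueError(f"constraint conflict at sample {i}")
        new_centers = centers.copy()                        # Step 3
        for j in range(k):
            members = [i for i, c in assign.items() if c == j]
            if members:
                new_centers[j] = X[members].mean(axis=0)
        if np.allclose(new_centers, centers):               # Step 4
            break
        centers = new_centers
    return assign, centers
```

The greedy nearest-first loop mirrors step 2: it only falls back to the next nearest cluster when a constraint rules the best one out, and the `for ... else` raise corresponds to the constraint-conflict error in step 2c.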

Seeded Constrained k-Means

Uses labeled samples (seeds) to initialize the cluster centers. Seeds stay fixed in their assigned clusters, and only unlabeled samples are clustered. This provides stronger guidance than constraints alone.

Algorithm Steps

  1. Seed initialization: Let $S_j$ be the set of labeled seed samples for cluster $j$. Fix all seeds: $\forall x \in S_j, x \in C_j$ (seeds cannot be reassigned).
  2. Initial cluster centers: Compute the initial centers from the seeds only:
     $\mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x$
  3. Iterative clustering: For each unlabeled sample $x_i \in D \setminus \bigcup_j S_j$:
    • Compute the distance to all cluster centers
    • Assign it to the nearest cluster (no constraint checking is needed if the seeds are correct)
  4. Update centers: Update $\mu_j$ using both the seeds and the newly assigned unlabeled samples
  5. Termination: Repeat until convergence
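
A sketch of the seeded variant, assuming `seeds` is a dict mapping each cluster id 0..k-1 to a list of labeled sample indices; the function name and input format are assumptions made for this illustration.

```python
import numpy as np

def seeded_kmeans(X, seeds, max_iter=100):
    """Seeded k-means: seed samples stay fixed in their clusters;
    only unlabeled samples are (re)assigned each iteration."""
    X = np.asarray(X, dtype=float)
    k = len(seeds)
    labels = np.full(len(X), -1, dtype=int)
    for j in range(k):
        labels[seeds[j]] = j                  # Step 1: seeds are fixed
    unlabeled = np.flatnonzero(labels == -1)
    # Step 2: initial centers computed from the seeds only.
    centers = np.array([X[seeds[j]].mean(axis=0) for j in range(k)])
    for _ in range(max_iter):
        for i in unlabeled:                   # Step 3: nearest-center rule
            labels[i] = int(np.argmin(np.linalg.norm(centers - X[i], axis=1)))
        # Step 4: update centers from seeds plus assigned samples.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                             # Step 5: converged
        centers = new_centers
    return labels, centers
```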

Advantages of Seeded Approach

  • Stronger guidance: Seeds provide clear cluster structure
  • Faster convergence: Initial centers are better positioned
  • More stable: Less sensitive to initialization
  • Interpretable: Clusters are anchored to known examples

Customer Grouping Example

Apply constrained k-means to group 200 customers into 3 segments. Domain knowledge provides constraints: customers from the same household (must-link) and customers from different households (cannot-link).

Customer Dataset with Constraints

| ID | Age | Income   | Spending | Constraints    |
|----|-----|----------|----------|----------------|
| 1  | 28  | $45,000  | $3,200   | None           |
| 2  | 45  | $85,000  | $8,500   | None           |
| 3  | 22  | $28,000  | $1,200   | None           |
| 4  | 52  | $120,000 | $12,000  | must-link: 25  |
| 5  | 35  | $65,000  | $4,800   | None           |
| 6  | 31  | $58,000  | $4,200   | must-link: 1   |
| 7  | 48  | $95,000  | $9,800   | None           |
| 8  | 29  | $42,000  | $2,800   | cannot-link: 2 |
| 9  | 55  | $110,000 | $11,500  | must-link: 4   |
| 10 | 26  | $35,000  | $2,100   | None           |

Dataset: 200 customers (the table shows the first 10). Constraints: M = {(4, 25), (1, 6), (4, 9)} (same household), C = {(8, 2)} (different households). Goal: 3 clusters (Budget, Moderate, Affluent).
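
As a setup sketch for this example, one might prepare the data as below, reusing the `cop_kmeans` sketch from the previous section. The standardization step and zero-based indices are assumptions; only the ten rows shown above are used, so the pair (4, 25) is omitted because customer 25 falls outside them.

```python
import numpy as np

# The ten customers from the table: (age, income, spending).
X = np.array([
    [28, 45_000, 3_200], [45, 85_000, 8_500], [22, 28_000, 1_200],
    [52, 120_000, 12_000], [35, 65_000, 4_800], [31, 58_000, 4_200],
    [48, 95_000, 9_800], [29, 42_000, 2_800], [55, 110_000, 11_500],
    [26, 35_000, 2_100],
], dtype=float)
# Standardize so income doesn't dominate the Euclidean distance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Zero-based indices: customer 1 -> 0, customer 6 -> 5, and so on.
M = {(0, 5), (3, 8)}   # must-link: customers (1, 6) and (4, 9)
C = {(7, 1)}           # cannot-link: customers (8, 2)

assign, centers = cop_kmeans(X, k=3, must_link=M, cannot_link=C)
```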

Constrained k-Means Process

Initialization:

Randomly select 3 customers as initial centers: μ₁, μ₂, μ₃.

Iteration 1:

Customer 4 is closest to μ₂ but has a must-link with customer 25; if customer 25 has already been placed in a different cluster, cluster 2 is excluded and customer 4 is assigned to the next nearest cluster. Customer 8 is closest to μ₁ but has a cannot-link with customer 2; if customer 2 is already in cluster 1, cluster 1 is excluded and customer 8 goes to the next nearest cluster.

Iterations 2-5:

Continue clustering with constraint checking. Update centers after each iteration.

Final Result (Iteration 5):

Clusters: C₁ (Budget: 65 customers), C₂ (Moderate: 72 customers), C₃ (Affluent: 63 customers). All constraints satisfied: customers 4, 25, 9 in same cluster; customers 1, 6 in same cluster; customer 8 separated from customer 2. Cluster quality improved compared to unconstrained k-means.
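
Because incorrect or conflicting constraints degrade results (see the limitations below), it is worth verifying the final assignment explicitly. A minimal check, assuming `assign` maps sample indices to cluster ids as in the sketches above:

```python
def constraints_satisfied(assign, must_link, cannot_link):
    """Check a finished assignment against both constraint sets."""
    same = all(assign[a] == assign[b] for a, b in must_link)
    apart = all(assign[a] != assign[b] for a, b in cannot_link)
    return same and apart
```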

Advantages and Limitations

Advantages

  • Incorporates domain knowledge: Uses expert knowledge effectively
  • More interpretable clusters: Clusters align with domain understanding
  • Better cluster quality: Constraints guide better separation
  • Handles constraints naturally: Algorithms designed for constraints

Limitations

  • Requires constraint information: Need domain knowledge or labeled samples
  • Constraint conflicts possible: May have contradictory constraints
  • More complex algorithms: Constraint checking adds complexity
  • Need to validate constraints: Incorrect constraints degrade performance