
Semi-Supervised Clustering

Master constrained clustering with must-link and cannot-link constraints. Learn how to incorporate domain knowledge into clustering algorithms to improve cluster quality and interpretability.

Module 6 of 6
Intermediate to Advanced
100-130 min

What is Semi-Supervised Clustering?

Semi-supervised clustering incorporates supervisory information (constraints or labeled samples) into the clustering process. Unlike traditional unsupervised clustering, it uses domain knowledge to guide cluster formation, resulting in more meaningful and interpretable clusters.

Types of Supervisory Information

  • Must-Link (M): the sample pair $(x_i, x_j)$ must belong to the same cluster
  • Cannot-Link (C): the sample pair $(x_i, x_j)$ must belong to different clusters
  • Seeded samples: A small set of labeled samples (seeds) with known cluster assignments

Advantages

  • Incorporates domain knowledge
  • More interpretable clusters
  • Better cluster quality
  • Handles constraints naturally

Challenges

  • Requires constraint information
  • Constraint conflicts possible
  • More complex algorithms
  • Need to validate constraints

Constraint Types

Must-Link Constraint (M)

Specifies that two samples must be assigned to the same cluster.

Examples:

  • Two photos from the same user must be in the same cluster
  • Documents from the same author must be grouped together
  • Products from the same category must be clustered together

$M = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ must be in the same cluster}\}$

Cannot-Link Constraint (C)

Specifies that two samples cannot be assigned to the same cluster.

Examples:

  • Photos from different users cannot be in the same cluster
  • Documents from different authors must be separated
  • Products from different categories cannot be grouped together

$C = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ cannot be in the same cluster}\}$
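
To make the two constraint types concrete, here is a minimal sketch of how M and C can be stored as sets of index pairs and checked against a partial assignment. The pair format (zero-based sample indices) and the `violates` helper are illustrative assumptions, not a standard API.

```python
must_link = {(3, 8), (0, 5)}   # pairs that must share a cluster
cannot_link = {(7, 1)}         # pairs that must be separated

def violates(i, cluster, assignment, must_link, cannot_link):
    """Return True if placing sample i into `cluster` would break a
    constraint, given `assignment`: a dict of already-placed samples."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] != cluster:
                return True  # must-link partner sits in another cluster
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] == cluster:
                return True  # cannot-link partner sits in this cluster
    return False
```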

Constrained k-Means Algorithm

Constrained k-means extends traditional k-means by adding a constraint check during cluster assignment: each sample is assigned to the nearest cluster center for which the assignment satisfies all must-link and cannot-link constraints.

Step 1

Initialization

Randomly select k samples from dataset D as initial cluster centers $\{\mu_1, \mu_2, \ldots, \mu_k\}$.

Step 2

Iterative Clustering

For each sample $x_i$:

  a) Compute the distance $d_{ij} = \|x_i - \mu_j\|_2$ to every cluster center
  b) Find the nearest cluster $r = \arg\min_{j \in K} d_{ij}$, where $K$ is the set of clusters not yet excluded
  c) Constraint check:
    • If assigning $x_i$ to $C_r$ violates a must-link constraint (e.g., $x_i$ must link with $x_j$ but $x_j \notin C_r$), exclude $r$ and try the next nearest cluster
    • If assigning $x_i$ to $C_r$ violates a cannot-link constraint (e.g., $x_i$ cannot link with $x_j$ but $x_j \in C_r$), exclude $r$ and try the next nearest cluster
    • If every cluster violates some constraint, report a constraint conflict (error)
  d) Assign $x_i$ to cluster $C_r$

Step 3

Update Cluster Centers

For each cluster j, update center:

$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$

Step 4

Termination

Repeat steps 2-3 until the cluster centers no longer change (convergence). Output the final cluster assignment $\{C_1, C_2, \ldots, C_k\}$.
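
Putting steps 1-4 together, the following sketch implements the loop with NumPy. The function name `cop_kmeans` and the conflict handling (raising an error when no feasible cluster remains, per step 2c) are choices made for this illustration, not a library API.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    """Constrained k-means: each sample is assigned to the nearest
    center whose cluster does not break a must-link/cannot-link pair."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # Step 1

    def violates(i, c, assign):
        """True if placing sample i in cluster c breaks a constraint."""
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if other in assign and assign[other] != c:
                    return True
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if other in assign and assign[other] == c:
                    return True
        return False

    for _ in range(max_iter):
        assign = {}
        for i in range(len(X)):                             # Step 2
            # Try clusters nearest-first, skipping infeasible ones (2a-2c).
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            for c in order:
                if not violates(i, int(c), assign):
                    assign[i] = int(c)                      # Step 2d
                    break
            else:
                raise ValueError(f"constraint conflict at sample {i}")
        new_centers = centers.copy()                        # Step 3
        for j in range(k):
            members = [i for i, c in assign.items() if c == j]
            if members:
                new_centers[j] = X[members].mean(axis=0)
        if np.allclose(new_centers, centers):               # Step 4
            break
        centers = new_centers
    return assign, centers
```

The greedy nearest-first loop mirrors step 2: it only falls back to the next nearest cluster when a constraint rules the best one out, and the `for ... else` raise corresponds to the constraint-conflict error in step 2c.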

Seeded Constrained k-Means

Uses labeled samples (seeds) to initialize the cluster centers. Seeds stay fixed in their assigned clusters, and only unlabeled samples are clustered. This provides stronger guidance than constraints alone.

Algorithm Steps

  1. Seed initialization: Let $S_j$ be the set of labeled seed samples for cluster $j$. Fix all seeds: $\forall x \in S_j, x \in C_j$ (seeds cannot be reassigned).
  2. Initial cluster centers: Compute the initial centers from the seeds only:
     $\mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x$
  3. Iterative clustering: For each unlabeled sample $x_i \in D \setminus \bigcup_j S_j$:
    • Compute the distance to all cluster centers
    • Assign it to the nearest cluster (no constraint checking is needed if the seeds are correct)
  4. Update centers: Update $\mu_j$ using both the seeds and the newly assigned unlabeled samples
  5. Termination: Repeat until convergence
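
A sketch of the seeded variant, assuming `seeds` is a dict mapping each cluster id 0..k-1 to a list of labeled sample indices; the function name and input format are assumptions made for this illustration.

```python
import numpy as np

def seeded_kmeans(X, seeds, max_iter=100):
    """Seeded k-means: seed samples stay fixed in their clusters;
    only unlabeled samples are (re)assigned each iteration."""
    X = np.asarray(X, dtype=float)
    k = len(seeds)
    labels = np.full(len(X), -1, dtype=int)
    for j in range(k):
        labels[seeds[j]] = j                  # Step 1: seeds are fixed
    unlabeled = np.flatnonzero(labels == -1)
    # Step 2: initial centers computed from the seeds only.
    centers = np.array([X[seeds[j]].mean(axis=0) for j in range(k)])
    for _ in range(max_iter):
        for i in unlabeled:                   # Step 3: nearest-center rule
            labels[i] = int(np.argmin(np.linalg.norm(centers - X[i], axis=1)))
        # Step 4: update centers from seeds plus assigned samples.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                             # Step 5: converged
        centers = new_centers
    return labels, centers
```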

Advantages of Seeded Approach

  • Stronger guidance: Seeds provide clear cluster structure
  • Faster convergence: Initial centers are better positioned
  • More stable: Less sensitive to initialization
  • Interpretable: Clusters are anchored to known examples

Customer Grouping Example

Apply constrained k-means to group 200 customers into 3 segments. Domain knowledge provides constraints: customers from the same household (must-link) and customers from different households (cannot-link).

Customer Dataset with Constraints

| ID | Age | Income   | Spending | Constraints    |
|----|-----|----------|----------|----------------|
| 1  | 28  | $45,000  | $3,200   | None           |
| 2  | 45  | $85,000  | $8,500   | None           |
| 3  | 22  | $28,000  | $1,200   | None           |
| 4  | 52  | $120,000 | $12,000  | must-link: 25  |
| 5  | 35  | $65,000  | $4,800   | None           |
| 6  | 31  | $58,000  | $4,200   | must-link: 1   |
| 7  | 48  | $95,000  | $9,800   | None           |
| 8  | 29  | $42,000  | $2,800   | cannot-link: 2 |
| 9  | 55  | $110,000 | $11,500  | must-link: 4   |
| 10 | 26  | $35,000  | $2,100   | None           |

Dataset: 200 customers (the table shows the first 10). Constraints: M = {(4, 25), (1, 6), (4, 9)} (same household), C = {(8, 2)} (different households). Goal: 3 clusters (Budget, Moderate, Affluent).
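
As a setup sketch for this example, one might prepare the data as below, reusing the `cop_kmeans` sketch from the previous section. The standardization step and zero-based indices are assumptions; only the ten rows shown above are used, so the pair (4, 25) is omitted because customer 25 falls outside them.

```python
import numpy as np

# The ten customers from the table: (age, income, spending).
X = np.array([
    [28, 45_000, 3_200], [45, 85_000, 8_500], [22, 28_000, 1_200],
    [52, 120_000, 12_000], [35, 65_000, 4_800], [31, 58_000, 4_200],
    [48, 95_000, 9_800], [29, 42_000, 2_800], [55, 110_000, 11_500],
    [26, 35_000, 2_100],
], dtype=float)
# Standardize so income doesn't dominate the Euclidean distance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Zero-based indices: customer 1 -> 0, customer 6 -> 5, and so on.
M = {(0, 5), (3, 8)}   # must-link: customers (1, 6) and (4, 9)
C = {(7, 1)}           # cannot-link: customers (8, 2)

assign, centers = cop_kmeans(X, k=3, must_link=M, cannot_link=C)
```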

Constrained k-Means Process

Initialization:

Randomly select 3 customers as initial centers: μ₁, μ₂, μ₃.

Iteration 1:

Customer 4 is closest to μ₂ but has a must-link with customer 25; if customer 25 has already been placed in a different cluster, cluster 2 is excluded and customer 4 is assigned to the next nearest cluster. Customer 8 is closest to μ₁ but has a cannot-link with customer 2; if customer 2 is already in cluster 1, cluster 1 is excluded and customer 8 goes to the next nearest cluster.

Iterations 2-5:

Continue clustering with constraint checking. Update centers after each iteration.

Final Result (Iteration 5):

Clusters: C₁ (Budget: 65 customers), C₂ (Moderate: 72 customers), C₃ (Affluent: 63 customers). All constraints satisfied: customers 4, 25, 9 in same cluster; customers 1, 6 in same cluster; customer 8 separated from customer 2. Cluster quality improved compared to unconstrained k-means.
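
Because incorrect or conflicting constraints degrade results (see the limitations below), it is worth verifying the final assignment explicitly. A minimal check, assuming `assign` maps sample indices to cluster ids as in the sketches above:

```python
def constraints_satisfied(assign, must_link, cannot_link):
    """Check a finished assignment against both constraint sets."""
    same = all(assign[a] == assign[b] for a, b in must_link)
    apart = all(assign[a] != assign[b] for a, b in cannot_link)
    return same and apart
```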

Advantages and Limitations

Advantages

  • Incorporates domain knowledge: Uses expert knowledge effectively
  • More interpretable clusters: Clusters align with domain understanding
  • Better cluster quality: Constraints guide better separation
  • Handles constraints naturally: Algorithms designed for constraints

Limitations

  • Requires constraint information: Need domain knowledge or labeled samples
  • Constraint conflicts possible: May have contradictory constraints
  • More complex algorithms: Constraint checking adds complexity
  • Need to validate constraints: Incorrect constraints degrade performance