You Have 1,000 Old Photos to Organize
Weekend cleanup. You find a pile of old photos — 1,000 of them, all mixed up. You want to sort them into "Family," "Scenery," and "Food" categories for archiving.
Look at them one by one? Massive undertaking. Can the computer help?
Attempt 1: Let the Computer Auto-Sort (Regular Clustering)
You tell the computer: "Help me split these photos into 3 piles. Put similar ones together."
The computer gets to work:
- Analyzes each photo's colors, composition, brightness
- Calculates which photos "look alike"
- Automatically creates 3 groups
10 minutes later: "Done!"
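Under the hood, "auto-sort" is just clustering on raw image statistics. A minimal sketch with scikit-learn and Pillow (the photos/ folder and the crude color-plus-brightness features are assumptions for illustration; a real system would use much richer features):

```python
import glob

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def photo_features(path):
    """Crude per-photo features: average RGB color plus overall brightness."""
    img = np.asarray(Image.open(path).convert("RGB").resize((64, 64)), dtype=float)
    return np.append(img.mean(axis=(0, 1)), img.mean())

paths = sorted(glob.glob("photos/*.jpg"))  # hypothetical folder of 1,000 photos
X = np.array([photo_features(p) for p in paths])

# "Make 3 piles, put similar ones together" -- no other information given.
piles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```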
You check the results. Total mess.
Pile 1: Family selfies + restaurant food shots mixed together (both warm tones, close-up shots)
Pile 2: Beach scenery + family photos at the beach (both have blue sky, sand)
Pile 3: Architecture + city night scenes (both have straight lines, cool tones)
What went wrong? The computer only knows "put similar ones together," but has no idea you want "Family," "Scenery," "Food" — human-defined categories.
Similar colors and composition don't mean similar category.
Attempt 2: Give the Computer a Few Hints (Semi-Supervised Clustering)
You try a different approach. Tell the computer:
Hint 1 (Must-Link):
- "These two are both of my mom, must be in the same pile."
- "These two are both Huangshan scenery, same pile."
Hint 2 (Cannot-Link):
- "This one's my dad at a restaurant (portrait), this one's the food on the table (food photo), absolutely separate."
- "This beach scenery, this family at the beach — not the same type."
Hint 3 (Seed Samples):
- "These 5 are family photos."
- "These 5 are scenery photos."
- "These 5 are food photos."
Total: 30 hints (15 must-link/cannot-link pairs + 15 seed samples).
The computer re-sorts. This time:
- Places seed samples in corresponding piles (5 family → Pile A, 5 scenery → Pile B, 5 food → Pile C)
- Calculates each pile's "feature center" (e.g., the average color and composition of the photos in it)
- Assigns the other 985 photos based on "closest to which center"
- During assignment: checks for must-link/cannot-link violations
Result? 95% accuracy.
Family photos with family. Scenery with scenery. Food with food. Occasionally a few edge cases (family at the beach vs. beach scenery) might be misplaced, but the overall quality is vastly better.
Comparing the Two Attempts
| Aspect | Attempt 1 (Regular Clustering) | Attempt 2 (Semi-Supervised) |
|---|---|---|
| Info Given | None (just "make 3 piles") | 30 hints (must-link/cannot-link + seeds) |
| Sorting Basis | Pure feature similarity (color, composition) | Feature similarity + human hints |
| Accuracy | ~60% (lots of confusion) | ~95% (mostly correct) |
| Manual Effort | 0 (fully automatic) | Very low (30 hints, 5 minutes) |
Key Finding
Providing hints for just 3% of photos (30/1000) boosts accuracy from 60% to 95% — a 35-point improvement.
This Is Semi-Supervised Clustering
That "Attempt 2" photo-sorting strategy? That's semi-supervised clustering.
Core Idea
Small amount of supervision + unsupervised clustering = better results.
- Supervision: Must-link/cannot-link constraints, or labeled samples (seeds)
- Unsupervised clustering: Automatically discover natural groupings
- Semi-supervised: Combine both — use a few hints to guide clustering direction
Why "Semi-Supervised"?
Fully supervised: Every sample labeled ("this is family," "this is scenery") — high cost
Unsupervised: No labels, purely feature-based — prone to drifting
Semi-supervised: Just a few hints (dozens suffice) guide the direction — balance cost and effectiveness
When to Use It
- Some clues, but not many: Can't label all 1,000 photos, but can provide 30 hints
- Categories not fully defined: Unsure how many piles there should be (family photos might have "gatherings" and "travel" subcategories); let the algorithm discover them
- Relationship info is easy to get: Judging "do two photos look alike" is easier than "what category is this"
Regular Clustering: Blind Sorting, Easy to Drift
Quick review of how regular clustering works.
The Task
Mixed pile of fruit: apples, oranges, bananas. Nobody tells you which is which. Sort into 3 piles, each pile should be the same fruit.
How to Sort?
By features: shape (round/long), color (red/yellow/orange), skin (smooth/rough).
- Round, red, smooth → one pile (probably apples)
- Round, orange, rough → one pile (probably oranges)
- Long, yellow → one pile (probably bananas)
Algorithm (k-means):
- Randomly pick 3 fruits as "pile centers"
- Assign each fruit to nearest center
- Recalculate average center of each pile
- Repeat steps 2-3 until the piles stop changing (see the code sketch after these steps)
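Those four steps translate almost line for line into code. A minimal NumPy sketch (fruits as feature vectors, e.g. [roundness, redness, length]; edge cases like empty piles are ignored):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k fruits as the initial pile centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each fruit to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each pile's average center.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the piles stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```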
Sounds good. But.
The Problem: Easy to Misclassify
Scenario 1: A green apple's color is closer to the oranges than to the red apples → thrown into the orange pile.
Scenario 2: A stubby banana shaped like a potato → thrown into the apple pile.
Scenario 3: A red orange exists → mixed in with the red apples.
Why? No "correct direction" hints. Algorithm only knows "put similar ones together," doesn't know what truly constitutes an "apple pile" or "orange pile."
Similar features ≠ same category. That's regular clustering's fatal flaw: pure guesswork, easy to drift.
Semi-Supervised Clustering: Add Hints, Sort Better
Core concept: small amount of supervision + unsupervised clustering = more accurate results.
"Small amount" = how much? Dozens of hints suffice. No need to label everything.
Hint Type 1: Must-Link / Cannot-Link Constraints
Tell the algorithm "which samples must be in same pile" and "which absolutely separate."
Must-Link
"These two must be in the same pile."
Examples:
- Green apple and red apple — different colors but both apples → must-link
- Round orange and oval orange — slightly different shapes but both oranges → must-link
Effect: Algorithm won't separate green and red apples due to color difference.
Cannot-Link
"These two absolutely cannot be in the same pile."
Examples:
- Red apple and red orange — similar color but not same type → cannot-link
- Stubby banana and potato — similar shape but not same type → cannot-link
Effect: Algorithm won't mix red apples and red oranges despite color similarity.
Hint Type 2: Labeled Samples (Seeds)
Directly tell the algorithm "this sample belongs to which pile."
Examples:
- This round red fruit is an apple (belongs to "apple pile")
- This orange rough fruit is an orange (belongs to "orange pile")
- This long yellow fruit is a banana (belongs to "banana pile")
With seeds, the algorithm has "anchor points." Other samples expand around seeds.
Difference Between the Two
| Type | Must-Link/Cannot-Link | Labeled Samples |
|---|---|---|
| Information | Relationships (pairwise) | Belonging (individual) |
| Flexibility | High (transitive: A-B, B-C → A-C) | Medium (direct pile assignment) |
| Annotation Cost | Low (just judge "similar or not") | Medium (need true category) |
In practice, both are often combined.
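Both kinds of hints are also cheap to represent in code. A sketch (the fruit indices are made up for illustration):

```python
# Pairwise hints: indices into the fruit list.
must_link   = [(0, 1),  # green apple <-> red apple
               (2, 3)]  # round orange <-> oval orange
cannot_link = [(0, 4),  # red apple x red orange
               (5, 6)]  # stubby banana x potato

# Seed hints: sample index -> known pile.
seeds = {0: "apple", 4: "orange", 7: "banana"}
```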
Method 1: Constrained K-Means — Follow Rules While Sorting
Regular k-means: sort however it likes, as long as within-pile similarity is high.
Constrained k-means: establish the rules before sorting; the rules can't be broken during sorting.
Step-by-Step Demo
Scenario: Sort 100 fruits (apples, oranges, bananas), k=3.
Known constraints:
- Must-link: Apple A ↔ Apple B, Orange C ↔ Orange D
- Cannot-link: Apple A ✗ Orange C, Orange C ✗ Banana E
Step 1: Random initialization
Randomly pick 3 fruits as centers:
- Center 1: Apple A
- Center 2: Orange C
- Center 3: Banana E
Step 2: Assign each fruit (with constraint checking)
For Apple B:
- Calculate distance to 3 centers
- Nearest is Center 1 (Apple A)
- Check constraints: Apple A ↔ Apple B (must-link) ✓ satisfied
- Assign to Pile 1
For Apple F (suppose we also have the must-link Apple A ↔ Apple F):
- Nearest is Center 2 (Orange C)
- Check constraints: Apple F must stay with Apple A (must-link), and Apple A ✗ Orange C (cannot-link), so Apple F can't join Orange C's pile
- Constraint violation! Can't assign to Pile 2
- Choose the second-nearest center, Center 1 (Apple A), instead
Step 3: Update pile centers
Each pile recalculates average center (average color, shape).
Step 4: Iterate
Repeat steps 2-3, checking constraints each assignment, until pile centers stop changing.
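Putting the demo together: a compact sketch of one constraint-checked assignment pass, in the spirit of COP-KMeans. Here must_link and cannot_link are assumed to be dicts mapping each sample index to its partner indices (the pair lists from earlier expanded into adjacency form; expanding must-links transitively is shown under Key Points below):

```python
import numpy as np

def violates(i, pile, labels, must_link, cannot_link):
    """Would putting sample i into `pile` break any constraint so far?"""
    for j in must_link.get(i, ()):
        if labels[j] is not None and labels[j] != pile:
            return True  # a must-link partner already sits in another pile
    for j in cannot_link.get(i, ()):
        if labels[j] == pile:
            return True  # a cannot-link partner is already in this pile
    return False

def assign_with_constraints(X, centers, must_link, cannot_link):
    """One assignment pass: nearest center first, fall back on violation."""
    labels = [None] * len(X)
    for i, x in enumerate(X):
        # Piles ordered from nearest center to farthest.
        order = np.argsort([np.linalg.norm(x - c) for c in centers])
        for pile in order:
            if not violates(i, int(pile), labels, must_link, cannot_link):
                labels[i] = int(pile)
                break
        else:
            labels[i] = int(order[0])  # every pile violates: take the nearest anyway
    return labels
```

(COP-KMeans proper aborts when a sample violates every pile; falling back to the nearest center is one soft stand-in for the "minimum violation penalty" rule described above.)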
Key Points
Constraint transitivity: A must-link B, B must-link C → A must-link C (auto-inferred)
When constraints conflict:
- Choose suboptimal pile (second-nearest center)
- If all violate, choose minimum violation penalty
Pros: Leverages relationship info, doesn't need true categories
Cons: Constraint conflicts hard to handle (A must-link B, B cannot-link C, but A must-link C), slightly higher computation (constraint checking per assignment)
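Transitivity and conflict detection are typically handled with a union-find (disjoint-set) structure: merge every must-link pair, then flag any cannot-link pair that ends up in the same group. A minimal sketch:

```python
def closure_conflicts(n, must_link, cannot_link):
    """Merge must-link pairs; return cannot-link pairs they force together."""
    parent = list(range(n))

    def find(i):  # follow parents to the group representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in must_link:  # must-link is transitive: union the pairs
        parent[find(a)] = find(b)

    return [(a, b) for a, b in cannot_link if find(a) == find(b)]

# A must-link B, A must-link C forces B~C, clashing with B cannot-link C:
print(closure_conflicts(3, must_link=[(0, 1), (0, 2)], cannot_link=[(1, 2)]))
# -> [(1, 2)]
```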
Method 2: Seeded K-Means — Use "Anchors" to Guide Sorting
More straightforward: place the labeled samples in their corresponding piles first, then let the other samples "find their home" around them.
Step-by-Step Demo
Scenario: Same 100 fruits, k=3.
Known seeds (labeled samples):
- Apple pile seeds: Apple A, Apple B (known to be apples)
- Orange pile seeds: Orange C, Orange D (known to be oranges)
- Banana pile seeds: Banana E, Banana F (known to be bananas)
Step 1: Place seeds
Put 6 seeds into 3 piles:
- Pile 1: [Apple A, Apple B]
- Pile 2: [Orange C, Orange D]
- Pile 3: [Banana E, Banana F]
Step 2: Calculate initial pile centers
Each pile computes average center:
- Center 1 = (Apple A + Apple B) / 2
- Center 2 = (Orange C + Orange D) / 2
- Center 3 = (Banana E + Banana F) / 2
Step 3: Other fruits find piles
Remaining 94 fruits assigned one by one:
For Apple G:
- Calculate distance to 3 centers
- Closest to Center 1 (apple pile)
- Assign to Pile 1
For Orange H:
- Closest to Center 2 (orange pile)
- Assign to Pile 2
Step 4: Update pile centers
After all fruits assigned, recalculate each pile's new center (including seeds + new members).
Step 5: Iterate and adjust
Repeat steps 3-4 until pile centers stabilize.
Key constraint: Seed samples' pile membership never changes. Apple A, B always stay in Pile 1, won't move to other piles.
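The full loop, as a sketch. Here seeds is assumed to be a dict mapping sample index to its fixed pile (0..k-1), with at least one seed per pile; seeds are re-pinned every iteration to enforce the rule above:

```python
import numpy as np

def seeded_kmeans(X, seeds, k, n_iters=100):
    # Steps 1-2: place the seeds and average them into initial centers.
    centers = np.array([X[[i for i, p in seeds.items() if p == j]].mean(axis=0)
                        for j in range(k)])
    for _ in range(n_iters):
        # Step 3: every fruit joins the pile with the nearest center...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for i, pile in seeds.items():  # ...but seeds never move.
            labels[i] = pile
        # Step 4: recalculate each pile's center over seeds + new members.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers stabilize.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```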
Key Points
Seeds are anchors:
- Accurate starting point (avoids bad initialization of regular k-means)
- Stable direction (other samples cluster around seeds)
Pros: Simple, intuitive; few seeds needed (2-3 per pile); fast computation (no constraint checking)
Cons: Needs true categories (slightly higher annotation cost than must-link/cannot-link); poor seed selection affects results
Comparing the Two Methods
| Dimension | Constrained K-Means | Seeded K-Means |
|---|---|---|
| Supervision | Must-link/cannot-link | Labeled samples |
| Annotation Cost | Low (judge similarity) | Medium (need category) |
| Computation | High (constraint checks) | Low (simple assignment) |
| Starting Quality | Random (possible bad start) | Good (seeds guarantee) |
| Best For | Relationship info (social networks, citations) | Few labels (image, text classification) |
In practice:
- Relationship info easy to get → constrained k-means
- Have few labels → seeded k-means
- Have both → combine
Key Takeaways
- Regular clustering problem: Purely feature-based, easy to misclassify, lacks "correct direction."
- Semi-supervised clustering core:
- Small amount of supervision (constraints or seeds)
- Guide clustering toward "correct direction"
- Don't need to label everything
- Two main methods:
- Constrained k-means: Must-link/cannot-link, follow rules while sorting
- Seeded k-means: Labeled samples as anchors, others expand around
- Applications: Photo organization, document clustering, customer segmentation — any "some hints, not many" scenario