You Have 1,000 Old Photos to Organize
Weekend cleanup. You find a pile of old photos — 1,000 of them, all mixed up. You want to sort them into "Family," "Scenery," and "Food" categories for archiving.
Look at them one by one? Massive undertaking. Can the computer help?
Attempt 1: Let the Computer Auto-Sort (Regular Clustering)
You tell the computer: "Help me split these photos into 3 piles. Put similar ones together."
The computer gets to work:
- Analyzes each photo's colors, composition, brightness
- Calculates which photos "look alike"
- Automatically creates 3 groups
10 minutes later: "Done!"
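Under the hood, "auto-sort" is just clustering on raw image statistics. A minimal sketch with scikit-learn and Pillow (the photos/ folder and the crude color-plus-brightness features are assumptions for illustration; a real system would use much richer features):

```python
import glob

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def photo_features(path):
    """Crude per-photo features: average RGB color plus overall brightness."""
    img = np.asarray(Image.open(path).convert("RGB").resize((64, 64)), dtype=float)
    return np.append(img.mean(axis=(0, 1)), img.mean())

paths = sorted(glob.glob("photos/*.jpg"))  # hypothetical folder of 1,000 photos
X = np.array([photo_features(p) for p in paths])

# "Make 3 piles, put similar ones together" -- no other information given.
piles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```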
You check the results. Total mess.
Pile 1: Family selfies + restaurant food shots mixed together (both warm tones, close-up shots)
Pile 2: Beach scenery + family photos at the beach (both have blue sky, sand)
Pile 3: Architecture + city night scenes (both have straight lines, cool tones)
What went wrong? The computer only knows "put similar ones together," but has no idea you want "Family," "Scenery," "Food" — human-defined categories.
Similar colors and composition don't mean similar category.
Attempt 2: Give the Computer a Few Hints (Semi-Supervised Clustering)
You try a different approach. Tell the computer:
Hint 1 (Must-Link):
- "These two are both of my mom, must be in the same pile."
- "These two are both Huangshan scenery, same pile."
Hint 2 (Cannot-Link):
- "This one's my dad at a restaurant (portrait), this one's the food on the table (food photo), absolutely separate."
- "This beach scenery, this family at the beach — not the same type."
Hint 3 (Seed Samples):
- "These 5 are family photos."
- "These 5 are scenery photos."
- "These 5 are food photos."
Total: 30 hints (15 must-link/cannot-link pairs + 15 seed samples).
The computer re-sorts. This time:
- Places seed samples in corresponding piles (5 family → Pile A, 5 scenery → Pile B, 5 food → Pile C)
- Calculates each pile's "feature center" (e.g., the average color and composition of the photos in it)
- Assigns the other 985 photos based on "closest to which center"
- During assignment: checks for must-link/cannot-link violations
Result? 95% accuracy.
Family photos with family. Scenery with scenery. Food with food. Occasionally a few edge cases (family at the beach vs. beach scenery) might be misplaced, but the overall quality is vastly better.
Comparing the Two Attempts
| Aspect | Attempt 1 (Regular Clustering) | Attempt 2 (Semi-Supervised) |
|---|---|---|
| Info Given | None (just "make 3 piles") | 30 hints (must-link/cannot-link + seeds) |
| Sorting Basis | Pure feature similarity (color, composition) | Feature similarity + human hints |
| Accuracy | ~60% (lots of confusion) | ~95% (mostly correct) |
| Manual Effort | 0 (fully automatic) | Very low (30 hints, 5 minutes) |
Key Finding
Providing hints for just 3% of photos (30/1000) boosts accuracy from 60% to 95% — a 35-point improvement.
This Is Semi-Supervised Clustering
That "Attempt 2" photo-sorting strategy? That's semi-supervised clustering.
Core Idea
Small amount of supervision + unsupervised clustering = better results.
- Supervision: Must-link/cannot-link constraints, or labeled samples (seeds)
- Unsupervised clustering: Automatically discover natural groupings
- Semi-supervised: Combine both — use a few hints to guide clustering direction
Why "Semi-Supervised"?
Fully supervised: Every sample labeled ("this is family," "this is scenery") — high cost
Unsupervised: No labels, purely feature-based — prone to drifting
Semi-supervised: Just a few hints (dozens suffice) guide the direction — balance cost and effectiveness
When to Use It
- Some clues, but not many: Can't label all 1,000 photos, but can provide 30 hints
- Categories not fully defined: Unsure how many piles there should be (family photos might have "gatherings" and "travel" subcategories); let the algorithm discover them
- Relationship info is easy to get: Judging "do two photos look alike" is easier than "what category is this"
Regular Clustering: Blind Sorting, Easy to Drift
Quick review of how regular clustering works.
The Task
Mixed pile of fruit: apples, oranges, bananas. Nobody tells you which is which. Sort into 3 piles, each pile should be the same fruit.
How to Sort?
By features: shape (round/long), color (red/yellow/orange), skin (smooth/rough).
- Round, red, smooth → one pile (probably apples)
- Round, orange, rough → one pile (probably oranges)
- Long, yellow → one pile (probably bananas)
Algorithm (k-means):
- Randomly pick 3 fruits as "pile centers"
- Assign each fruit to nearest center
- Recalculate average center of each pile
- Repeat steps 2-3 until the piles stop changing (see the code sketch after these steps)
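Those four steps translate almost line for line into code. A minimal NumPy sketch (fruits as feature vectors, e.g. [roundness, redness, length]; edge cases like empty piles are ignored):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k fruits as the initial pile centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each fruit to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each pile's average center.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the piles stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```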
Sounds good. But.
The Problem: Easy to Misclassify
Scenario 1: A green apple's color is closer to the oranges than to the red apples → thrown into the orange pile.
Scenario 2: A stubby banana shaped like a potato → thrown into the apple pile.
Scenario 3: A red orange exists → mixed in with the red apples.
Why? No "correct direction" hints. Algorithm only knows "put similar ones together," doesn't know what truly constitutes an "apple pile" or "orange pile."
Similar features ≠ same category. That's regular clustering's fatal flaw: pure guesswork, easy to drift.
Semi-Supervised Clustering: Add Hints, Sort Better
Core concept: small amount of supervision + unsupervised clustering = more accurate results.
"Small amount" = how much? Dozens of hints suffice. No need to label everything.
Hint Type 1: Must-Link / Cannot-Link Constraints
Tell the algorithm "which samples must be in same pile" and "which absolutely separate."
Must-Link
"These two must be in the same pile."
Examples:
- Green apple and red apple — different colors but both apples → must-link
- Round orange and oval orange — slightly different shapes but both oranges → must-link
Effect: Algorithm won't separate green and red apples due to color difference.
Cannot-Link
"These two absolutely cannot be in the same pile."
Examples:
- Red apple and red orange — similar color but not same type → cannot-link
- Stubby banana and potato — similar shape but not same type → cannot-link
Effect: Algorithm won't mix red apples and red oranges despite color similarity.
Hint Type 2: Labeled Samples (Seeds)
Directly tell the algorithm "this sample belongs to which pile."
Examples:
- This round red fruit is an apple (belongs to "apple pile")
- This orange rough fruit is an orange (belongs to "orange pile")
- This long yellow fruit is a banana (belongs to "banana pile")
With seeds, the algorithm has "anchor points." Other samples expand around seeds.
Difference Between the Two
| Type | Must-Link/Cannot-Link | Labeled Samples |
|---|---|---|
| Information | Relationships (pairwise) | Belonging (individual) |
| Flexibility | High (transitive: A-B, B-C → A-C) | Medium (direct pile assignment) |
| Annotation Cost | Low (just judge "similar or not") | Medium (need true category) |
In practice, both are often combined.
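Both kinds of hints are also cheap to represent in code. A sketch (the fruit indices are made up for illustration):

```python
# Pairwise hints: indices into the fruit list.
must_link   = [(0, 1),  # green apple <-> red apple
               (2, 3)]  # round orange <-> oval orange
cannot_link = [(0, 4),  # red apple x red orange
               (5, 6)]  # stubby banana x potato

# Seed hints: sample index -> known pile.
seeds = {0: "apple", 4: "orange", 7: "banana"}
```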
Method 1: Constrained K-Means — Follow Rules While Sorting
Regular k-means: sort however it likes, as long as within-pile similarity is high.
Constrained k-means: establish the rules before sorting; the rules can't be broken during sorting.
Step-by-Step Demo
Scenario: Sort 100 fruits (apples, oranges, bananas), k=3.
Known constraints:
- Must-link: Apple A ↔ Apple B, Orange C ↔ Orange D
- Cannot-link: Apple A ✗ Orange C, Orange C ✗ Banana E
Step 1: Random initialization
Randomly pick 3 fruits as centers:
- Center 1: Apple A
- Center 2: Orange C
- Center 3: Banana E
Step 2: Assign each fruit (with constraint checking)
For Apple B:
- Calculate distance to 3 centers
- Nearest is Center 1 (Apple A)
- Check constraints: Apple A ↔ Apple B (must-link) ✓ satisfied
- Assign to Pile 1
For Apple F (suppose we also have the must-link Apple A ↔ Apple F):
- Nearest is Center 2 (Orange C)
- Check constraints: Apple F must stay with Apple A (must-link), and Apple A ✗ Orange C (cannot-link), so Apple F can't join Orange C's pile
- Constraint violation! Can't assign to Pile 2
- Choose the second-nearest center, Center 1 (Apple A), instead
Step 3: Update pile centers
Each pile recalculates average center (average color, shape).
Step 4: Iterate
Repeat steps 2-3, checking constraints each assignment, until pile centers stop changing.
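Putting the demo together: a compact sketch of one constraint-checked assignment pass, in the spirit of COP-KMeans. Here must_link and cannot_link are assumed to be dicts mapping each sample index to its partner indices (the pair lists from earlier expanded into adjacency form; expanding must-links transitively is shown under Key Points below):

```python
import numpy as np

def violates(i, pile, labels, must_link, cannot_link):
    """Would putting sample i into `pile` break any constraint so far?"""
    for j in must_link.get(i, ()):
        if labels[j] is not None and labels[j] != pile:
            return True  # a must-link partner already sits in another pile
    for j in cannot_link.get(i, ()):
        if labels[j] == pile:
            return True  # a cannot-link partner is already in this pile
    return False

def assign_with_constraints(X, centers, must_link, cannot_link):
    """One assignment pass: nearest center first, fall back on violation."""
    labels = [None] * len(X)
    for i, x in enumerate(X):
        # Piles ordered from nearest center to farthest.
        order = np.argsort([np.linalg.norm(x - c) for c in centers])
        for pile in order:
            if not violates(i, int(pile), labels, must_link, cannot_link):
                labels[i] = int(pile)
                break
        else:
            labels[i] = int(order[0])  # every pile violates: take the nearest anyway
    return labels
```

(COP-KMeans proper aborts when a sample violates every pile; falling back to the nearest center is one soft stand-in for the "minimum violation penalty" rule described above.)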
Key Points
Constraint transitivity: A must-link B, B must-link C → A must-link C (auto-inferred)
When constraints conflict:
- Choose suboptimal pile (second-nearest center)
- If all violate, choose minimum violation penalty
Pros: Leverages relationship info, doesn't need true categories
Cons: Constraint conflicts hard to handle (A must-link B, B cannot-link C, but A must-link C), slightly higher computation (constraint checking per assignment)
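Transitivity and conflict detection are typically handled with a union-find (disjoint-set) structure: merge every must-link pair, then flag any cannot-link pair that ends up in the same group. A minimal sketch:

```python
def closure_conflicts(n, must_link, cannot_link):
    """Merge must-link pairs; return cannot-link pairs they force together."""
    parent = list(range(n))

    def find(i):  # follow parents to the group representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in must_link:  # must-link is transitive: union the pairs
        parent[find(a)] = find(b)

    return [(a, b) for a, b in cannot_link if find(a) == find(b)]

# A must-link B, A must-link C forces B~C, clashing with B cannot-link C:
print(closure_conflicts(3, must_link=[(0, 1), (0, 2)], cannot_link=[(1, 2)]))
# -> [(1, 2)]
```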
Method 2: Seeded K-Means — Use "Anchors" to Guide Sorting
More straightforward: place the labeled samples in their corresponding piles first, then let the other samples "find their home" around them.
Step-by-Step Demo
Scenario: Same 100 fruits, k=3.
Known seeds (labeled samples):
- Apple pile seeds: Apple A, Apple B (known to be apples)
- Orange pile seeds: Orange C, Orange D (known to be oranges)
- Banana pile seeds: Banana E, Banana F (known to be bananas)
Step 1: Place seeds
Put 6 seeds into 3 piles:
- Pile 1: [Apple A, Apple B]
- Pile 2: [Orange C, Orange D]
- Pile 3: [Banana E, Banana F]
Step 2: Calculate initial pile centers
Each pile computes average center:
- Center 1 = (Apple A + Apple B) / 2
- Center 2 = (Orange C + Orange D) / 2
- Center 3 = (Banana E + Banana F) / 2
Step 3: Other fruits find piles
Remaining 94 fruits assigned one by one:
For Apple G:
- Calculate distance to 3 centers
- Closest to Center 1 (apple pile)
- Assign to Pile 1
For Orange H:
- Closest to Center 2 (orange pile)
- Assign to Pile 2
Step 4: Update pile centers
After all fruits assigned, recalculate each pile's new center (including seeds + new members).
Step 5: Iterate and adjust
Repeat steps 3-4 until pile centers stabilize.
Key constraint: Seed samples' pile membership never changes. Apple A, B always stay in Pile 1, won't move to other piles.
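The full loop, as a sketch. Here seeds is assumed to be a dict mapping sample index to its fixed pile (0..k-1), with at least one seed per pile; seeds are re-pinned every iteration to enforce the rule above:

```python
import numpy as np

def seeded_kmeans(X, seeds, k, n_iters=100):
    # Steps 1-2: place the seeds and average them into initial centers.
    centers = np.array([X[[i for i, p in seeds.items() if p == j]].mean(axis=0)
                        for j in range(k)])
    for _ in range(n_iters):
        # Step 3: every fruit joins the pile with the nearest center...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for i, pile in seeds.items():  # ...but seeds never move.
            labels[i] = pile
        # Step 4: recalculate each pile's center over seeds + new members.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers stabilize.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```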
Key Points
Seeds are anchors:
- Accurate starting point (avoids bad initialization of regular k-means)
- Stable direction (other samples cluster around seeds)
Pros: Simple, intuitive; few seeds needed (2-3 per pile); fast computation (no constraint checking)
Cons: Needs true categories (slightly higher annotation cost than must-link/cannot-link); poor seed selection affects results
Comparing the Two Methods
| Dimension | Constrained K-Means | Seeded K-Means |
|---|---|---|
| Supervision | Must-link/cannot-link | Labeled samples |
| Annotation Cost | Low (judge similarity) | Medium (need category) |
| Computation | High (constraint checks) | Low (simple assignment) |
| Starting Quality | Random (possible bad start) | Good (seeds guarantee) |
| Best For | Relationship info (social networks, citations) | Few labels (image, text classification) |
In practice:
- Relationship info easy to get → constrained k-means
- Have few labels → seeded k-means
- Have both → combine
Key Takeaways
- Regular clustering problem: Purely feature-based, easy to misclassify, lacks "correct direction."
- Semi-supervised clustering core:
- Small amount of supervision (constraints or seeds)
- Guide clustering toward "correct direction"
- Don't need to label everything
- Two main methods:
- Constrained k-means: Must-link/cannot-link, follow rules while sorting
- Seeded k-means: Labeled samples as anchors, others expand around
- Applications: Photo organization, document clustering, customer segmentation — any "some hints, not many" scenario