Master fundamental clustering algorithms: k-means, hierarchical clustering, and DBSCAN
Better initialization strategy (k-means++) to avoid poor local minima: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to D(x)², the squared distance from x to its nearest already-chosen centroid.
Single Linkage (Min)
Minimum distance between any pair of points across the two clusters
✗ Chain effect (long, thin clusters)
Complete Linkage (Max)
Maximum distance between any pair of points across the two clusters
✓ Compact clusters, less sensitive to outliers
Average Linkage
Mean pairwise distance between points across the two clusters
✓ Balanced, popular choice
Centroid/Ward Linkage
Distance between centroids, or minimize within-cluster variance
✓ Often best performance
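The linkage criteria above can be sketched in a few lines of pure Python. This is a minimal, naive agglomerative clustering (O(n³)-ish, fine for small data); the function names `agglomerative` and `linkage_dist` are illustrative, not from any library:

```python
from itertools import combinations

def pair_dist(a, b):
    # Euclidean distance between two points (tuples of coordinates).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linkage_dist(c1, c2, linkage="single"):
    # Distance between two clusters under the chosen linkage criterion.
    dists = [pair_dist(p, q) for p in c1 for q in c2]
    if linkage == "single":    # min pairwise distance: prone to chaining
        return min(dists)
    if linkage == "complete":  # max pairwise distance: compact clusters
        return max(dists)
    return sum(dists) / len(dists)  # average linkage

def agglomerative(points, k, linkage="complete"):
    # Start with each point as its own cluster; repeatedly merge the
    # closest pair of clusters until only k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_dist(clusters[ij[0]],
                                               clusters[ij[1]], linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Swapping the `linkage` argument is enough to see the trade-offs in the table: single linkage will happily chain distant points through intermediates, while complete linkage keeps clusters tight.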
Core Point
≥ MinPts neighbors within distance ε
Border Point
< MinPts neighbors, but within ε of a core point
Noise Point
Neither core nor border
ε (eps)
Neighborhood radius. Use k-distance plot to choose: plot each point's distance to its k-th nearest neighbor (k = MinPts), sorted, and look for the 'elbow'
MinPts
Minimum points to form dense region. Rule of thumb: MinPts ≥ dimensionality + 1, often 5-10 works well
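The k-distance heuristic for choosing ε can be sketched directly; this is a minimal pure-Python version (the function name `k_distances` is illustrative), returning the sorted values you would plot:

```python
def k_distances(points, k):
    # Distance from each point to its k-th nearest neighbor,
    # sorted in descending order. Plotting these values and looking
    # for the "elbow" suggests a value for eps (use k = MinPts).
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    out = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(ds[k - 1])
    return sorted(out, reverse=True)
```

Points inside dense regions contribute small k-distances; outliers contribute large ones, so the curve drops sharply at the elbow separating the two regimes.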
Silhouette Coefficient
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) = avg dist to same cluster, b(i) = avg dist to nearest other cluster. Range [-1, 1], higher better.
Davies-Bouldin Index
Ratio of within-cluster to between-cluster distances. Lower is better.
Calinski-Harabasz Index
Ratio of between-cluster variance to within-cluster variance. Higher is better.
Adjusted Rand Index (ARI)
Similarity between true and predicted clustering. Range [-1, 1], 1 = perfect match, 0 = random.
Normalized Mutual Information (NMI)
Mutual information between cluster and class labels, normalized. Range [0, 1], 1 = perfect.
Purity
Fraction of correctly clustered points (assuming majority label per cluster). Range [0, 1].
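Purity is the simplest of these external metrics to compute by hand. A minimal sketch (the function name `purity` is illustrative): group points by cluster, count the majority true label in each cluster, and divide by n:

```python
from collections import Counter

def purity(true_labels, cluster_ids):
    # Fraction of points whose cluster's majority true label
    # matches their own true label. Range [0, 1].
    by_cluster = {}
    for t, c in zip(true_labels, cluster_ids):
        by_cluster.setdefault(c, []).append(t)
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_cluster.values())
    return correct / len(true_labels)
```

Note that purity is not adjusted for chance: assigning every point its own cluster trivially yields purity 1, which is why ARI and NMI are often preferred.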
• Annual spending
• Purchase frequency
• Average order value
• Days since last purchase
| Method | Clusters Found | Silhouette Score | Insights |
|---|---|---|---|
| K-means (k=4) | 4 | 0.52 | VIP, Regular, Occasional, Inactive |
| Hierarchical (Ward) | 5 | 0.48 | More granular segments |
| DBSCAN | 3 + noise | 0.55 | Identified 50 outliers (fraud?) |
DBSCAN's ability to detect outliers revealed potentially fraudulent accounts. K-means provided clean segments for targeted marketing campaigns. Each method offers different insights!
Within-Cluster Sum of Squares:
J = Σ_(k=1)^K Σ_(x_i ∈ C_k) ||x_i - μ_k||²
K-means minimizes J, but finding the global optimum is NP-hard.
Theorem: K-means algorithm converges in finite iterations.
Proof:
Step 1 (Assignment step decreases J):
Assigning x_i to nearest centroid μ_k minimizes ||x_i - μ_k||². Therefore J(new assignments) ≤ J(old).
Step 2 (Update step decreases J):
New centroid μ_k = (1/|C_k|)Σ_(x∈C_k) x minimizes Σ||x - μ_k||² (sample mean minimizes squared distances).
Step 3 (Monotonic decrease):
J decreases at each step (or stays same). J is bounded below by 0.
Step 4 (Finite partitions):
Only finitely many ways to partition n points into K clusters. Since J strictly decreases unless converged, cannot revisit same partition. Must converge in finite steps. ∎
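The proof above can be checked empirically: record J after every iteration and verify it never increases. A minimal pure-Python k-means sketch (the names `kmeans` and `wcss` are illustrative; random initialization is assumed, not k-means++):

```python
import random

def wcss(points, centroids, assign):
    # Objective J: sum of squared distances from each point to its centroid.
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroids[a]))
               for p, a in zip(points, assign))

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from the data
    history = []
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k),
                      key=lambda j: sum((x - c) ** 2
                                        for x, c in zip(p, centroids[j])))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
        history.append(wcss(points, centroids, assign))
        if len(history) > 1 and history[-1] == history[-2]:
            break  # converged: J can no longer decrease
    return centroids, assign, history
```

The returned `history` is monotonically non-increasing, exactly as Steps 1-3 of the proof predict.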
Per iteration:
O(n·K·d): compute the distance from each of n points to each of K centroids in d dimensions
Total complexity:
O(I·n·K·d)
where I = number of iterations (typically < 100)
Theorem (Arthur & Vassilvitskii, 2007):
With k-means++ seeding, E[J] ≤ 8(ln K + 2) · J_opt. The expected solution is O(log K)-competitive with optimal, much better than random initialization.
Intuition: Points far from existing centers have higher chance of being selected.
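The D² sampling behind k-means++ fits in a few lines. A minimal sketch (the function name `kmeans_pp_init` is illustrative): each candidate's selection weight is its squared distance to the nearest center chosen so far.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    # k-means++ seeding: first center uniform at random, then each new
    # center sampled with probability proportional to D(x)^2, the squared
    # distance from x to its nearest already-chosen center.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((x - c) ** 2 for x, c in zip(p, ctr))
                  for ctr in centers)
              for p in points]
        r = rng.random() * sum(d2)   # r falls in [0, total weight)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:              # weighted roulette-wheel selection
                centers.append(p)
                break
    return centers
```

Existing centers have weight zero, so they are never re-selected; far-away points dominate the cumulative sum and are picked with high probability.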
ε-Neighborhood:
N_ε(p) = {q : dist(p, q) ≤ ε}, the set of points within distance ε of p
Core Point:
Point p is a core point if |N_ε(p)| ≥ MinPts
Directly Density-Reachable:
q is directly density-reachable from p if q ∈ N_ε(p) and p is a core point
Density-Reachable (transitive):
q is density-reachable from p if there exists chain p_1,...,p_n where:
p_1 = p, p_n = q, and p_(i+1) is directly density-reachable from p_i
Density-Connected:
p and q are density-connected if ∃ core point o such that both p and q are density-reachable from o
Formal Definition: A cluster C w.r.t. ε and MinPts satisfies:
1. Maximality: if p ∈ C and q is density-reachable from p, then q ∈ C
2. Connectivity: every pair p, q ∈ C is density-connected
Naive: For each point, find ε-neighborhood → O(n²)
With spatial index (R*-tree, KD-tree): O(n log n)
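The definitions above translate almost directly into code. This is a minimal naive O(n²) sketch without a spatial index (the function name `dbscan` is illustrative); it labels each point with a cluster id or -1 for noise:

```python
def dbscan(points, eps, min_pts):
    # Naive DBSCAN: O(n^2) neighborhood queries, no spatial index.
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps * eps]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1        # noise (may later become a border point)
            continue
        labels[i] = cid           # i is a core point: start a new cluster
        seeds = list(nbrs)
        while seeds:              # expand along density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid   # noise reachable from a core -> border
            if labels[j] is not None:
                continue
            labels[j] = cid
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # only core points extend the cluster
        cid += 1
    return labels
```

The expansion loop is exactly transitive density-reachability: border points join a cluster but never extend it, because only points with ≥ MinPts neighbors push new seeds.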
Algorithm steps:
For point i in cluster C_I:
a(i) - Average distance to points in same cluster:
a(i) = (1/(|C_I| - 1)) Σ_(j ∈ C_I, j ≠ i) d(i, j)
b(i) - Average distance to nearest other cluster:
b(i) = min_(J ≠ I) (1/|C_J|) Σ_(j ∈ C_J) d(i, j)
Silhouette coefficient:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
| s(i) | Condition | Interpretation |
|---|---|---|
| ≈ 1 | a(i) << b(i) | Well-clustered |
| ≈ 0 | a(i) ≈ b(i) | On border |
| ≈ -1 | a(i) >> b(i) | Mis-clustered |
Used to evaluate clustering quality and choose optimal K. Higher is better.
Complexity:
O(n²) - need to compute all pairwise distances
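The a(i)/b(i) computation above is short to implement directly. A minimal pure-Python sketch (the function name `mean_silhouette` is illustrative; singleton clusters are scored 0 by convention):

```python
def mean_silhouette(points, labels):
    # Mean silhouette coefficient over all points; O(n^2) because every
    # pairwise distance is computed, matching the complexity noted above.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ids = sorted(set(labels))
    members = {l: [i for i, li in enumerate(labels) if li == l] for l in ids}
    total = 0.0
    for i, p in enumerate(points):
        own = members[labels[i]]
        if len(own) == 1:
            continue  # convention: s(i) = 0 for singleton clusters
        # a(i): mean distance to the other points in i's own cluster.
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): mean distance to the nearest other cluster.
        b = min(sum(dist(p, points[j]) for j in members[l]) / len(members[l])
                for l in ids if l != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)
```

Two tight, well-separated clusters score close to 1; overlapping clusters pull the mean toward 0.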
Test your understanding with 10 multiple-choice questions