Master spectral clustering, Gaussian mixture models, and modern clustering techniques
Build affinity matrix W where:
W_ij = exp(-||x_i - x_j||² / (2σ²))
a Gaussian similarity kernel with scale parameter σ. Can also use k-NN or ε-neighborhood graphs.
Unnormalized Laplacian: L = D - W
where D is degree matrix: D_ii = Σ_j W_ij
Normalized Laplacian (symmetric): L_sym = D^(-1/2) L D^(-1/2) = I - D^(-1/2) W D^(-1/2)
Normalized Laplacian (random walk): L_rw = D^(-1) L = I - D^(-1) W
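A minimal NumPy sketch of the three Laplacians above, assuming a precomputed symmetric affinity matrix W (the function name is illustrative):

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric affinity matrix W."""
    d = W.sum(axis=1)                         # degrees D_ii = sum_j W_ij
    D = np.diag(d)
    L = D - W                                 # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # I - D^(-1/2) W D^(-1/2)
    L_rw = np.eye(len(W)) - np.diag(1.0 / d) @ W           # I - D^(-1) W
    return L, L_sym, L_rw
```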
Theorem: For a graph with k connected components, L has exactly k eigenvalues equal to 0 (counted with multiplicity).
Proof sketch:
• For each connected component C_j, the indicator vector 1_(C_j) satisfies L·1_(C_j) = 0
• These k indicator vectors span the nullspace of L
• All other eigenvalues are > 0 (L is positive semi-definite)
1. Compute affinity matrix W
2. Compute Laplacian: L = D - W (or normalized)
3. Compute first k eigenvectors v_1,...,v_k of L
4. Form matrix U = [v_1 | ... | v_k] ∈ R^(n×k)
5. Normalize rows: Y_i = U_i / ||U_i||
6. Cluster rows of Y using k-means
7. Assign x_i to cluster j if row i assigned to j
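A from-scratch sketch of steps 1–7 using the symmetric normalized Laplacian, assuming NumPy and scikit-learn's KMeans; the function name and default σ are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: Gaussian affinity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)

    # Step 2: normalized Laplacian L_sym = I - D^(-1/2) W D^(-1/2)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # Steps 3-4: first k eigenvectors (smallest eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]

    # Step 5: normalize rows to unit length
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Steps 6-7: k-means on the embedded rows gives the cluster assignment
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```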
Normalized cut objective:
NCut(A_1,...,A_k) = Σ_(j=1)^k cut(A_j, Ā_j) / vol(A_j), where cut(A, Ā) = Σ_(i∈A, j∉A) W_ij and vol(A) = Σ_(i∈A) D_ii
Theorem: Minimizing NCut is NP-hard, but spectral clustering solves a relaxed version via eigenvectors; the first k eigenvectors give an approximate solution.
Parameters: π_k (mixing weights, Σπ_k=1), μ_k (means), Σ_k (covariances)
Posterior probability (responsibility) that point i belongs to cluster k:
γ_ik = π_k N(x_i | μ_k, Σ_k) / Σ_(j=1)^K π_j N(x_i | μ_j, Σ_j)
This is Bayes' rule: posterior ∝ prior × likelihood
Effective number of points in cluster k: N_k = Σ_(i=1)^n γ_ik
Update mixing weights: π_k = N_k / n
Update means: μ_k = (1/N_k) Σ_i γ_ik x_i
Update covariances: Σ_k = (1/N_k) Σ_i γ_ik (x_i - μ_k)(x_i - μ_k)^T
Theorem: The EM algorithm for GMMs monotonically increases the log-likelihood:
log p(X | θ^(t+1)) ≥ log p(X | θ^(t))
Converges to a local optimum. Sensitive to initialization (use k-means++ initialization).
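A compact sketch of one EM iteration implementing the E- and M-step updates above; it assumes NumPy and scipy.stats for the Gaussian density, and the function name is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM; returns updated (pi, mu, Sigma)."""
    n, K = len(X), len(pi)

    # E-step: responsibilities gamma_ik ∝ pi_k N(x_i | mu_k, Sigma_k)
    gamma = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: effective counts, weights, means, covariances
    Nk = gamma.sum(axis=0)
    pi_new = Nk / n
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = []
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pi_new, mu_new, np.array(Sigma_new)
```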
K-Means: hard assignments; implicitly assumes spherical, equally sized clusters; minimizes within-cluster squared distance.
GMM: soft (probabilistic) assignments; elliptical clusters via full covariances; fit by maximizing likelihood with EM. K-means is the limiting case of a GMM with shared covariance σ²I as σ → 0.
Estimate density at point x:
p̂(x) = (1/n) Σ_(i=1)^n K_h(x - x_i)
where K_h is kernel with bandwidth h (e.g., Gaussian kernel)
Gradient of density (using Gaussian kernel):
∇p̂(x) ∝ Σ_i (x_i - x) exp(-||x - x_i||² / (2h²))
Mean shift vector m(x) points toward density increase:
m(x) = [Σ_i x_i exp(-||x - x_i||²/(2h²))] / [Σ_i exp(-||x - x_i||²/(2h²))] - x
For each point x_i:
1. Initialize: y ← x_i
2. Repeat until convergence:
Compute mean of points in window around y
y ← weighted mean (mean shift step)
3. Store mode (local maximum) y
Merge nearby modes → cluster centers
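A minimal sketch of the mean shift loop with a Gaussian kernel; the bandwidth h, tolerance, and function name are illustrative choices:

```python
import numpy as np

def mean_shift(X, h=1.0, tol=1e-4, max_iter=300):
    """Shift every point to a density mode; returns the array of modes."""
    modes = X.copy()
    for i in range(len(X)):
        y = X[i].copy()
        for _ in range(max_iter):
            # Gaussian weights of all points relative to the current position y
            w = np.exp(-((X - y) ** 2).sum(axis=1) / (2 * h ** 2))
            y_new = (w[:, None] * X).sum(axis=0) / w.sum()   # mean shift step
            if np.linalg.norm(y_new - y) < tol:
                break
            y = y_new
        modes[i] = y
    return modes   # merge modes closer than ~h to obtain cluster centers
```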
Advantages: no need to specify the number of clusters; finds arbitrarily shaped clusters; single parameter (bandwidth h); robust to outliers.
Disadvantages: O(n²) work per iteration, so slow for large n; results are sensitive to the bandwidth choice; degrades in high dimensions.
Complete-data log-likelihood (if we knew cluster assignments z):
log p(X, Z | θ) = Σ_(i=1)^n Σ_(k=1)^K z_ik [log π_k + log N(x_i | μ_k, Σ_k)]
Since z is unknown, take the expectation w.r.t. a distribution q(z), giving a lower bound on the log-likelihood:
L(q, θ) = E_q[log p(X, Z | θ)] + H(q) ≤ log p(X | θ)
Fix θ, maximize L w.r.t. q(z). The optimal q is the posterior:
q*(z_i = k) = p(z_i = k | x_i, θ) = γ_ik
Mixing weights π_k:
Lagrangian with constraint Σπ_k = 1:
maximize Σ_i Σ_k γ_ik log π_k + λ(Σ_k π_k - 1) ⟹ π_k = N_k / n
Means μ_k:
Maximize the weighted Gaussian log-likelihood:
∂/∂μ_k [Σ_i γ_ik log N(x_i | μ_k, Σ_k)] = 0 ⟹ μ_k = (1/N_k) Σ_i γ_ik x_i
Covariances Σ_k:
Σ_k = (1/N_k) Σ_i γ_ik (x_i - μ_k)(x_i - μ_k)^T
Theorem: EM for GMMs monotonically increases the log-likelihood and converges to a local optimum.
Input: Similarity matrix S where S(i,k) measures how well point k serves as exemplar for point i
Goal: Maximize the total similarity of points to their exemplars, Σ_i s(i, exemplar(i))
Responsibility r(i,k):
r(i,k) = s(i,k) - max_(k'≠k) [a(i,k') + s(i,k')]
How well-suited k is to be exemplar for i
Availability a(i,k):
a(i,k) = min(0, r(k,k) + Σ_(i'∉{i,k}) max(0, r(i',k)))   for i ≠ k
How appropriate it is for i to choose k as exemplar
Special case (self-availability): a(k,k) = Σ_(i'≠k) max(0, r(i',k))
1. Initialize: a(i,k) = 0 for all i,k
2. Repeat until convergence:
Update responsibilities: r(i,k) for all i,k
Update availabilities: a(i,k) for all i,k
Apply damping to both messages: r_new ← λ·r_old + (1-λ)·r_new (and similarly for a)
3. For each i: exemplar(i) = argmax_k [a(i,k) + r(i,k)]
4. Form clusters from exemplars
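A NumPy sketch of the message-passing loop above, assuming a precomputed similarity matrix S with preferences s(k,k) on the diagonal; the damping follows the rule in step 2 and the function name is illustrative:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """S: n×n similarity matrix, S[k,k] = preference. Returns exemplar index per point."""
    n = len(S)
    R = np.zeros((n, n))   # responsibilities
    A = np.zeros((n, n))   # availabilities

    for _ in range(n_iter):
        # Responsibility: r(i,k) = s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new

        # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())      # keep r(k,k) itself in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag_A = A_new.diagonal().copy()        # a(k,k) = sum_{i'!=k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag_A)
        A = damping * A + (1 - damping) * A_new

    return np.argmax(A + R, axis=1)             # exemplar(i) = argmax_k [a(i,k) + r(i,k)]
```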
Advantages: number of clusters is not fixed in advance (controlled by the preferences s(k,k)); exemplars are actual data points; works with arbitrary, even non-metric, similarities.
Complexity: O(n²) memory for the messages and O(n²·T) time for T iterations, which limits it to moderately sized datasets.
Merge the pair of clusters that minimizes the increase in total within-cluster variance:
Δ(A, B) = (|A|·|B| / (|A| + |B|)) · ||μ_A - μ_B||²
Equivalent to minimizing the sum of squared distances to cluster centroids; often performs well among linkage criteria in practice.
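Ward linkage is available off the shelf; a short usage sketch with SciPy (the data and cluster count are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                         # example data
Z = linkage(X, method='ward')                      # merge order minimizing variance increase
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
```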
Problem: Cluster "two moons" dataset where k-means fails
Step 1: Compute RBF similarity: W_ij = exp(-||x_i - x_j||²/0.5)
Step 2: Build normalized Laplacian: L_sym = I - D^(-1/2) W D^(-1/2)
Step 3: Compute first 2 eigenvectors v_1, v_2
Step 4: Apply k-means (k=2) on [v_1 | v_2]
Result: Perfect separation of two moons! K-means on original space fails.
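The same example can be reproduced with scikit-learn; gamma = 1/0.5 = 2 matches the RBF scale used in Step 1 (sample size and noise level are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, KMeans

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Spectral clustering follows the two moons
labels_sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=2.0,
                               random_state=0).fit_predict(X)

# k-means in the original space splits each moon instead
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```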
Sort eigenvalues: 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n
Choose k where eigengap λ_(k+1) - λ_k is largest
Intuition:
For k well-separated clusters, first k eigenvalues are small (near 0), then large jump to λ_(k+1)
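A small sketch of the eigengap heuristic, assuming the eigenvalues of a (normalized) Laplacian sorted in ascending order:

```python
import numpy as np

def choose_k_by_eigengap(L_sym, k_max=10):
    """Pick k at the position of the largest gap among the smallest eigenvalues."""
    eigvals = np.linalg.eigvalsh(L_sym)[:k_max]   # ascending: lambda_1 <= lambda_2 <= ...
    gaps = np.diff(eigvals)                       # lambda_(k+1) - lambda_k
    return int(np.argmax(gaps)) + 1               # gaps[0] is the gap after lambda_1
```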
Unnormalized L = D - W: relaxation of RatioCut; can be unreliable when node degrees vary widely.
Normalized L_sym = I - D^(-1/2) W D^(-1/2): relaxation of NCut; generally preferred, but requires the row-normalization step on the eigenvector matrix.
Bottleneck: eigendecomposition of the n×n Laplacian, O(n³) in general
For large graphs (n > 10K), use sparse k-NN affinity graphs with iterative eigensolvers (e.g., Lanczos/ARPACK) or the Nyström approximation
• BIC = -2 log L + p log n, where L = maximized likelihood, p = number of parameters, n = number of samples
• For GMM: p = K(d + d(d+1)/2 + 1) - 1
• Choose K with lowest BIC
• AIC = -2 log L + 2p: smaller penalty than BIC → tends to select larger K
• BIC preferred for model selection, AIC for prediction
Compute silhouette score for each K, choose K with highest average score. Model-agnostic alternative to BIC/AIC.
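A sketch of BIC-based selection of K using scikit-learn's GaussianMixture, which exposes bic()/aic() directly; the candidate range is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_range=range(1, 11)):
    """Fit a GMM for each candidate K and return the K with the lowest BIC."""
    bics = []
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              init_params='kmeans', random_state=0).fit(X)
        bics.append(gmm.bic(X))        # use gmm.aic(X) for AIC instead
    return list(k_range)[int(np.argmin(bics))]
```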
Cluster centers are characterized by two quantities: local density ρ_i = Σ_(j≠i) exp(-(d_ij/d_c)²) and separation δ_i = min_(j: ρ_j > ρ_i) d_ij, the distance to the nearest point of higher density
or simply: ρ_i = number of neighbors within cutoff d_c
Plot ρ vs δ: points with both high → cluster centers!
1. Compute ρ_i for all points
2. Compute δ_i for all points
3. Select centers: points with γ_i = ρ_i × δ_i above threshold
4. Assign each remaining point to the same cluster as its nearest neighbor of higher density
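A compact NumPy sketch of the four steps above; the cutoff d_c and the number of centers are illustrative parameters, and here centers are taken as the points with the largest γ = ρ·δ rather than by a threshold:

```python
import numpy as np

def density_peaks(X, d_c=1.0, n_centers=3):
    """Clustering by finding density peaks (high rho and high delta)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    rho = (d < d_c).sum(axis=1) - 1          # local density: neighbors within d_c

    n = len(X)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points of strictly higher density
        if len(higher) == 0:                 # global density peak
            delta[i] = d[i].max()
        else:
            j = higher[np.argmin(d[i, higher])]
            delta[i], nearest_higher[i] = d[i, j], j

    # Centers: largest gamma = rho * delta; assign the rest by following each
    # point's nearest higher-density neighbor (processed in decreasing density).
    centers = np.argsort(rho * delta)[-n_centers:]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in np.argsort(-rho):
        if labels[i] == -1:
            j = nearest_higher[i]
            labels[i] = labels[j] if j >= 0 else labels[centers[np.argmin(d[i, centers])]]
    return labels
```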