
Advanced Clustering

Master spectral clustering, Gaussian mixture models, and modern clustering techniques

Spectral Clustering Mathematical Foundation

Graph Laplacian Eigendecomposition

Graph Construction

Build affinity matrix W where:

W_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / 2\sigma^2) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}

Gaussian similarity kernel with scale parameter σ. Can also use k-NN or ε-neighborhood graphs.

Graph Laplacian Matrices

Unnormalized Laplacian:

L = D - W

where D is the degree matrix: D_ii = Σ_j W_ij

Normalized Laplacian (symmetric):

L_{\text{sym}} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}

Normalized Laplacian (random walk):

L_{\text{rw}} = D^{-1}L = I - D^{-1}W
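
All three Laplacians can be computed in a few lines of NumPy. The sketch below assumes a small dense data matrix X and an illustrative σ; for large datasets you would typically build a sparse k-NN graph instead.

```python
import numpy as np

def graph_laplacians(X, sigma=1.0):
    """Build the RBF affinity matrix W and the three Laplacians defined above."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                       # W_ii = 0 by convention
    d = W.sum(axis=1)                              # degrees D_ii = sum_j W_ij
    D = np.diag(d)
    L = D - W                                      # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    I = np.eye(len(X))
    L_sym = I - D_inv_sqrt @ W @ D_inv_sqrt        # symmetric normalized Laplacian
    L_rw = I - np.diag(1.0 / d) @ W                # random-walk normalized Laplacian
    return W, L, L_sym, L_rw
```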

Key Spectral Properties

Theorem: For a graph with k connected components, L has exactly k eigenvalues equal to 0 (counting multiplicity).

Proof sketch:

• For each connected component C_j, the indicator vector 1_{C_j} satisfies L·1_{C_j} = 0

• These k indicator vectors span the nullspace of L

• All other eigenvalues are strictly positive (L is positive semi-definite)
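
A quick numerical check of the theorem, using a toy affinity matrix with two disconnected blocks (two connected components); the solver should report exactly two zero eigenvalues up to floating-point error.

```python
import numpy as np

# Toy affinity matrix: two disconnected 3-node cliques -> 2 connected components
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian
print(np.round(np.linalg.eigvalsh(L), 6))
# Expect exactly 2 zero eigenvalues (one per component), all others positive.
```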

Spectral Clustering Algorithm

1. Compute affinity matrix W

2. Compute Laplacian: L = D - W (or normalized)

3. Compute the first k eigenvectors v_1,...,v_k of L (those with the k smallest eigenvalues)

4. Form matrix U = [v_1 | ... | v_k] ∈ R^(n×k)

5. Normalize rows: Y_i = U_i / ||U_i||

6. Cluster rows of Y using k-means

7. Assign x_i to cluster j if row i of Y is assigned to cluster j
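
A minimal NumPy sketch of steps 1-7, using the symmetric normalized Laplacian and a Gaussian affinity; sigma and the k-means settings are illustrative choices rather than part of the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Steps 1-7 above, using the symmetric normalized Laplacian L_sym."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))                 # step 1: affinity matrix
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt     # step 2: Laplacian
    _, vecs = np.linalg.eigh(L_sym)                          # eigenvalues in ascending order
    U = vecs[:, :k]                                          # steps 3-4: k smallest eigenvectors
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)         # step 5: row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)    # steps 6-7: k-means on rows
```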

Why It Works: Normalized Cut

Normalized cut objective:

\text{NCut}(A_1,\ldots,A_k) = \sum_{i=1}^{k} \frac{\text{cut}(A_i, \bar{A}_i)}{\text{vol}(A_i)}

Theorem: Minimizing NCut is NP-hard, but spectral clustering solves a relaxed version via eigenvectors: the first k eigenvectors give an approximate solution.

Gaussian Mixture Model & EM Algorithm

Complete EM Derivation for GMM

GMM Model

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

Parameters: π_k (mixing weights, Σπ_k=1), μ_k (means), Σ_k (covariances)

E-Step: Compute Responsibilities

Posterior probability (responsibility) that point i belongs to cluster k:

\gamma_{ik} = P(z_i=k|\mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

This is Bayes' rule: posterior ∝ prior × likelihood

M-Step: Update Parameters

Effective number of points in cluster k:

N_k = \sum_{i=1}^{n} \gamma_{ik}

Update mixing weights:

\pi_k^{\text{new}} = \frac{N_k}{n}

Update means:

\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i

Update covariances:

\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})^T
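
The E- and M-step formulas translate directly into code. This is a minimal NumPy/SciPy sketch; the initialization, iteration count, and the small regularization constant reg added to each covariance are illustrative choices (a production implementation would also monitor the log-likelihood for convergence).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, reg=1e-6, seed=0):
    """Alternate the E-step and M-step updates shown above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]          # crude initialization
    Sigma = np.stack([np.cov(X.T) + reg * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik via Bayes' rule
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi, mu, Sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pi, mu, Sigma, gamma
```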

Convergence Guarantee

Theorem: EM algorithm for GMM monotonically increases log-likelihood:

\log p(\mathbf{X}|\theta^{t+1}) \geq \log p(\mathbf{X}|\theta^t)

Converges to a local optimum. Sensitive to initialization (use k-means++ initialization).

GMM vs K-Means

K-Means

  • Hard assignments
  • Spherical clusters
  • Distance-based
  • Faster

GMM

  • Soft assignments (probabilities)
  • Elliptical clusters
  • Probabilistic model
  • More parameters

Mean Shift Algorithm

Gradient Ascent in Density Space

Kernel Density Estimation

Estimate density at point x:

f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)

where K_h is a kernel with bandwidth h (e.g., a Gaussian kernel)

Mean Shift Vector

Gradient of density (using Gaussian kernel):

\nabla f(\mathbf{x}) \propto \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{x}) K(\mathbf{x} - \mathbf{x}_i)

Mean shift vector m(x) points toward density increase:

\mathbf{m}(\mathbf{x}) = \frac{\sum_{i} \mathbf{x}_i K(\mathbf{x} - \mathbf{x}_i)}{\sum_{i} K(\mathbf{x} - \mathbf{x}_i)} - \mathbf{x}

Algorithm

For each point x_i:

1. Initialize: y ← x_i

2. Repeat until convergence:

Compute mean of points in window around y

y ← weighted mean (mean shift step)

3. Store mode (local maximum) y

Merge nearby modes → cluster centers
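
A compact NumPy sketch of the procedure above, run for all points at once with a Gaussian kernel; the bandwidth, iteration count, and the mode-merging tolerance are illustrative parameters.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=50, merge_tol=1e-2):
    """Shift every point toward the weighted mean of its neighborhood, then merge modes."""
    Y = X.copy()
    for _ in range(n_iter):
        sq_dists = np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq_dists / (2 * bandwidth ** 2))        # Gaussian kernel weights
        Y = (K @ X) / K.sum(axis=1, keepdims=True)          # mean shift step: weighted mean

    # Merge modes that converged to (nearly) the same location -> cluster centers
    modes, labels = [], np.empty(len(X), dtype=int)
    for i, y in enumerate(Y):
        for j, m in enumerate(modes):
            if np.linalg.norm(y - m) < merge_tol:
                labels[i] = j
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return np.array(modes), labels
```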

Properties

Advantages:

  • No need to specify K
  • Finds arbitrarily shaped clusters
  • Non-parametric

Disadvantages:

  • Computationally expensive: O(n²T)
  • Sensitive to bandwidth h
  • Doesn't scale well

GMM: Complete EM Derivation

Expectation-Maximization for Mixture Models

Lower Bound (ELBO)

Complete-data log-likelihood (if we knew cluster assignments z):

\log P(\mathbf{X}, \mathbf{Z}|\theta) = \sum_i \sum_k z_{ik}[\log \pi_k + \log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)]

Since z is unknown, take the expectation with respect to a distribution q(z), which yields the evidence lower bound (ELBO):

\mathcal{L}(q, \theta) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z}|\theta)] - \mathbb{E}_q[\log q(\mathbf{Z})]

E-Step Derivation

Fix θ and maximize L(q, θ) with respect to q(z); the optimal q is the posterior:

q(z_{ik}=1) = \gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-Step Derivation for Parameters

Mixing weights π_k:

Lagrangian with constraint Σπ_k = 1:

\pi_k = \frac{N_k}{n} \text{ where } N_k = \sum_i \gamma_{ik}

Means μ_k:

Maximize weighted Gaussian likelihoods:

\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\mathbf{x}_i}{N_k}

Covariances Σ_k:

\boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{N_k}

Convergence Properties

Theorem: EM for GMM monotonically increases log-likelihood and converges to local optimum.

  • Convergence is guaranteed, but only to a local optimum
  • Sensitive to initialization (use k-means++ to initialize)
  • Can converge to a boundary solution (singular covariance) → add regularization
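
In practice these issues are usually handled by library defaults. As a sketch, scikit-learn's GaussianMixture exposes a covariance regularizer and k-means-based initialization; the data below is a random placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    reg_covar=1e-6,         # small value added to covariance diagonals (avoids singularities)
    init_params="kmeans",   # k-means-based initialization
    n_init=5,               # several restarts to mitigate bad local optima
    random_state=0,
)
labels = gmm.fit_predict(X)
print(gmm.converged_, gmm.lower_bound_)   # convergence flag, final lower bound on log-likelihood
```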

Affinity Propagation Message Passing

Exemplar-Based Clustering via Message Passing

Problem Formulation

Input: Similarity matrix S where S(i,k) measures how well point k serves as exemplar for point i

Goal: Maximize total similarity

\sum_{i} S(i, \text{exemplar}(i))

Message Passing Equations

Responsibility r(i,k):

How well-suited k is to be exemplar for i

r(i,k) \leftarrow S(i,k) - \max_{k' \neq k} \{a(i,k') + S(i,k')\}

Availability a(i,k):

How appropriate it is for i to choose k as exemplar

a(i,k) \leftarrow \min\left\{0, r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\right\}

Special case (self-availability): a(k,k) = Σ_{i'≠k} max{0, r(i',k)}

Algorithm

1. Initialize: a(i,k) = 0 for all i,k

2. Repeat until convergence:

Update responsibilities: r(i,k) for all i,k

Update availabilities: a(i,k) for all i,k

Apply damping to both message types: r ← λ·r_old + (1-λ)·r_new, and likewise for a

3. For each i: exemplar(i) = argmax_k [a(i,k) + r(i,k)]

4. Form clusters from exemplars
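
A vectorized NumPy sketch of this message-passing loop, following the update equations above; the damping factor and the fixed iteration count are illustrative (real implementations, e.g. scikit-learn's AffinityPropagation, also stop early once the exemplar set stabilizes).

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Vectorized responsibility/availability updates from the equations above."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    rows = np.arange(n)

    for _ in range(n_iter):
        # Responsibilities: r(i,k) = S(i,k) - max_{k' != k} [a(i,k') + S(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new             # damping, as in step 2

        # Availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())                  # keep r(k,k) itself
        A_new = Rp.sum(axis=0)[None, :] - Rp                # column sum minus own term
        diag = A_new.diagonal().copy()                      # a(k,k) = sum_{i'!=k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    exemplars = (A + R).argmax(axis=1)                      # step 3
    return exemplars                                        # clusters = points sharing an exemplar
```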

Key Properties

Advantages:

  • Automatically determines K
  • Considers all points as potential exemplars
  • Works with any similarity measure
  • No random initialization

Complexity:

  • Time: O(n²) per iteration, O(n²T) total over T iterations
  • Space: O(n²) for messages
  • T is typically 10-200 iterations

Hierarchical Clustering Mathematical Analysis

Linkage Criteria Comparison

Single Linkage (MIN)

d(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Can find non-convex clusters
  • ✗ Sensitive to noise, chaining effect
  • Complexity: O(n² log n) with efficient data structures

Complete Linkage (MAX)

d(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Prefers compact, spherical clusters
  • ✓ Robust to noise
  • ✗ Breaks large clusters

Average Linkage (UPGMA)

d(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{\mathbf{x} \in C_i}\sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Balance between single and complete
  • ✓ Good general-purpose choice
  • Most commonly used in practice

Ward's Method (Minimum Variance)

Merge clusters that minimize increase in total within-cluster variance:

\Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i|+|C_j|}\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2

Equivalent to minimizing the sum of squared distances to cluster centroids; it often gives the best results in practice.
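
All four linkage criteria are available in SciPy's hierarchical clustering module; the sketch below compares them on a random placeholder dataset and cuts each dendrogram into three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))     # placeholder data

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes (labels start at 1)
```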

Representative Example

Spectral Clustering on Non-Convex Data

Problem: Cluster "two moons" dataset where k-means fails

Step 1: Compute RBF similarity: W_ij = exp(-||x_i - x_j||²/0.5)

Step 2: Build normalized Laplacian: L_sym = I - D^(-1/2) W D^(-1/2)

Step 3: Compute first 2 eigenvectors v_1, v_2

Step 4: Apply k-means (k=2) on [v_1 | v_2]

Result: perfect separation of the two moons, while k-means in the original feature space fails.
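
The experiment can be reproduced with scikit-learn (the noise level and random seeds below are illustrative); with the RBF affinity, gamma = 2 corresponds to the divisor of 0.5 used above, since W_ij = exp(-gamma·||x_i - x_j||²).

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# RBF affinity with gamma = 2  <=>  W_ij = exp(-||x_i - x_j||^2 / 0.5)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="rbf", gamma=2.0, random_state=0
).fit_predict(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# spectral_labels follows the two moons; kmeans_labels splits the data with a straight boundary.
```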

Spectral Clustering: Advanced Topics

Choosing Number of Clusters via Eigengap

Eigengap Heuristic

Sort eigenvalues: 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n

Choose k where eigengap λ_(k+1) - λ_k is largest

Intuition:

For k well-separated clusters, the first k eigenvalues are small (near 0), followed by a large jump to λ_(k+1)
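
A small helper implementing this heuristic, assuming L_sym has already been computed as above; k_max simply bounds the search.

```python
import numpy as np

def choose_k_by_eigengap(L_sym, k_max=10):
    """Return the k with the largest gap among the smallest eigenvalues of L_sym."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:k_max + 1]
    gaps = np.diff(eigvals)            # gaps[j] = lambda_{j+2} - lambda_{j+1} (1-indexed)
    return int(np.argmax(gaps)) + 1    # +1 because gaps[0] corresponds to k = 1
```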

Normalized vs Unnormalized Laplacian

Unnormalized L

  • Sensitive to cluster sizes
  • Prefers balanced cuts
  • Eigenvalues not bounded

Normalized L_sym

  • Handles varying cluster sizes
  • More robust in practice
  • Eigenvalues in [0,2]

Computational Efficiency

Bottleneck: Eigendecomposition of n×n matrix

  • Full eigendecomposition: O(n³)
  • Sparse matrices (k-NN graphs): O(nk²) using Lanczos/Arnoldi
  • Nyström approximation: O(m²n) where m ≪ n (landmark points)

For large graphs (n > 10K), use approximate methods!

GMM: Model Selection

Choosing Number of Components K

Bayesian Information Criterion (BIC)

\text{BIC} = -2\log \mathcal{L}(\hat{\theta}) + p\log n

• L = likelihood, p = number of parameters, n = number of samples
• For GMM: p = K(d + d(d+1)/2 + 1) - 1
• Choose K with lowest BIC

Akaike Information Criterion (AIC)

\text{AIC} = -2\log \mathcal{L}(\hat{\theta}) + 2p

• Less penalty than BIC → tends to select larger K
• BIC preferred for model selection, AIC for prediction

Silhouette Score

Compute silhouette score for each K, choose K with highest average score. Model-agnostic alternative to BIC/AIC.
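
With scikit-learn, BIC and AIC are one method call each, so a model-selection sweep over K is a short loop; the data below is a random placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))    # placeholder data

results = []
for K in range(1, 9):
    gmm = GaussianMixture(n_components=K, n_init=3, random_state=0).fit(X)
    results.append((K, gmm.bic(X), gmm.aic(X)))

best_K = min(results, key=lambda r: r[1])[0]           # lowest BIC wins
print(results)
print("Chosen K:", best_K)
```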

Density Peaks Clustering

Finding Cluster Centers by Density and Distance

Key Idea

Cluster centers characterized by:

  1. High local density ρ_i
  2. Large distance δ_i to nearest point with higher density

Density Computation

\rho_i = \sum_{j} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right)

or simply: ρ_i = number of neighbors within cutoff d_c

Distance to Higher Density

\delta_i = \begin{cases} \min_{j:\rho_j > \rho_i} d_{ij} & \text{if } \exists j: \rho_j > \rho_i \\ \max_j d_{ij} & \text{otherwise} \end{cases}

Plot ρ vs. δ: points where both are high are the cluster centers.

Algorithm Steps

1. Compute ρ_i for all points

2. Compute δ_i for all points

3. Select centers: points with γ_i = ρ_i × δ_i above threshold

4. Assign remaining points to nearest center with higher density
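
A NumPy sketch of these four steps with the Gaussian density kernel; the cutoff d_c and the number of centers are left as inputs, and the sketch assumes the densest point is among the selected centers (its γ = ρ·δ is normally the largest).

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Steps 1-4 above with the Gaussian density kernel."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0      # step 1 (subtract self-term)
    order = np.argsort(-rho)                             # indices by decreasing density

    delta = np.empty(len(X))
    nearest_higher = np.full(len(X), -1)
    delta[order[0]] = D[order[0]].max()                  # densest point: delta = max distance
    for rank in range(1, len(X)):
        i = order[rank]
        higher = order[:rank]                            # all points with higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j         # step 2

    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]             # step 3: largest rho * delta
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                      # step 4: assign in decreasing density,
        if labels[i] == -1:                              # so the nearest higher-density point
            labels[i] = labels[nearest_higher[i]]        # is already labeled
    return centers, labels
```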

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. Spectral clustering uses:
2. What does GMM (Gaussian Mixture Model) assume?
3. The EM algorithm for GMM alternates between:
4. Compared to k-means, GMM:
5. In spectral clustering, the unnormalized graph Laplacian is:
6. Affinity propagation determines clusters by:
7. Mean shift clustering:
8. Consensus (ensemble) clustering combines:
9. When is spectral clustering preferred over k-means?
10. What is the main challenge with GMM?