
Advanced Clustering

Master spectral clustering, Gaussian mixture models, and modern clustering techniques

Spectral Clustering Mathematical Foundation

Graph Laplacian Eigendecomposition

Graph Construction

Build affinity matrix W where:

W_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / 2\sigma^2) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}

Gaussian similarity kernel with scale parameter σ. Can also use k-NN or ε-neighborhood graphs.

Graph Laplacian Matrices

Unnormalized Laplacian:

L = D - W

where D is the degree matrix: D_ii = Σ_j W_ij

Normalized Laplacian (symmetric):

L_{\text{sym}} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}

Normalized Laplacian (random walk):

L_{\text{rw}} = D^{-1}L = I - D^{-1}W
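
All three Laplacians can be computed in a few lines of NumPy. The sketch below assumes a small dense data matrix X and an illustrative σ; for large datasets you would typically build a sparse k-NN graph instead.

```python
import numpy as np

def graph_laplacians(X, sigma=1.0):
    """Build the RBF affinity matrix W and the three Laplacians defined above."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                       # W_ii = 0 by convention
    d = W.sum(axis=1)                              # degrees D_ii = sum_j W_ij
    D = np.diag(d)
    L = D - W                                      # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    I = np.eye(len(X))
    L_sym = I - D_inv_sqrt @ W @ D_inv_sqrt        # symmetric normalized Laplacian
    L_rw = I - np.diag(1.0 / d) @ W                # random-walk normalized Laplacian
    return W, L, L_sym, L_rw
```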

Key Spectral Properties

Theorem: For a graph with k connected components, L has exactly k eigenvalues equal to 0 (counting multiplicity).

Proof sketch:

• For each connected component C_j, the indicator vector 1_{C_j} satisfies L·1_{C_j} = 0

• These k indicator vectors span the nullspace of L

• All other eigenvalues are strictly positive (L is positive semi-definite)
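
A quick numerical check of the theorem, using a toy affinity matrix with two disconnected blocks (two connected components); the solver should report exactly two zero eigenvalues up to floating-point error.

```python
import numpy as np

# Toy affinity matrix: two disconnected 3-node cliques -> 2 connected components
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian
print(np.round(np.linalg.eigvalsh(L), 6))
# Expect exactly 2 zero eigenvalues (one per component), all others positive.
```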

Spectral Clustering Algorithm

1. Compute affinity matrix W

2. Compute Laplacian: L = D - W (or normalized)

3. Compute the first k eigenvectors v_1,...,v_k of L (those with the k smallest eigenvalues)

4. Form matrix U = [v_1 | ... | v_k] ∈ R^(n×k)

5. Normalize rows: Y_i = U_i / ||U_i||

6. Cluster rows of Y using k-means

7. Assign x_i to cluster j if row i of Y is assigned to cluster j
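
A minimal NumPy sketch of steps 1-7, using the symmetric normalized Laplacian and a Gaussian affinity; sigma and the k-means settings are illustrative choices rather than part of the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Steps 1-7 above, using the symmetric normalized Laplacian L_sym."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))                 # step 1: affinity matrix
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt     # step 2: Laplacian
    _, vecs = np.linalg.eigh(L_sym)                          # eigenvalues in ascending order
    U = vecs[:, :k]                                          # steps 3-4: k smallest eigenvectors
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)         # step 5: row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)    # steps 6-7: k-means on rows
```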

Why It Works: Normalized Cut

Normalized cut objective:

\text{NCut}(A_1,\ldots,A_k) = \sum_{i=1}^{k} \frac{\text{cut}(A_i, \bar{A}_i)}{\text{vol}(A_i)}

Theorem: Minimizing NCut is NP-hard, but spectral clustering solves a relaxed version via eigenvectors: the first k eigenvectors give an approximate solution.

Gaussian Mixture Model & EM Algorithm

Complete EM Derivation for GMM

GMM Model

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

Parameters: π_k (mixing weights, Σπ_k=1), μ_k (means), Σ_k (covariances)

E-Step: Compute Responsibilities

Posterior probability (responsibility) that point i belongs to cluster k:

\gamma_{ik} = P(z_i=k|\mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

This is Bayes' rule: posterior ∝ prior × likelihood

M-Step: Update Parameters

Effective number of points in cluster k:

N_k = \sum_{i=1}^{n} \gamma_{ik}

Update mixing weights:

\pi_k^{\text{new}} = \frac{N_k}{n}

Update means:

\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i

Update covariances:

\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})^T
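
The E- and M-step formulas translate directly into code. This is a minimal NumPy/SciPy sketch; the initialization, iteration count, and the small regularization constant reg added to each covariance are illustrative choices (a production implementation would also monitor the log-likelihood for convergence).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, reg=1e-6, seed=0):
    """Alternate the E-step and M-step updates shown above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]          # crude initialization
    Sigma = np.stack([np.cov(X.T) + reg * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik via Bayes' rule
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi, mu, Sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pi, mu, Sigma, gamma
```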

Convergence Guarantee

Theorem: EM algorithm for GMM monotonically increases log-likelihood:

\log p(\mathbf{X}|\theta^{t+1}) \geq \log p(\mathbf{X}|\theta^t)

Converges to a local optimum. Sensitive to initialization (use k-means++ initialization).

GMM vs K-Means

K-Means

  • Hard assignments
  • Spherical clusters
  • Distance-based
  • Faster

GMM

  • Soft assignments (probabilities)
  • Elliptical clusters
  • Probabilistic model
  • More parameters

Mean Shift Algorithm

Gradient Ascent in Density Space

Kernel Density Estimation

Estimate density at point x:

f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)

where K_h is a kernel with bandwidth h (e.g., a Gaussian kernel)

Mean Shift Vector

Gradient of density (using Gaussian kernel):

\nabla f(\mathbf{x}) \propto \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{x}) K(\mathbf{x} - \mathbf{x}_i)

Mean shift vector m(x) points toward density increase:

\mathbf{m}(\mathbf{x}) = \frac{\sum_{i} \mathbf{x}_i K(\mathbf{x} - \mathbf{x}_i)}{\sum_{i} K(\mathbf{x} - \mathbf{x}_i)} - \mathbf{x}

Algorithm

For each point x_i:

1. Initialize: y ← x_i

2. Repeat until convergence:

Compute mean of points in window around y

y ← weighted mean (mean shift step)

3. Store mode (local maximum) y

Merge nearby modes → cluster centers
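
A compact NumPy sketch of the procedure above, run for all points at once with a Gaussian kernel; the bandwidth, iteration count, and the mode-merging tolerance are illustrative parameters.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=50, merge_tol=1e-2):
    """Shift every point toward the weighted mean of its neighborhood, then merge modes."""
    Y = X.copy()
    for _ in range(n_iter):
        sq_dists = np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq_dists / (2 * bandwidth ** 2))        # Gaussian kernel weights
        Y = (K @ X) / K.sum(axis=1, keepdims=True)          # mean shift step: weighted mean

    # Merge modes that converged to (nearly) the same location -> cluster centers
    modes, labels = [], np.empty(len(X), dtype=int)
    for i, y in enumerate(Y):
        for j, m in enumerate(modes):
            if np.linalg.norm(y - m) < merge_tol:
                labels[i] = j
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return np.array(modes), labels
```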

Properties

Advantages:

  • No need to specify K
  • Finds arbitrarily shaped clusters
  • Non-parametric

Disadvantages:

  • Computationally expensive: O(n²T)
  • Sensitive to bandwidth h
  • Doesn't scale well

GMM: Complete EM Derivation

Expectation-Maximization for Mixture Models

Lower Bound (ELBO)

Complete-data log-likelihood (if we knew cluster assignments z):

\log P(\mathbf{X}, \mathbf{Z}|\theta) = \sum_i \sum_k z_{ik}[\log \pi_k + \log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)]

Since z is unknown, take the expectation with respect to a distribution q(z), which yields the evidence lower bound (ELBO):

\mathcal{L}(q, \theta) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z}|\theta)] - \mathbb{E}_q[\log q(\mathbf{Z})]

E-Step Derivation

Fix θ and maximize L(q, θ) with respect to q(z); the optimal q is the posterior:

q(z_{ik}=1) = \gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-Step Derivation for Parameters

Mixing weights π_k:

Lagrangian with constraint Σπ_k = 1:

\pi_k = \frac{N_k}{n} \text{ where } N_k = \sum_i \gamma_{ik}

Means μ_k:

Maximize weighted Gaussian likelihoods:

\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\mathbf{x}_i}{N_k}

Covariances Σ_k:

\boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{N_k}

Convergence Properties

Theorem: EM for GMM monotonically increases log-likelihood and converges to local optimum.

  • Convergence is guaranteed, but only to a local optimum
  • Sensitive to initialization (use k-means++ to initialize)
  • Can converge to a boundary solution (singular covariance) → add regularization
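
In practice these issues are usually handled by library defaults. As a sketch, scikit-learn's GaussianMixture exposes a covariance regularizer and k-means-based initialization; the data below is a random placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    reg_covar=1e-6,         # small value added to covariance diagonals (avoids singularities)
    init_params="kmeans",   # k-means-based initialization
    n_init=5,               # several restarts to mitigate bad local optima
    random_state=0,
)
labels = gmm.fit_predict(X)
print(gmm.converged_, gmm.lower_bound_)   # convergence flag, final lower bound on log-likelihood
```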

Affinity Propagation Message Passing

Exemplar-Based Clustering via Message Passing

Problem Formulation

Input: Similarity matrix S where S(i,k) measures how well point k serves as exemplar for point i

Goal: Maximize total similarity

\sum_{i} S(i, \text{exemplar}(i))

Message Passing Equations

Responsibility r(i,k):

How well-suited k is to be exemplar for i

r(i,k) \leftarrow S(i,k) - \max_{k' \neq k} \{a(i,k') + S(i,k')\}

Availability a(i,k):

How appropriate it is for i to choose k as exemplar

a(i,k) \leftarrow \min\left\{0, r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\right\}

Special case (self-availability): a(k,k) = Σ_{i'≠k} max{0, r(i',k)}

Algorithm

1. Initialize: a(i,k) = 0 for all i,k

2. Repeat until convergence:

Update responsibilities: r(i,k) for all i,k

Update availabilities: a(i,k) for all i,k

Apply damping to both message types: r ← λ·r_old + (1-λ)·r_new, and likewise for a

3. For each i: exemplar(i) = argmax_k [a(i,k) + r(i,k)]

4. Form clusters from exemplars
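
A vectorized NumPy sketch of this message-passing loop, following the update equations above; the damping factor and the fixed iteration count are illustrative (real implementations, e.g. scikit-learn's AffinityPropagation, also stop early once the exemplar set stabilizes).

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Vectorized responsibility/availability updates from the equations above."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    rows = np.arange(n)

    for _ in range(n_iter):
        # Responsibilities: r(i,k) = S(i,k) - max_{k' != k} [a(i,k') + S(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new             # damping, as in step 2

        # Availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())                  # keep r(k,k) itself
        A_new = Rp.sum(axis=0)[None, :] - Rp                # column sum minus own term
        diag = A_new.diagonal().copy()                      # a(k,k) = sum_{i'!=k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    exemplars = (A + R).argmax(axis=1)                      # step 3
    return exemplars                                        # clusters = points sharing an exemplar
```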

Key Properties

Advantages:

  • Automatically determines K
  • Considers all points as potential exemplars
  • Works with any similarity measure
  • No random initialization

Complexity:

  • Time: O(n²) per iteration, O(n²T) total over T iterations
  • Space: O(n²) for messages
  • T is typically 10-200 iterations

Hierarchical Clustering Mathematical Analysis

Linkage Criteria Comparison

Single Linkage (MIN)

d(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Can find non-convex clusters
  • ✗ Sensitive to noise, chaining effect
  • Complexity: O(n² log n) with efficient data structures

Complete Linkage (MAX)

d(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Prefers compact, spherical clusters
  • ✓ Robust to noise
  • ✗ Breaks large clusters

Average Linkage (UPGMA)

d(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{\mathbf{x} \in C_i}\sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})
  • ✓ Balance between single and complete
  • ✓ Good general-purpose choice
  • Most commonly used in practice

Ward's Method (Minimum Variance)

Merge clusters that minimize increase in total within-cluster variance:

\Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i|+|C_j|}\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2

Equivalent to minimizing the sum of squared distances to cluster centroids; it often gives the best results in practice.
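
All four linkage criteria are available in SciPy's hierarchical clustering module; the sketch below compares them on a random placeholder dataset and cuts each dendrogram into three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))     # placeholder data

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes (labels start at 1)
```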

Representative Example

Spectral Clustering on Non-Convex Data

Problem: Cluster "two moons" dataset where k-means fails

Step 1: Compute RBF similarity: W_ij = exp(-||x_i - x_j||²/0.5)

Step 2: Build normalized Laplacian: L_sym = I - D^(-1/2) W D^(-1/2)

Step 3: Compute first 2 eigenvectors v_1, v_2

Step 4: Apply k-means (k=2) on [v_1 | v_2]

Result: perfect separation of the two moons, while k-means in the original feature space fails.
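
The experiment can be reproduced with scikit-learn (the noise level and random seeds below are illustrative); with the RBF affinity, gamma = 2 corresponds to the divisor of 0.5 used above, since W_ij = exp(-gamma·||x_i - x_j||²).

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# RBF affinity with gamma = 2  <=>  W_ij = exp(-||x_i - x_j||^2 / 0.5)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="rbf", gamma=2.0, random_state=0
).fit_predict(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# spectral_labels follows the two moons; kmeans_labels splits the data with a straight boundary.
```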

Spectral Clustering: Advanced Topics

Choosing Number of Clusters via Eigengap

Eigengap Heuristic

Sort eigenvalues: 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n

Choose k where eigengap λ_(k+1) - λ_k is largest

Intuition:

For k well-separated clusters, the first k eigenvalues are small (near 0), followed by a large jump to λ_(k+1)
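
A small helper implementing this heuristic, assuming L_sym has already been computed as above; k_max simply bounds the search.

```python
import numpy as np

def choose_k_by_eigengap(L_sym, k_max=10):
    """Return the k with the largest gap among the smallest eigenvalues of L_sym."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:k_max + 1]
    gaps = np.diff(eigvals)            # gaps[j] = lambda_{j+2} - lambda_{j+1} (1-indexed)
    return int(np.argmax(gaps)) + 1    # +1 because gaps[0] corresponds to k = 1
```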

Normalized vs Unnormalized Laplacian

Unnormalized L

  • Sensitive to cluster sizes
  • Prefers balanced cuts
  • Eigenvalues not bounded

Normalized L_sym

  • Handles varying cluster sizes
  • More robust in practice
  • Eigenvalues in [0,2]

Computational Efficiency

Bottleneck: Eigendecomposition of n×n matrix

  • Full eigendecomposition: O(n³)
  • Sparse matrices (k-NN graphs): O(nk²) using Lanczos/Arnoldi
  • Nyström approximation: O(m²n) where m ≪ n (landmark points)

For large graphs (n > 10K), use approximate methods!

GMM: Model Selection

Choosing Number of Components K

Bayesian Information Criterion (BIC)

\text{BIC} = -2\log \mathcal{L}(\hat{\theta}) + p\log n

• L = likelihood, p = number of parameters, n = number of samples
• For GMM: p = K(d + d(d+1)/2 + 1) - 1
• Choose K with lowest BIC

Akaike Information Criterion (AIC)

\text{AIC} = -2\log \mathcal{L}(\hat{\theta}) + 2p

• Less penalty than BIC → tends to select larger K
• BIC preferred for model selection, AIC for prediction

Silhouette Score

Compute silhouette score for each K, choose K with highest average score. Model-agnostic alternative to BIC/AIC.
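
With scikit-learn, BIC and AIC are one method call each, so a model-selection sweep over K is a short loop; the data below is a random placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))    # placeholder data

results = []
for K in range(1, 9):
    gmm = GaussianMixture(n_components=K, n_init=3, random_state=0).fit(X)
    results.append((K, gmm.bic(X), gmm.aic(X)))

best_K = min(results, key=lambda r: r[1])[0]           # lowest BIC wins
print(results)
print("Chosen K:", best_K)
```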

Density Peaks Clustering

Finding Cluster Centers by Density and Distance

Key Idea

Cluster centers characterized by:

  1. High local density ρ_i
  2. Large distance δ_i to nearest point with higher density

Density Computation

\rho_i = \sum_{j} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right)

or simply: ρ_i = number of neighbors within cutoff d_c

Distance to Higher Density

\delta_i = \begin{cases} \min_{j:\rho_j > \rho_i} d_{ij} & \text{if } \exists j: \rho_j > \rho_i \\ \max_j d_{ij} & \text{otherwise} \end{cases}

Plot ρ vs. δ: points where both are high are the cluster centers.

Algorithm Steps

1. Compute ρ_i for all points

2. Compute δ_i for all points

3. Select centers: points with γ_i = ρ_i × δ_i above threshold

4. Assign remaining points to nearest center with higher density
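
A NumPy sketch of these four steps with the Gaussian density kernel; the cutoff d_c and the number of centers are left as inputs, and the sketch assumes the densest point is among the selected centers (its γ = ρ·δ is normally the largest).

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Steps 1-4 above with the Gaussian density kernel."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0      # step 1 (subtract self-term)
    order = np.argsort(-rho)                             # indices by decreasing density

    delta = np.empty(len(X))
    nearest_higher = np.full(len(X), -1)
    delta[order[0]] = D[order[0]].max()                  # densest point: delta = max distance
    for rank in range(1, len(X)):
        i = order[rank]
        higher = order[:rank]                            # all points with higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j         # step 2

    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]             # step 3: largest rho * delta
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                      # step 4: assign in decreasing density,
        if labels[i] == -1:                              # so the nearest higher-density point
            labels[i] = labels[nearest_higher[i]]        # is already labeled
    return centers, labels
```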

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. Spectral clustering uses:
2. What does GMM (Gaussian Mixture Model) assume?
3. The EM algorithm for GMM alternates between:
4. Compared to k-means, GMM:
5. In spectral clustering, the unnormalized graph Laplacian is:
6. Affinity propagation determines clusters by:
7. Mean shift clustering:
8. Consensus (ensemble) clustering combines:
9. When is spectral clustering preferred over k-means?
10. What is the main challenge with GMM?