
Cluster Analysis

Discover natural groupings in data using hierarchical clustering, K-means, and validation techniques

Learning Objectives
Understand distance and similarity measures
Apply hierarchical clustering methods
Implement and interpret K-means
Choose optimal number of clusters
Validate cluster quality
Compare different clustering algorithms

Distance Measures

Common Distances

Euclidean

d = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}

Manhattan

d = \sum_{i=1}^p |x_i - y_i|

Mahalanobis

d = \sqrt{(\mathbf{x}-\mathbf{y})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{y})}

Correlation-based

d = 1 - r, where r is the correlation coefficient; measures dissimilarity in pattern rather than magnitude
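As a quick illustration (assuming NumPy and SciPy are available), the sketch below computes each of these distances for two made-up observations; the synthetic sample used to estimate the covariance for the Mahalanobis distance is purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis, correlation

# Two illustrative observations plus a synthetic sample for the covariance estimate
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
sample = np.random.default_rng(0).normal(size=(50, 3))

cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))  # Sigma^{-1} for Mahalanobis

print("Euclidean:  ", euclidean(x, y))           # sqrt(sum (x_i - y_i)^2)
print("Manhattan:  ", cityblock(x, y))           # sum |x_i - y_i|
print("Mahalanobis:", mahalanobis(x, y, cov_inv))
print("Correlation:", correlation(x, y))         # d = 1 - r
```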

Clustering Methods

Hierarchical Clustering

Single Linkage

Min distance. Can create chains.

Complete Linkage

Max distance. Compact clusters.

Average Linkage

Mean pairwise distance. Balanced approach.

Ward's Method

Minimizes variance increase. Often preferred.
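The following sketch (using SciPy's linkage and fcluster on synthetic blobs from scikit-learn, chosen purely for illustration) compares how the four linkage rules partition the same data into three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)  # synthetic data for illustration

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                      # build the hierarchical merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut it into 3 clusters
    print(f"{method:8s} cluster sizes: {np.bincount(labels)[1:]}")
```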

K-Means Algorithm

Objective Function

\min \sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2

Algorithm Steps

  1. Initialize K centroids (random or k-means++)
  2. Assign each point to nearest centroid
  3. Update centroids as cluster means
  4. Repeat steps 2-3 until convergence
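A minimal NumPy sketch of these steps is shown below; the function name kmeans and the plain random initialization are illustrative, not a library API, and the sketch assumes no cluster becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: random init, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]          # step 1
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                 # step 2: nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0)         # step 3: cluster means
                                  for j in range(k)])                 # (assumes no empty cluster)
        if np.allclose(new_centroids, centroids):                     # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids
```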

Choosing K & Validation

Selection Methods

Elbow Method

Plot WSS vs K. Look for elbow where decrease slows.

Silhouette

Average silhouette width. Higher is better (max at 1).

Gap Statistic

Compare to null reference distribution.

Dendrogram

Cut tree where large gaps appear.

Silhouette Coefficient

For each point i, the silhouette coefficient measures clustering quality:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

a(i)

Average distance to points in same cluster

b(i)

Average distance to points in nearest other cluster

Interpretation: s ≈ 1: well-clustered; s ≈ 0: on cluster boundary; s < 0: possibly misclassified
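As a hedged example of putting the silhouette to work, the snippet below uses scikit-learn's silhouette_score and silhouette_samples; the synthetic data and the choice of K = 4 are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # toy data

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))          # average s(i)
s = silhouette_samples(X, labels)                               # per-point s(i)
print("possibly misclassified points (s < 0):", int(np.sum(s < 0)))
```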

Practical Considerations

Data Preprocessing

Standardization

Usually standardize to mean 0, SD 1 so all variables contribute equally

Outliers

Can strongly affect clustering. Consider removal or robust methods.

Missing Data

Impute or use methods that handle missing values

Dimensionality

Consider PCA before clustering if many variables

Cluster Comparison

Method Comparison

K-Means

Fast, spherical clusters, requires K, sensitive to initialization

Hierarchical

No K needed, dendrogram visualization, O(n²) complexity

DBSCAN

Arbitrary shapes, handles noise, requires ε and MinPts

Worked Example

K-Means Example

Given 6 points and K=2:

Iteration 1

Randomly initialize centroids, assign points to nearest centroid

Update

Recompute centroids as cluster means, repeat until convergence

Convergence: Algorithm stops when assignments no longer change or max iterations reached.

Advanced Clustering Methods

DBSCAN (Density-Based)

Density-Based Spatial Clustering of Applications with Noise:

Parameters

ε (neighborhood radius), MinPts (minimum points)

Key Concepts

Core points, border points, noise points

Advantages: No need to specify K, can find arbitrary shapes, identifies outliers
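A minimal scikit-learn sketch is shown below; the two-moons data and the eps/min_samples values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)   # non-spherical toy data
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)         # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```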

Gaussian Mixture Models (GMM)

Probabilistic clustering assuming data comes from mixture of Gaussians:

p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

EM Algorithm

E-step: compute responsibilities; M-step: update parameters

Soft Clustering

Probabilistic assignment P(z_i = k | \mathbf{x}_i)
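The snippet below is a small scikit-learn sketch of soft clustering with a GMM on synthetic data; predict_proba returns the responsibilities described above.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)    # synthetic data

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)      # most probable component per point
resp = gmm.predict_proba(X)       # responsibilities, i.e. P(z_i = k | x_i)
print("membership probabilities of the first point:", resp[0].round(3))
print("mixing weights:", gmm.weights_.round(3))
```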

Spectral Clustering

Uses eigenvalues of similarity matrix:

  1. Construct similarity graph and Laplacian matrix
  2. Compute first k eigenvectors of Laplacian
  3. Apply K-means to embedded representation

Use case: Non-convex clusters, image segmentation, community detection
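A brief scikit-learn sketch on concentric circles (a classic non-convex case) follows; the affinity choice and neighbor count are illustrative.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Concentric circles: a non-convex case where K-means struggles
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in (0, 1)])
```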

Cluster Validation

Internal Validation Indices

Silhouette Coefficient

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Range: [-1, 1]; higher is better

Calinski-Harabasz Index

CH = \frac{\text{tr}(\mathbf{B}_K)/(K-1)}{\text{tr}(\mathbf{W}_K)/(n-K)}

Higher is better

Davies-Bouldin Index

Average similarity between clusters; lower is better

Gap Statistic

Compare within-cluster dispersion to null reference

External Validation (with Ground Truth)

Adjusted Rand Index (ARI)

Measures agreement adjusted for chance; range [-1, 1]

Normalized Mutual Information (NMI)

Information-theoretic measure; range [0, 1]
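When ground-truth labels are available, both indices are one-liners in scikit-learn; the label vectors below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels  = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # known classes (illustrative)
found_labels = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # labels produced by some clustering

print("ARI:", adjusted_rand_score(true_labels, found_labels))
print("NMI:", normalized_mutual_info_score(true_labels, found_labels))
```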

Determining Number of Clusters

Methods for Choosing K

Elbow Method

Plot WSS vs K; look for "elbow" where decline slows

Silhouette Analysis

Choose K with highest average silhouette width

Gap Statistic

Choose smallest K where Gap(K) ≥ Gap(K+1) - SE

Information Criteria

BIC/AIC for model-based clustering (GMM)
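A small sketch combining the elbow and silhouette criteria is shown below; the synthetic data and the range of K values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=7)    # synthetic data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                            # within-cluster SS for the elbow plot
    sil = silhouette_score(X, km.labels_)        # average silhouette width
    print(f"K={k}: WSS={wss:9.1f}  silhouette={sil:.3f}")
```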

Applications

Common Use Cases

Customer Segmentation

Group customers by behavior for targeted marketing

Image Segmentation

Partition images into regions by color/texture

Document Clustering

Group documents by topic for information retrieval

Genomics

Gene expression clustering, species classification

Best Practices

Clustering Guidelines

Data Preparation

  • Standardize variables if scales differ
  • Handle outliers appropriately
  • Consider dimensionality reduction

Method Selection

  • K-means: spherical, similar-sized clusters
  • Hierarchical: explore structure, small data
  • DBSCAN: arbitrary shapes, noise

Remember: Clustering is exploratory. Results should be validated with domain knowledge and multiple methods.

Software Implementation

Common Software

R

kmeans(), hclust(), dbscan::dbscan()

Python

sklearn.cluster (KMeans, DBSCAN, etc.)

SPSS

Analyze → Classify → K-Means/Hierarchical

SAS

PROC CLUSTER, PROC FASTCLUS

Practical Workflow

Recommended Steps
  1. Explore data: distributions, outliers, correlations
  2. Preprocess: standardize, handle missing values
  3. Choose distance metric appropriate for data type
  4. Select algorithm based on data characteristics
  5. Determine number of clusters using multiple criteria
  6. Run clustering and examine results
  7. Validate with internal indices and domain knowledge
  8. Profile clusters to interpret meaning
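As a rough end-to-end sketch of the preprocessing, clustering, and profiling steps above (using the iris measurements purely as example data):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)       # example data only

X = StandardScaler().fit_transform(df)                         # preprocess: standardize
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)    # run the clustering
df["cluster"] = km.labels_

print(df.groupby("cluster").mean().round(2))                   # profile: per-cluster means
```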
Cluster Profiling

After clustering, characterize each cluster:

Descriptive Statistics

Mean, median, SD for each variable by cluster

Visualization

Parallel coordinates, radar charts, cluster plots

Practical Guidelines

Complete Workflow
  1. Preprocess: Handle missing data, standardize variables
  2. Choose method: Based on data characteristics and goals
  3. Determine K: Use elbow, silhouette, or gap statistic
  4. Run algorithm: Multiple initializations for K-means
  5. Validate: Internal and external validation indices
  6. Interpret: Profile clusters using descriptive statistics
  7. Visualize: PCA plots, dendrograms, heatmaps

Mathematical Foundations

Hierarchical Linkage Methods

For clusters C₁ and C₂, linkage distances are:

Single Linkage

d(C_1, C_2) = \min_{i \in C_1, j \in C_2} d(i, j)

Complete Linkage

d(C_1, C_2) = \max_{i \in C_1, j \in C_2} d(i, j)

Average Linkage

d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(i, j)

Ward's Method

\Delta = \frac{|C_1||C_2|}{|C_1|+|C_2|} \|\bar{\mathbf{x}}_{C_1} - \bar{\mathbf{x}}_{C_2}\|^2

K-Means Convergence

The objective function decreases monotonically:

J = \sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2

Guarantee

Always converges to local minimum

Issue

May not find global minimum; run multiple times

Complete Numerical Examples

K-Means Step-by-Step

Given: 6 points, K=2

A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)

Initialize: μ₁=(1,1), μ₂=(5,7)

Iteration 1 - Assign:

C₁ = {A,B,C}, C₂ = {D,E,F}

Update centroids:

\boldsymbol{\mu}_1 = (1.83, 2.33), \quad \boldsymbol{\mu}_2 = (4.33, 5.67)

Result: The assignments do not change on the second pass, so the algorithm converges after two iterations with final WSS ≈ 10.67
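The short NumPy check below reruns this example from the given starting centroids; it reproduces the assignments {A,B,C} and {D,E,F}, the updated centroids, and a final WSS of about 10.67.

```python
import numpy as np

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])  # points A-F
mu = np.array([[1.0, 1.0], [5.0, 7.0]])                               # initial centroids

for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # point-to-centroid distances
    labels = d.argmin(axis=1)                                    # assignment step
    new_mu = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
    if np.allclose(new_mu, mu):                                  # centroids stable -> stop
        break
    mu = new_mu

wss = sum(np.sum((X[labels == k] - mu[k]) ** 2) for k in (0, 1))
print(labels, mu.round(2), round(wss, 2))   # [0 0 0 1 1 1], centroids above, WSS ~ 10.67
```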

Hierarchical Example with Distance Matrix

Distance matrix for 5 points:

\mathbf{D} = \begin{pmatrix} 0 & 2 & 6 & 10 & 9 \\ 2 & 0 & 5 & 9 & 8 \\ 6 & 5 & 0 & 4 & 5 \\ 10 & 9 & 4 & 0 & 3 \\ 9 & 8 & 5 & 3 & 0 \end{pmatrix}

Complete linkage steps:

  1. Merge {1,2} (d=2); Update distance matrix
  2. Merge {4,5} (d=3); Update distances
  3. Merge {3} with {4,5} (d=5)
  4. Finally merge {1,2} with {3,4,5} (d=10, the largest pairwise distance between the two groups)
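These merge heights can be checked with SciPy, which accepts the distance matrix in condensed form; the printed heights should be 2, 3, 5, and 10.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[ 0, 2, 6, 10, 9],
              [ 2, 0, 5,  9, 8],
              [ 6, 5, 0,  4, 5],
              [10, 9, 4,  0, 3],
              [ 9, 8, 5,  3, 0]], dtype=float)

Z = linkage(squareform(D), method="complete")   # condensed distances, complete linkage
print(Z[:, 2])                                  # merge heights: [ 2.  3.  5. 10.]
```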

Initialization Strategies

K-Means++ Algorithm

Smart initialization for better convergence:

  1. Choose first centroid uniformly at random
  2. For each point x, compute D(x) = distance to nearest centroid
  3. Choose next centroid with probability ∝ D(x)²
  4. Repeat until K centroids selected

Advantage

Spreads initial centroids; O(log K) competitive ratio

Performance

Often converges faster than random initialization
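A compact NumPy sketch of the seeding procedure above is given below; the function name kmeanspp_init is illustrative rather than a library routine.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Illustrative k-means++ seeding: D(x)^2-weighted choice of centroids."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                         # step 1: uniform random start
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)  # D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # steps 2-3: sample prop. to D(x)^2
    return np.array(centroids)
```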

Stability Analysis

Bootstrap Methods

Assess cluster reliability through resampling:

  1. Cluster original data
  2. Generate B bootstrap samples (e.g., B=100)
  3. Cluster each bootstrap sample
  4. Compare clusterings using Jaccard coefficient
  5. Stable clusters: Jaccard > 0.75

Jaccard Index

J = \frac{|A \cap B|}{|A \cup B|}

Interpretation

Measures overlap between cluster pairs
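The sketch below is one hedged way to carry out this bootstrap check with scikit-learn's KMeans; the data, the B = 50 resamples, and the matching rule (best Jaccard against any bootstrap cluster, restricted to resampled points) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def jaccard(a, b):
    """Jaccard index between two sets of point indices."""
    return len(a & b) / len(a | b)

X, _ = make_blobs(n_samples=200, centers=3, random_state=3)    # toy data
k, B = 3, 50
rng = np.random.default_rng(0)

base = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
base_sets = [set(np.flatnonzero(base == c)) for c in range(k)]

stability = np.zeros(k)
for _ in range(B):
    idx = rng.choice(len(X), size=len(X), replace=True)        # bootstrap resample
    boot = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
    boot_sets = [set(idx[boot == c]) for c in range(k)]
    in_sample = set(idx)
    for c in range(k):
        a = base_sets[c] & in_sample                           # restrict to resampled points
        stability[c] += max(jaccard(a, b) for b in boot_sets) if a else 0.0
stability /= B
print("per-cluster Jaccard stability:", stability.round(2))    # > 0.75 suggests a stable cluster
```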

Distance Metric Selection

Metric Properties

Valid distance metrics must satisfy:

Non-negativity

d(\mathbf{x}, \mathbf{y}) \geq 0

Identity

d(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}

Symmetry

d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})

Triangle Inequality

d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})

Choosing the Right Metric

Euclidean

Best for continuous data with similar scales

Manhattan

Robust to outliers; grid-like spaces

Mahalanobis

Accounts for correlations; scale-invariant

Cosine

Text data; direction more important than magnitude

Gap Statistic Method

Gap Statistic Definition

Compare within-cluster dispersion to null reference distribution:

\text{Gap}(K) = E^*_n[\log(W_K)] - \log(W_K)

W_K

Within-cluster sum of squares for K clusters

E*_n

Expected value under null (uniform) distribution

Selection rule: Choose smallest K where Gap(K) ≥ Gap(K+1) - SE(K+1)

Algorithm Steps
  1. Cluster data for K=1,2,...,K_max; compute W_K for each
  2. Generate B reference datasets (uniform over bounding box)
  3. Cluster each reference dataset; compute W*_Kb
  4. Calculate Gap(K) = (1/B)Σ log(W*_Kb) - log(W_K)
  5. Compute standard error SE(K)
  6. Choose K where Gap(K) ≥ Gap(K+1) - SE(K+1)
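A compact sketch of these steps is given below; the function name gap_statistic, the uniform bounding-box reference, and the standard-error formula follow the description above but are an illustrative implementation, not a library routine.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=8, B=20, seed=0):
    """Illustrative gap statistic: compare log(W_K) to uniform reference datasets."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)                     # bounding box for the reference
    gaps, ses = [], []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
        ref = np.array([
            np.log(KMeans(n_clusters=k, n_init=10, random_state=0)
                   .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
            for _ in range(B)
        ])                                                    # log(W*_Kb) on B reference datasets
        gaps.append(ref.mean() - log_wk)
        ses.append(ref.std(ddof=1) * np.sqrt(1 + 1 / B))      # simulation-adjusted SE
    # selection rule: smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - ses[k]:
            return k, gaps
    return k_max, gaps
```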

Model-Based Clustering

Gaussian Mixture Models

Assume data generated from mixture of K Gaussian distributions:

p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

Parameters to estimate: \boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K

Mixing Weights

\sum_{k=1}^K \pi_k = 1

Means

\boldsymbol{\mu}_k \in \mathbb{R}^p

Covariances

\boldsymbol{\Sigma}_k \in \mathbb{R}^{p \times p}

EM Algorithm for GMM

E-step: Compute responsibilities (posterior probabilities)

\gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-step: Update parameters

\pi_k = \frac{1}{n}\sum_{i=1}^n \gamma_{ik}

\boldsymbol{\mu}_k = \frac{\sum_{i=1}^n \gamma_{ik}\mathbf{x}_i}{\sum_{i=1}^n \gamma_{ik}}

\boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^n \gamma_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{\sum_{i=1}^n \gamma_{ik}}

Model Selection with BIC

Choose K that minimizes Bayesian Information Criterion:

\text{BIC} = -2\log L + \nu \log n

Log-Likelihood

\log L = \sum_{i=1}^n \log p(\mathbf{x}_i|\boldsymbol{\theta})

Parameters

ν = number of free parameters

Advantage: Balances model fit with complexity; penalizes overfitting
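With scikit-learn's GaussianMixture, BIC-based selection of K is a short loop; the synthetic data and the range of K values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=2)    # synthetic data

bics = []
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bics.append(gmm.bic(X))                    # -2 log L + nu log n (lower is better)
    print(f"K={k}: BIC={bics[-1]:.1f}")

print("chosen K:", int(np.argmin(bics)) + 1)
```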

Hierarchical Clustering Deep Dive

Agglomerative vs Divisive

Agglomerative (Bottom-Up)

  • Start with n clusters (each point)
  • Merge closest clusters iteratively
  • More common, O(n² log n)

Divisive (Top-Down)

  • Start with 1 cluster (all points)
  • Split clusters recursively
  • Less common, O(2ⁿ)

Dendrogram Interpretation

Reading a dendrogram:

Height

Distance at which clusters merge; taller = more dissimilar

Cutting

Horizontal cut at height h gives partition into clusters

Large Gaps

Cut where vertical distance is largest

Balance

Unbalanced tree may indicate outliers or poor linkage choice

Computational Aspects

Time Complexity

K-Means

O(nKpT)

T iterations typically < 100

Hierarchical

O(n^2 \log n)

Can be O(n³) for some linkages

DBSCAN

O(n \log n)

With spatial index; O(n²) without

Scalability Tips
  • Mini-batch K-means: Use random subsets for faster convergence
  • Sampling: Cluster sample, then assign full dataset
  • Dimensionality reduction: PCA before clustering reduces p
  • Approximate methods: LSH for nearest neighbor search
  • Parallelization: K-means easily parallelizable
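As a rough illustration of the mini-batch speedup (timings will vary by machine; the synthetic data are illustrative), compare KMeans and MiniBatchKMeans on the same data:

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, n_features=10, random_state=0)

for Model in (KMeans, MiniBatchKMeans):
    t0 = time.perf_counter()
    model = Model(n_clusters=8, n_init=10, random_state=0).fit(X)
    print(f"{Model.__name__:16s} {time.perf_counter() - t0:5.2f}s  inertia={model.inertia_:.3e}")
```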

Practice Quiz

Test yourself on the key ideas from this section:

  1. Cluster analysis is an example of which type of learning?
  2. How is the Euclidean distance between points \mathbf{x} and \mathbf{y} defined?
  3. How does single linkage define the distance between two clusters?
  4. Which cluster distance does complete linkage use?
  5. What does Ward's method minimize?
  6. What must be specified before running the K-means algorithm?
  7. What objective function does K-means minimize?
  8. What does the elbow method for choosing K plot?
  9. What range can the silhouette coefficient take?
  10. What does a dendrogram show?

FAQ

Should I standardize variables before clustering?

Usually yes, especially if variables have different scales. Otherwise, high-variance variables dominate distance calculations.

How do I choose between hierarchical and K-means?

Hierarchical is good for exploring structure and doesn't require specifying K upfront. K-means is faster for large datasets but needs K specified.

What if clusters have different shapes/sizes?

K-means assumes spherical clusters. Consider DBSCAN, GMM, or spectral clustering for non-spherical clusters.

How many clusters should I use?

Use multiple criteria: the elbow method, silhouette scores, the gap statistic, and domain knowledge. No single method is best; also weigh interpretability.

Can I use categorical variables in clustering?

Not directly with Euclidean distance. Use Gower distance, convert to dummy variables, or use specialized methods like k-modes.
