Discover natural groupings in data using hierarchical clustering, K-means, and validation techniques
Euclidean
d(x, y) = √Σ(xᵢ − yᵢ)²; straight-line distance, the most common choice
Manhattan
d(x, y) = Σ|xᵢ − yᵢ|; sum of absolute coordinate differences
Mahalanobis
d(x, y) = √((x − y)ᵀ S⁻¹ (x − y)); accounts for covariance between variables
Correlation-based
d = 1 − r, where r is the correlation between observations; measures dissimilarity in pattern rather than magnitude
Single Linkage
Min distance. Can create chains.
Complete Linkage
Max distance. Compact clusters.
Average Linkage
Mean pairwise distance. Balanced approach.
Ward's Method
Minimizes variance increase. Often preferred.
Objective Function
Minimize the within-cluster sum of squares (WSS): J = Σₖ Σ_{i∈Cₖ} ‖xᵢ − μₖ‖²
Algorithm Steps
1. Choose K initial centroids; 2. Assign each point to its nearest centroid; 3. Recompute each centroid as the mean of its assigned points; 4. Repeat steps 2–3 until assignments stop changing
Elbow Method
Plot WSS vs K and look for the elbow where the decrease slows (see the sketch below).
Silhouette
Average silhouette width. Higher is better (max at 1).
Gap Statistic
Compare to null reference distribution.
Dendrogram
Cut tree where large gaps appear.
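A minimal Python sketch of the elbow plot, assuming scikit-learn and matplotlib are available; the make_blobs data and the K range 1–10 are illustrative choices, not part of this guide:

    # Elbow method: compute WSS (inertia) for a range of K and plot it.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # illustrative data

    wss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

    plt.plot(range(1, 11), wss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WSS")
    plt.title("Elbow plot: look for the bend where the decrease slows")
    plt.show()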
For each point i, the silhouette coefficient measures clustering quality:
a(i)
Average distance from point i to the other points in its own cluster
b(i)
Average distance from point i to the points in the nearest other cluster
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Interpretation: s ≈ 1: well-clustered; s ≈ 0: on a cluster boundary; s < 0: possibly misclassified
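A short Python sketch computing per-point silhouette values and the average silhouette width with scikit-learn; the synthetic data and K = 3 are assumptions for illustration:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    s_i = silhouette_samples(X, labels)        # one s(i) per point, in [-1, 1]
    print("average silhouette width:", silhouette_score(X, labels))
    print("points with s(i) < 0 (possibly misclassified):", int((s_i < 0).sum()))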
Density-Based Spatial Clustering of Applications with Noise:
Parameters
ε (neighborhood radius), MinPts (minimum points for a core point)
Advantages
No need to specify K, handles arbitrary shapes, identifies noise
Probabilistic model assuming data comes from mixture of Gaussians:
Soft Clustering
Points have probabilities of belonging to each cluster
Estimation
EM algorithm iterates E-step and M-step
Standardization
Usually standardize to mean 0, SD 1 so all variables contribute equally
Outliers
Can strongly affect clustering. Consider removal or robust methods.
Missing Data
Impute or use methods that handle missing values
Dimensionality
Consider PCA before clustering if many variables
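A brief Python sketch of this preprocessing pipeline, standardizing and then reducing dimensionality with PCA before K-means; the random data, the 90% variance threshold, and K = 3 are illustrative assumptions:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))              # stand-in for a real data matrix

    X_std = StandardScaler().fit_transform(X)   # mean 0, SD 1 for every variable
    pca = PCA(n_components=0.9)                 # keep components explaining ~90% of variance
    X_reduced = pca.fit_transform(X_std)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
    print("components kept:", pca.n_components_, "cluster sizes:", np.bincount(labels))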
K-Means
Fast, spherical clusters, requires K, sensitive to initialization
Hierarchical
No K needed, dendrogram visualization, O(n²) complexity
DBSCAN
Arbitrary shapes, handles noise, requires ε and MinPts
Given 6 points and K=2:
Iteration 1
Randomly initialize centroids, assign points to nearest centroid
Update
Recompute centroids as cluster means, repeat until convergence
Convergence: Algorithm stops when assignments no longer change or max iterations reached.
Density-Based Spatial Clustering of Applications with Noise:
Parameters
ε (neighborhood radius), MinPts (minimum points)
Key Concepts
Core points, border points, noise points
Advantages: No need to specify K, can find arbitrary shapes, identifies outliers
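A minimal scikit-learn DBSCAN sketch; eps corresponds to ε and min_samples to MinPts, and the two-moons data plus the parameter values are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # non-spherical shapes
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)

    labels = db.labels_                                  # -1 marks noise points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))
    print("core points:", len(db.core_sample_indices_))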
Probabilistic clustering assuming data comes from mixture of Gaussians:
EM Algorithm
E-step: compute responsibilities; M-step: update parameters
Soft Clustering
Probabilistic assignment
Uses the eigenvectors of the graph Laplacian built from a similarity matrix:
Use case: Non-convex clusters, image segmentation, community detection
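A small Python sketch of spectral clustering on concentric circles (a non-convex case); the rbf affinity and gamma value are illustrative assumptions:

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_circles

    X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
    labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=5.0,
                                assign_labels="kmeans", random_state=0).fit_predict(X)
    print("cluster sizes:", [int((labels == c).sum()) for c in set(labels)])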
Silhouette Coefficient
Range: [-1, 1]; higher is better
Calinski-Harabasz Index
Ratio of between-cluster to within-cluster dispersion; higher is better
Davies-Bouldin Index
Average similarity between clusters; lower is better
Gap Statistic
Compare within-cluster dispersion to null reference
Adjusted Rand Index (ARI)
Measures agreement adjusted for chance; range [-1, 1]
Normalized Mutual Information (NMI)
Information-theoretic measure; range [0, 1]
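A Python sketch computing the internal and external indices above with scikit-learn; simulated make_blobs data are assumed so that ground-truth labels exist for ARI and NMI:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score, adjusted_rand_score,
                                 normalized_mutual_info_score)

    X, y_true = make_blobs(n_samples=400, centers=4, random_state=1)  # illustrative data
    labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

    print("silhouette (higher better):       ", silhouette_score(X, labels))
    print("Calinski-Harabasz (higher better):", calinski_harabasz_score(X, labels))
    print("Davies-Bouldin (lower better):    ", davies_bouldin_score(X, labels))
    print("ARI vs. true labels:              ", adjusted_rand_score(y_true, labels))
    print("NMI vs. true labels:              ", normalized_mutual_info_score(y_true, labels))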
Elbow Method
Plot WSS vs K; look for "elbow" where decline slows
Silhouette Analysis
Choose K with highest average silhouette width
Gap Statistic
Choose smallest K where Gap(K) ≥ Gap(K+1) - SE
Information Criteria
BIC/AIC for model-based clustering (GMM)
Customer Segmentation
Group customers by behavior for targeted marketing
Image Segmentation
Partition images into regions by color/texture
Document Clustering
Group documents by topic for information retrieval
Genomics
Gene expression clustering, species classification
Data Preparation
Standardize variables, handle outliers and missing values, and consider dimension reduction before clustering
Method Selection
Match the algorithm to the data: cluster shape, dataset size, noise, and whether K is known in advance
Remember: Clustering is exploratory. Results should be validated with domain knowledge and multiple methods.
R
kmeans(), hclust(), dbscan::dbscan()
Python
sklearn.cluster (KMeans, DBSCAN, etc.)
SPSS
Analyze → Classify → K-Means/Hierarchical
SAS
PROC CLUSTER, PROC FASTCLUS
After clustering, characterize each cluster:
Descriptive Statistics
Mean, median, SD for each variable by cluster
Visualization
Parallel coordinates, radar charts, cluster plots
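A short pandas sketch of cluster profiling; the variable names (recency, frequency, monetary) and the data are made up for illustration:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)
    df = pd.DataFrame(X, columns=["recency", "frequency", "monetary"])  # hypothetical variables

    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(df))

    # Mean, median, and SD of each variable within each cluster, plus cluster sizes
    print(df.groupby("cluster").agg(["mean", "median", "std"]).round(2))
    print(df["cluster"].value_counts())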
Silhouette Coefficient
Range: [-1, 1]; Higher is better
Davies-Bouldin Index
Lower values indicate better clustering
Density-based clustering that can find arbitrarily shaped clusters:
Probabilistic model-based clustering:
Advantages
Soft clustering; principled probability model
Estimation
EM algorithm for parameter estimation
For clusters C₁ and C₂, linkage distances are:
Single Linkage
d(C₁, C₂) = min { d(a, b) : a ∈ C₁, b ∈ C₂ }
Complete Linkage
d(C₁, C₂) = max { d(a, b) : a ∈ C₁, b ∈ C₂ }
Average Linkage
d(C₁, C₂) = average of d(a, b) over all pairs a ∈ C₁, b ∈ C₂
Ward's Method
Merge the pair of clusters whose union gives the smallest increase in total within-cluster variance
The objective function decreases monotonically: each assignment step and each centroid update can only lower the within-cluster sum of squares, so J(t+1) ≤ J(t).
Guarantee
Always converges to local minimum
Issue
May not find global minimum; run multiple times
Given: 6 points, K=2
A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)
Initialize: μ₁=(1,1), μ₂=(5,7)
Iteration 1 - Assign:
C₁ = {A,B,C}, C₂ = {D,E,F}
Update centroids:
μ₁ = mean(A, B, C) = (1.83, 2.33), μ₂ = mean(D, E, F) = (4.33, 5.67)
Result: Reassigning points to the new centroids leaves both clusters unchanged, so the algorithm has converged; the final WSS is approximately 10.67.
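A Python sketch that re-runs this six-point example with scikit-learn from the same starting centroids; inertia_ reports the final WSS (about 10.67 here):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])  # points A-F
    init = np.array([[1.0, 1.0], [5.0, 7.0]])                             # mu1, mu2

    km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
    print("labels:   ", km.labels_)            # expected [0 0 0 1 1 1] -> {A,B,C}, {D,E,F}
    print("centroids:", km.cluster_centers_)   # approx (1.83, 2.33) and (4.33, 5.67)
    print("final WSS:", km.inertia_)           # approx 10.67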
Distance matrix for 5 points:
Complete linkage steps:
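Since the original distance matrix is not reproduced above, the following SciPy sketch illustrates complete linkage on five made-up 2-D points instead; each row of the linkage matrix records which clusters merge and at what height:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])  # illustrative points

    d = pdist(pts)                       # condensed pairwise distances
    print(np.round(squareform(d), 2))    # the full 5x5 distance matrix

    Z = linkage(d, method="complete")    # columns: cluster i, cluster j, merge height, size
    print(np.round(Z, 2))
    print(fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters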
Smart initialization for better convergence:
Advantage
Spreads initial centroids; O(log K) competitive ratio
Performance
Often converges faster than random initialization
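A quick Python comparison of random and k-means++ seeding in scikit-learn; a single run of each on synthetic data is shown, so the exact numbers will vary:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=6, random_state=3)  # illustrative data

    for init in ("random", "k-means++"):
        km = KMeans(n_clusters=6, init=init, n_init=1, random_state=3).fit(X)
        print(f"{init:10s} iterations={km.n_iter_}  WSS={km.inertia_:.1f}")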
Assess cluster reliability through resampling:
Jaccard Index
J(A, B) = |A ∩ B| / |A ∪ B|
Interpretation
Measures the overlap between each original cluster and its best-matching cluster in a bootstrap re-clustering; values above roughly 0.75 suggest a stable cluster
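A simplified Python sketch of bootstrap stability in the spirit of fpc::clusterboot: re-cluster bootstrap resamples, relabel all points with each bootstrap model, and record the best Jaccard match per original cluster (the data, K, and the number of resamples are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    K, n_boot = 3, 50
    rng = np.random.default_rng(0)

    base = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    orig = [set(np.where(base.labels_ == k)[0]) for k in range(K)]

    jaccard = np.zeros((n_boot, K))
    for b in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)          # bootstrap resample
        boot = KMeans(n_clusters=K, n_init=10, random_state=b).fit(X[idx])
        boot_labels = boot.predict(X)                                # relabel every original point
        boot_sets = [set(np.where(boot_labels == k)[0]) for k in range(K)]
        for k, A in enumerate(orig):
            jaccard[b, k] = max(len(A & B) / len(A | B) for B in boot_sets)

    print("mean Jaccard per cluster:", np.round(jaccard.mean(axis=0), 2))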
Valid distance metrics must satisfy:
Non-negativity
d(x, y) ≥ 0
Identity
d(x, y) = 0 if and only if x = y
Symmetry
d(x, y) = d(y, x)
Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z)
Euclidean
Best for continuous data with similar scales
Manhattan
Robust to outliers; grid-like spaces
Mahalanobis
Accounts for correlations; scale-invariant
Cosine
Text data; direction more important than magnitude
Compare within-cluster dispersion to null reference distribution:
W_K
Within-cluster sum of squares for K clusters
E*_n[log W_K]
Expected value of log W_K under a null (uniform) reference distribution; Gap(K) = E*_n[log W_K] − log W_K
Selection rule: Choose smallest K where Gap(K) ≥ Gap(K+1) - SE(K+1)
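A compact Python sketch of the gap statistic: log(W_K) on the data is compared with its average over uniform reference datasets drawn from the feature-wise bounding box; the number of reference draws and the K range are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def log_wk(X, k, seed=0):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return np.log(km.inertia_)

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    rng = np.random.default_rng(0)
    n_ref, ks = 20, list(range(1, 8))
    lo, hi = X.min(axis=0), X.max(axis=0)

    gap, se = [], []
    for k in ks:
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k, seed=b)
                        for b in range(n_ref)])
        gap.append(ref.mean() - log_wk(X, k))
        se.append(ref.std(ddof=1) * np.sqrt(1 + 1.0 / n_ref))

    # Smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
    for i, k in enumerate(ks[:-1]):
        if gap[i] >= gap[i + 1] - se[i + 1]:
            print("chosen K:", k)
            break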
Assume data generated from mixture of K Gaussian distributions:
Parameters to estimate:
Mixing Weights
π₁, …, π_K, non-negative and summing to 1
Means
μ₁, …, μ_K, one mean vector per component
Covariances
Σ₁, …, Σ_K, one covariance matrix per component
E-step: Compute responsibilities (posterior probabilities)
M-step: Update parameters
Choose K that minimizes Bayesian Information Criterion:
Log-Likelihood
log L̂, the maximized log-likelihood of the fitted mixture
Parameters
ν = number of free parameters; BIC = ν·log(n) − 2·log L̂
Advantage: Balances model fit with complexity; penalizes overfitting
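A Python sketch fitting Gaussian mixtures by EM with scikit-learn and choosing K by the lowest BIC; the simulated data and the candidate range K = 1–6 are illustrative:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.0, 0.5],
                      random_state=0)                    # illustrative data

    bics = {}
    for k in range(1, 7):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(X)     # EM runs inside fit()
        bics[k] = gmm.bic(X)

    best_k = min(bics, key=bics.get)
    print("BIC by K:", {k: round(v, 1) for k, v in bics.items()})
    print("chosen K:", best_k)

    resp = GaussianMixture(n_components=best_k, random_state=0).fit(X).predict_proba(X)
    print("soft assignment of first point:", np.round(resp[0], 3))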
Agglomerative (Bottom-Up)
Start with each point as its own cluster and repeatedly merge the two closest clusters
Divisive (Top-Down)
Start with all points in one cluster and recursively split clusters
Reading a dendrogram:
Height
Distance at which clusters merge; taller = more dissimilar
Cutting
Horizontal cut at height h gives partition into clusters
Large Gaps
Cut where vertical distance is largest
Balance
Unbalanced tree may indicate outliers or poor linkage choice
K-Means
O(n·K·T·d), where T iterations typically < 100
Hierarchical
O(n²) memory; typically O(n² log n) time, up to O(n³) for some linkages
DBSCAN
O(n log n) on average with a spatial index; O(n²) without
Usually yes, especially if variables have different scales. Otherwise, high-variance variables dominate distance calculations.
Hierarchical is good for exploring structure and doesn't require specifying K upfront. K-means is faster for large datasets but needs K specified.
K-means assumes spherical clusters. Consider DBSCAN, GMM, or spectral clustering for non-spherical clusters.
Use multiple criteria: elbow method, silhouette scores, gap statistic, and domain knowledge. No single best method - consider interpretability.
Not directly with Euclidean distance. Use Gower distance, convert to dummy variables, or use specialized methods like k-modes.