Discover natural groupings in data using hierarchical clustering, K-means, and validation techniques
Euclidean
d(x, y) = √Σ(xᵢ − yᵢ)²; straight-line distance, the most common choice
Manhattan
d(x, y) = Σ|xᵢ − yᵢ|; sum of absolute coordinate differences
Mahalanobis
d(x, y) = √((x − y)ᵀ S⁻¹ (x − y)); accounts for covariance between variables
Correlation-based
d = 1 − r, where r is the correlation between observations; measures dissimilarity in pattern rather than magnitude
Single Linkage
Min distance. Can create chains.
Complete Linkage
Max distance. Compact clusters.
Average Linkage
Mean pairwise distance. Balanced approach.
Ward's Method
Minimizes variance increase. Often preferred.
Objective Function
Minimize the within-cluster sum of squares (WSS): J = Σₖ Σ_{i∈Cₖ} ‖xᵢ − μₖ‖²
Algorithm Steps
1. Choose K initial centroids; 2. Assign each point to its nearest centroid; 3. Recompute each centroid as the mean of its assigned points; 4. Repeat steps 2–3 until assignments stop changing
Elbow Method
Plot WSS vs K and look for the elbow where the decrease slows (see the sketch below).
Silhouette
Average silhouette width. Higher is better (max at 1).
Gap Statistic
Compare to null reference distribution.
Dendrogram
Cut tree where large gaps appear.
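A minimal Python sketch of the elbow plot, assuming scikit-learn and matplotlib are available; the make_blobs data and the K range 1–10 are illustrative choices, not part of this guide:

    # Elbow method: compute WSS (inertia) for a range of K and plot it.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # illustrative data

    wss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

    plt.plot(range(1, 11), wss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WSS")
    plt.title("Elbow plot: look for the bend where the decrease slows")
    plt.show()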
For each point i, the silhouette coefficient measures clustering quality:
a(i)
Average distance from point i to the other points in its own cluster
b(i)
Average distance from point i to the points in the nearest other cluster
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Interpretation: s ≈ 1: well-clustered; s ≈ 0: on a cluster boundary; s < 0: possibly misclassified
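A short Python sketch computing per-point silhouette values and the average silhouette width with scikit-learn; the synthetic data and K = 3 are assumptions for illustration:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    s_i = silhouette_samples(X, labels)        # one s(i) per point, in [-1, 1]
    print("average silhouette width:", silhouette_score(X, labels))
    print("points with s(i) < 0 (possibly misclassified):", int((s_i < 0).sum()))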
Density-Based Spatial Clustering of Applications with Noise:
Parameters
ε (neighborhood radius), MinPts (minimum points for a core point)
Advantages
No need to specify K, handles arbitrary shapes, identifies noise
Probabilistic model assuming data comes from mixture of Gaussians:
Soft Clustering
Points have probabilities of belonging to each cluster
Estimation
EM algorithm iterates E-step and M-step
Standardization
Usually standardize to mean 0, SD 1 so all variables contribute equally
Outliers
Can strongly affect clustering. Consider removal or robust methods.
Missing Data
Impute or use methods that handle missing values
Dimensionality
Consider PCA before clustering if many variables
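A brief Python sketch of this preprocessing pipeline, standardizing and then reducing dimensionality with PCA before K-means; the random data, the 90% variance threshold, and K = 3 are illustrative assumptions:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))              # stand-in for a real data matrix

    X_std = StandardScaler().fit_transform(X)   # mean 0, SD 1 for every variable
    pca = PCA(n_components=0.9)                 # keep components explaining ~90% of variance
    X_reduced = pca.fit_transform(X_std)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
    print("components kept:", pca.n_components_, "cluster sizes:", np.bincount(labels))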
K-Means
Fast, spherical clusters, requires K, sensitive to initialization
Hierarchical
No K needed, dendrogram visualization, O(n²) complexity
DBSCAN
Arbitrary shapes, handles noise, requires ε and MinPts
Given 6 points and K=2:
Iteration 1
Randomly initialize centroids, assign points to nearest centroid
Update
Recompute centroids as cluster means, repeat until convergence
Convergence: Algorithm stops when assignments no longer change or max iterations reached.
Density-Based Spatial Clustering of Applications with Noise:
Parameters
ε (neighborhood radius), MinPts (minimum points)
Key Concepts
Core points, border points, noise points
Advantages: No need to specify K, can find arbitrary shapes, identifies outliers
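A minimal scikit-learn DBSCAN sketch; eps corresponds to ε and min_samples to MinPts, and the two-moons data plus the parameter values are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # non-spherical shapes
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)

    labels = db.labels_                                  # -1 marks noise points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))
    print("core points:", len(db.core_sample_indices_))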
Probabilistic clustering assuming data comes from mixture of Gaussians:
EM Algorithm
E-step: compute responsibilities; M-step: update parameters
Soft Clustering
Probabilistic assignment
Uses the eigenvectors of the graph Laplacian built from a similarity matrix:
Use case: Non-convex clusters, image segmentation, community detection
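A small Python sketch of spectral clustering on concentric circles (a non-convex case); the rbf affinity and gamma value are illustrative assumptions:

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_circles

    X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
    labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=5.0,
                                assign_labels="kmeans", random_state=0).fit_predict(X)
    print("cluster sizes:", [int((labels == c).sum()) for c in set(labels)])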
Silhouette Coefficient
Range: [-1, 1]; higher is better
Calinski-Harabasz Index
Ratio of between-cluster to within-cluster dispersion; higher is better
Davies-Bouldin Index
Average similarity between clusters; lower is better
Gap Statistic
Compare within-cluster dispersion to null reference
Adjusted Rand Index (ARI)
Measures agreement adjusted for chance; range [-1, 1]
Normalized Mutual Information (NMI)
Information-theoretic measure; range [0, 1]
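A Python sketch computing the internal and external indices above with scikit-learn; simulated make_blobs data are assumed so that ground-truth labels exist for ARI and NMI:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score, adjusted_rand_score,
                                 normalized_mutual_info_score)

    X, y_true = make_blobs(n_samples=400, centers=4, random_state=1)  # illustrative data
    labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

    print("silhouette (higher better):       ", silhouette_score(X, labels))
    print("Calinski-Harabasz (higher better):", calinski_harabasz_score(X, labels))
    print("Davies-Bouldin (lower better):    ", davies_bouldin_score(X, labels))
    print("ARI vs. true labels:              ", adjusted_rand_score(y_true, labels))
    print("NMI vs. true labels:              ", normalized_mutual_info_score(y_true, labels))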
Elbow Method
Plot WSS vs K; look for "elbow" where decline slows
Silhouette Analysis
Choose K with highest average silhouette width
Gap Statistic
Choose smallest K where Gap(K) ≥ Gap(K+1) - SE
Information Criteria
BIC/AIC for model-based clustering (GMM)
Customer Segmentation
Group customers by behavior for targeted marketing
Image Segmentation
Partition images into regions by color/texture
Document Clustering
Group documents by topic for information retrieval
Genomics
Gene expression clustering, species classification
Data Preparation
Standardize variables, handle outliers and missing values, and consider dimension reduction before clustering
Method Selection
Match the algorithm to the data: cluster shape, dataset size, noise, and whether K is known in advance
Remember: Clustering is exploratory. Results should be validated with domain knowledge and multiple methods.
R
kmeans(), hclust(), dbscan::dbscan()
Python
sklearn.cluster (KMeans, DBSCAN, etc.)
SPSS
Analyze → Classify → K-Means/Hierarchical
SAS
PROC CLUSTER, PROC FASTCLUS
After clustering, characterize each cluster:
Descriptive Statistics
Mean, median, SD for each variable by cluster
Visualization
Parallel coordinates, radar charts, cluster plots
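A short pandas sketch of cluster profiling; the variable names (recency, frequency, monetary) and the data are made up for illustration:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)
    df = pd.DataFrame(X, columns=["recency", "frequency", "monetary"])  # hypothetical variables

    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(df))

    # Mean, median, and SD of each variable within each cluster, plus cluster sizes
    print(df.groupby("cluster").agg(["mean", "median", "std"]).round(2))
    print(df["cluster"].value_counts())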
Silhouette Coefficient
Range: [-1, 1]; Higher is better
Davies-Bouldin Index
Lower values indicate better clustering
Density-based clustering that can find arbitrarily shaped clusters:
Probabilistic model-based clustering:
Advantages
Soft clustering; principled probability model
Estimation
EM algorithm for parameter estimation
For clusters C₁ and C₂, linkage distances are:
Single Linkage
d(C₁, C₂) = min { d(a, b) : a ∈ C₁, b ∈ C₂ }
Complete Linkage
d(C₁, C₂) = max { d(a, b) : a ∈ C₁, b ∈ C₂ }
Average Linkage
d(C₁, C₂) = average of d(a, b) over all pairs a ∈ C₁, b ∈ C₂
Ward's Method
Merge the pair of clusters whose union gives the smallest increase in total within-cluster variance
The objective function decreases monotonically: each assignment step and each centroid update can only lower the within-cluster sum of squares, so J(t+1) ≤ J(t).
Guarantee
Always converges to local minimum
Issue
May not find global minimum; run multiple times
Given: 6 points, K=2
A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)
Initialize: μ₁=(1,1), μ₂=(5,7)
Iteration 1 - Assign:
C₁ = {A,B,C}, C₂ = {D,E,F}
Update centroids:
μ₁ = mean(A, B, C) = (1.83, 2.33), μ₂ = mean(D, E, F) = (4.33, 5.67)
Result: Reassigning points to the new centroids leaves both clusters unchanged, so the algorithm has converged; the final WSS is approximately 10.67.
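A Python sketch that re-runs this six-point example with scikit-learn from the same starting centroids; inertia_ reports the final WSS (about 10.67 here):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])  # points A-F
    init = np.array([[1.0, 1.0], [5.0, 7.0]])                             # mu1, mu2

    km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
    print("labels:   ", km.labels_)            # expected [0 0 0 1 1 1] -> {A,B,C}, {D,E,F}
    print("centroids:", km.cluster_centers_)   # approx (1.83, 2.33) and (4.33, 5.67)
    print("final WSS:", km.inertia_)           # approx 10.67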
Distance matrix for 5 points:
Complete linkage steps:
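Since the original distance matrix is not reproduced above, the following SciPy sketch illustrates complete linkage on five made-up 2-D points instead; each row of the linkage matrix records which clusters merge and at what height:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])  # illustrative points

    d = pdist(pts)                       # condensed pairwise distances
    print(np.round(squareform(d), 2))    # the full 5x5 distance matrix

    Z = linkage(d, method="complete")    # columns: cluster i, cluster j, merge height, size
    print(np.round(Z, 2))
    print(fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters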
Smart initialization for better convergence:
Advantage
Spreads initial centroids; O(log K) competitive ratio
Performance
Often converges faster than random initialization
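A quick Python comparison of random and k-means++ seeding in scikit-learn; a single run of each on synthetic data is shown, so the exact numbers will vary:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=6, random_state=3)  # illustrative data

    for init in ("random", "k-means++"):
        km = KMeans(n_clusters=6, init=init, n_init=1, random_state=3).fit(X)
        print(f"{init:10s} iterations={km.n_iter_}  WSS={km.inertia_:.1f}")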
Assess cluster reliability through resampling:
Jaccard Index
J(A, B) = |A ∩ B| / |A ∪ B|
Interpretation
Measures the overlap between each original cluster and its best-matching cluster in a bootstrap re-clustering; values above roughly 0.75 suggest a stable cluster
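A simplified Python sketch of bootstrap stability in the spirit of fpc::clusterboot: re-cluster bootstrap resamples, relabel all points with each bootstrap model, and record the best Jaccard match per original cluster (the data, K, and the number of resamples are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    K, n_boot = 3, 50
    rng = np.random.default_rng(0)

    base = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    orig = [set(np.where(base.labels_ == k)[0]) for k in range(K)]

    jaccard = np.zeros((n_boot, K))
    for b in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)          # bootstrap resample
        boot = KMeans(n_clusters=K, n_init=10, random_state=b).fit(X[idx])
        boot_labels = boot.predict(X)                                # relabel every original point
        boot_sets = [set(np.where(boot_labels == k)[0]) for k in range(K)]
        for k, A in enumerate(orig):
            jaccard[b, k] = max(len(A & B) / len(A | B) for B in boot_sets)

    print("mean Jaccard per cluster:", np.round(jaccard.mean(axis=0), 2))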
Valid distance metrics must satisfy:
Non-negativity
d(x, y) ≥ 0
Identity
d(x, y) = 0 if and only if x = y
Symmetry
d(x, y) = d(y, x)
Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z)
Euclidean
Best for continuous data with similar scales
Manhattan
Robust to outliers; grid-like spaces
Mahalanobis
Accounts for correlations; scale-invariant
Cosine
Text data; direction more important than magnitude
Compare within-cluster dispersion to null reference distribution:
W_K
Within-cluster sum of squares for K clusters
E*_n[log W_K]
Expected value of log W_K under a null (uniform) reference distribution; Gap(K) = E*_n[log W_K] − log W_K
Selection rule: Choose smallest K where Gap(K) ≥ Gap(K+1) - SE(K+1)
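A compact Python sketch of the gap statistic: log(W_K) on the data is compared with its average over uniform reference datasets drawn from the feature-wise bounding box; the number of reference draws and the K range are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def log_wk(X, k, seed=0):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return np.log(km.inertia_)

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
    rng = np.random.default_rng(0)
    n_ref, ks = 20, list(range(1, 8))
    lo, hi = X.min(axis=0), X.max(axis=0)

    gap, se = [], []
    for k in ks:
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k, seed=b)
                        for b in range(n_ref)])
        gap.append(ref.mean() - log_wk(X, k))
        se.append(ref.std(ddof=1) * np.sqrt(1 + 1.0 / n_ref))

    # Smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
    for i, k in enumerate(ks[:-1]):
        if gap[i] >= gap[i + 1] - se[i + 1]:
            print("chosen K:", k)
            break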
Assume data generated from mixture of K Gaussian distributions:
Parameters to estimate:
Mixing Weights
π₁, …, π_K, non-negative and summing to 1
Means
μ₁, …, μ_K, one mean vector per component
Covariances
Σ₁, …, Σ_K, one covariance matrix per component
E-step: Compute responsibilities (posterior probabilities)
M-step: Update parameters
Choose K that minimizes Bayesian Information Criterion:
Log-Likelihood
log L̂, the maximized log-likelihood of the fitted mixture
Parameters
ν = number of free parameters; BIC = ν·log(n) − 2·log L̂
Advantage: Balances model fit with complexity; penalizes overfitting
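A Python sketch fitting Gaussian mixtures by EM with scikit-learn and choosing K by the lowest BIC; the simulated data and the candidate range K = 1–6 are illustrative:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.0, 0.5],
                      random_state=0)                    # illustrative data

    bics = {}
    for k in range(1, 7):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(X)     # EM runs inside fit()
        bics[k] = gmm.bic(X)

    best_k = min(bics, key=bics.get)
    print("BIC by K:", {k: round(v, 1) for k, v in bics.items()})
    print("chosen K:", best_k)

    resp = GaussianMixture(n_components=best_k, random_state=0).fit(X).predict_proba(X)
    print("soft assignment of first point:", np.round(resp[0], 3))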
Agglomerative (Bottom-Up)
Start with each point as its own cluster and repeatedly merge the two closest clusters
Divisive (Top-Down)
Start with all points in one cluster and recursively split clusters
Reading a dendrogram:
Height
Distance at which clusters merge; taller = more dissimilar
Cutting
Horizontal cut at height h gives partition into clusters
Large Gaps
Cut where vertical distance is largest
Balance
Unbalanced tree may indicate outliers or poor linkage choice
K-Means
O(n·K·T·d), where T iterations typically < 100
Hierarchical
O(n²) memory; typically O(n² log n) time, up to O(n³) for some linkages
DBSCAN
O(n log n) on average with a spatial index; O(n²) without
Usually yes, especially if variables have different scales. Otherwise, high-variance variables dominate distance calculations.
Hierarchical is good for exploring structure and doesn't require specifying K upfront. K-means is faster for large datasets but needs K specified.
K-means assumes spherical clusters. Consider DBSCAN, GMM, or spectral clustering for non-spherical clusters.
Use multiple criteria: elbow method, silhouette scores, gap statistic, and domain knowledge. No single best method - consider interpretability.
Not directly with Euclidean distance. Use Gower distance, convert to dummy variables, or use specialized methods like k-modes.