Learn hierarchical clustering methods including AGNES (agglomerative) and DIANA (divisive). Understand dendrograms, cluster distance measures, and flexible cluster number selection.
Hierarchical clustering builds a tree-like structure (a dendrogram) that represents clusters at different levels of granularity. Unlike k-means, it does not require the number of clusters to be specified beforehand; you can choose the level at which to cut the tree after clustering.
AGNES (Agglomerative)
Bottom-up: Start with each sample as a cluster, iteratively merge closest clusters
DIANA (Divisive)
Top-down: Start with all samples in one cluster, iteratively split clusters
AGNES is a bottom-up approach that starts with each sample as its own cluster and iteratively merges the closest clusters:
1. Start with m clusters (each sample is its own cluster).
2. Calculate the distance between all pairs of clusters and find the closest pair.
3. Merge the two closest clusters into one and update the cluster distances.
4. Repeat steps 2-3 until all samples are in one cluster, recording each merge to build the dendrogram.
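The merge loop above can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not an efficient implementation: it recomputes every cluster distance on each pass. The function names `agnes` and `average_linkage`, the choice of average distance as the default cluster distance, and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def average_linkage(a, b, points):
    """Mean pairwise distance between clusters a and b (lists of point indices)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agnes(points, cluster_distance=average_linkage):
    """Naive agglomerative clustering; returns the merge history."""
    # Every sample starts as its own cluster, keyed by a cluster id.
    clusters = {i: [i] for i in range(len(points))}
    merges = []  # (cluster_id_a, cluster_id_b, merge_distance, new_cluster_id)
    next_id = len(points)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]], points),
        )
        d = cluster_distance(clusters[a], clusters[b], points)
        # Merge the pair; distances are simply recomputed on the next pass.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, next_id))
        next_id += 1
    return merges


# Hypothetical sample data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
merges = agnes(points)
for a, b, d, new_id in merges:
    print(f"merge {a} + {b} at distance {d:.2f} -> cluster {new_id}")
```

The returned merge history is exactly the information a dendrogram displays: which clusters were joined, and at what distance.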
The choice of how to measure distance between clusters (linkage criterion) significantly affects results:
Single linkage
Distance between clusters = minimum distance between any two points in different clusters.
d_min(Cᵢ, Cⱼ) = min dist(x, y), where x in Cᵢ, y in Cⱼ
Tends to create elongated clusters (chaining effect).
Complete linkage
Distance between clusters = maximum distance between any two points in different clusters.
d_max(Cᵢ, Cⱼ) = max dist(x, y), where x in Cᵢ, y in Cⱼ
Tends to create compact, spherical clusters.
Average linkage
Distance between clusters = average distance between all pairs of points in different clusters.
d_avg(Cᵢ, Cⱼ) = (1 / (|Cᵢ| · |Cⱼ|)) Σ dist(x, y), where x in Cᵢ, y in Cⱼ
Balanced approach, commonly used in practice.
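To make the three measures d_min, d_max and d_avg concrete, the sketch below evaluates them on two tiny clusters. It assumes Euclidean distance for dist(x, y); the helper `pairwise` and the sample clusters `c_i` and `c_j` are made up for the illustration.

```python
from math import dist  # Euclidean distance, assumed here for dist(x, y)


def pairwise(ci, cj):
    """All distances dist(x, y) with x in ci and y in cj."""
    return [dist(x, y) for x in ci for y in cj]


def d_min(ci, cj):
    """Minimum distance between points of the two clusters."""
    return min(pairwise(ci, cj))


def d_max(ci, cj):
    """Maximum distance between points of the two clusters."""
    return max(pairwise(ci, cj))


def d_avg(ci, cj):
    """Average over all |ci| * |cj| cross-cluster pairs."""
    return sum(pairwise(ci, cj)) / (len(ci) * len(cj))


# Two hypothetical clusters on a line; the cross distances are 4, 6, 3 and 5.
c_i = [(0.0, 0.0), (1.0, 0.0)]
c_j = [(4.0, 0.0), (6.0, 0.0)]
print(d_min(c_i, c_j), d_max(c_i, c_j), d_avg(c_i, c_j))  # prints: 3.0 6.0 4.5
```

The same pair of clusters is "3 apart", "6 apart" or "4.5 apart" depending on the measure, which is why the order of merges, and therefore the final tree, can differ between criteria.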
DIANA is a top-down approach that starts with all samples in one cluster and iteratively splits clusters:
Algorithm:
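Note: DIANA is computationally more expensive than AGNES. An exhaustive search over every way to split a cluster of n samples is O(2ⁿ), so practical implementations rely on heuristics, whereas AGNES runs in O(n² log n) with efficient implementations. DIANA is therefore typically used for smaller datasets.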
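As an illustration of a single top-down split, the sketch below uses the "splinter group" heuristic commonly described for DIANA: seed a new group with the point that is farthest, on average, from the rest, then keep moving over any point that is closer to the new group than to what remains. Treat it as one plausible reading under stated assumptions rather than a reference implementation; the function names, Euclidean distance, and sample points are all choices made for the example. A full divisive run would repeat this split on the resulting clusters until every sample stands alone.

```python
from math import dist  # Euclidean distance, assumed for this sketch


def mean_distance(p, group, points):
    """Average distance from point index p to the other indices in group."""
    others = [q for q in group if q != p]
    if not others:
        return 0.0
    return sum(dist(points[p], points[q]) for q in others) / len(others)


def split_cluster(cluster, points):
    """Split one cluster (list of point indices) into (remaining, splinter)."""
    remaining = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the rest.
    seed = max(remaining, key=lambda p: mean_distance(p, remaining, points))
    remaining.remove(seed)
    splinter = [seed]

    while len(remaining) > 1:
        # A positive gain means the point is closer to the splinter group
        # than to the other points still in the original cluster.
        gains = {
            p: mean_distance(p, remaining, points) - mean_distance(p, splinter, points)
            for p in remaining
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # nobody else wants to move; the split is final
        remaining.remove(best)
        splinter.append(best)
    return remaining, splinter


# Hypothetical data: three points near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0)]
remaining, splinter = split_cluster(list(range(len(points))), points)
print(sorted(remaining), sorted(splinter))  # prints: [0, 1, 2] [3, 4]
```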
A dendrogram is a tree diagram showing the hierarchical relationship between clusters. The height of branches represents the distance at which clusters were merged.
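In practice the tree is usually built and cut with a library. The sketch below uses SciPy's `scipy.cluster.hierarchy` module, assuming NumPy and SciPy are installed; the sample array `X`, the choice of average linkage, and the cut values are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical data: two tight pairs and one distant point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [10.0, 10.0]])

# Each row of Z records one merge: the two cluster ids joined,
# the height (distance) of the merge, and the size of the new cluster.
Z = linkage(X, method="average")

# Choose the number of clusters after the fact...
labels_k3 = fcluster(Z, t=3, criterion="maxclust")
# ...or cut at a height: merges above this distance are not applied.
labels_h2 = fcluster(Z, t=2.0, criterion="distance")

print(Z[:, 2])    # merge heights, lowest first
print(labels_k3)  # flat cluster label per sample
print(labels_h2)

# no_plot=True computes the tree layout without drawing it;
# drop the argument to draw the dendrogram with matplotlib.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right plotting order
```

Both cuts operate on the same `Z`, so trying a different number of clusters does not require re-running the clustering.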