Master density-based clustering with DBSCAN. Learn how to identify clusters based on sample density, handle non-spherical clusters, and automatically detect noise points.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups samples in high-density regions and identifies outliers as noise. Unlike k-means, DBSCAN can find clusters of arbitrary shape and doesn't require specifying the number of clusters.
Clusters are formed by connecting samples that are "density-reachable" from core objects (samples in high-density regions).
Key advantages:
Understanding these concepts is essential for DBSCAN:
A sample x is a core object if its ε-neighborhood contains at least MinPts samples (including itself).
|N_ε(x)| ≥ MinPts
Core objects are in high-density regions and form the "backbone" of clusters.
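The core-object test can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the helper names `eps_neighborhood` and `core_objects` and the toy data are assumptions made for this example, and Euclidean distance is assumed as the metric.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def core_objects(points, eps, min_pts):
    """Indices of points whose eps-neighborhood holds at least min_pts points."""
    return [
        i for i in range(len(points))
        if len(eps_neighborhood(points, i, eps)) >= min_pts
    ]

# Toy data (hypothetical): four corners of a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
core = core_objects(points, eps=1.5, min_pts=3)
print(core)
```

With these values the four square corners each see all four corners within ε and qualify as core objects, while the distant point has only itself in its neighborhood.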
Sample xⱼ is density-reachable from core object xᵢ if there exists a chain of core objects connecting them.
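Formally, xⱼ is density-reachable from xᵢ if there exists a sequence p₁, p₂, ..., pₙ with p₁ = xᵢ and pₙ = xⱼ such that p₁, ..., pₙ₋₁ are all core objects and each pₖ₊₁ lies in the ε-neighborhood of pₖ. The relation is not symmetric: a non-core (border) sample can be reached from a core object, but no sample is density-reachable from a non-core sample.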
Two samples xᵢ and xⱼ are density-connected if there exists a core object xₖ such that both xᵢ and xⱼ are density-reachable from xₖ.
Density-connected samples belong to the same cluster.
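To make the definitions above concrete, here is a minimal sketch of how clusters can be grown from core objects by following density-reachability. It is a simplified brute-force version (every neighborhood query scans all points), assumes Euclidean distance, and uses names such as `region_query`, `dbscan`, and `NOISE` that are chosen for this example only.

```python
import math
from collections import deque

NOISE = -1  # label for samples not density-reachable from any core object

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        # points[i] is a core object: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core, so keep expanding through it
        cluster_id += 1
    return labels

# Toy data (hypothetical): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Only core objects push their neighbors onto the queue, which mirrors the requirement that every intermediate point in a density-reachability chain be a core object. Border points are absorbed into a cluster but never extend it.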
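Choosing ε and MinPts well is crucial to the quality of the clustering: if ε is too small, most samples are labeled as noise; if it is too large, separate clusters merge into one.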
Use the k-distance graph method:
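As a rough sketch of the computation behind a k-distance graph: for each sample, take the distance to its k-th nearest neighbor, then sort those distances. Plotting the sorted values and looking for a sharp bend (the "elbow") suggests a candidate ε. The function name `k_distances` and the toy data are assumptions for this example. Because MinPts counts the sample itself, k = MinPts - 1 is one reasonable convention when the sample is excluded from its own neighbor list, as it is here.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Toy data (hypothetical): a dense unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
kd = k_distances(points, k=2)
print(kd)
```

In this toy case the distant point produces one large k-distance, followed by a flat run of small values for the dense square. An ε chosen just above the flat run would keep the square together and leave the distant point as noise.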
Common choices: