Machine Learning/Learning Center/Clustering/Density Clustering

DBSCAN Density Clustering

Master density-based clustering with DBSCAN. Learn how to identify clusters based on sample density, handle non-spherical clusters, and automatically detect noise points.

Module 8 of 9

Intermediate to Advanced

80-100 min

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups samples in high-density regions and identifies outliers as noise. Unlike k-means, DBSCAN can find clusters of arbitrary shape and doesn't require specifying the number of clusters.

Core Principle

Clusters are formed by connecting samples that are "density-reachable" from core objects (samples in high-density regions).

Key advantages:

• No need to specify number of clusters
• Can find clusters of arbitrary shape
• Automatically identifies noise/outliers
• Robust to outliers

Parameters

• ε (epsilon): Neighborhood radius
• MinPts: Minimum points for core object

Output

• Cluster assignments
• Noise point identification
• Number of clusters (discovered)

Core Concepts

Understanding these concepts is essential for DBSCAN:

Core Object

A sample x is a core object if its ε-neighborhood contains at least MinPts samples (including itself).

|N_ε(x)| >= MinPts

Core objects are in high-density regions and form the "backbone" of clusters.

Density-Reachable

Sample xⱼ is density-reachable from core object xᵢ if there exists a chain of core objects connecting them.

xⱼ is density-reachable from xᵢ if there exists a sequence p₁, p₂, ..., pₙ where p₁ = xᵢ, pₙ = xⱼ, and each subsequent point in the sequence is in the ε-neighborhood of the previous core object.

Density-Connected

Two samples xᵢ and xⱼ are density-connected if there exists a core object xₖ such that both xᵢ and xⱼ are density-reachable from xₖ.

Density-connected samples belong to the same cluster.

Parameter Selection

Choosing appropriate ε and MinPts is crucial for DBSCAN performance:

ε (Epsilon) Selection

Use the k-distance graph method:

1. For each sample, find distance to k-th nearest neighbor
2. Plot sorted k-distances
3. Choose ε at the "knee" (sharp increase point)

MinPts Selection

Common choices:

• MinPts = 4: For 2D data
• MinPts = dim + 1: For higher dimensions
• MinPts = 5: General default

Next: Hierarchical Clustering