MathIsimple

DBSCAN Density Clustering

Master density-based clustering with DBSCAN. Learn how to identify clusters based on sample density, handle non-spherical clusters, and automatically detect noise points.

Module 8 of 9
Intermediate to Advanced
80-100 min

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups samples in high-density regions and identifies outliers as noise. Unlike k-means, DBSCAN can find clusters of arbitrary shape and doesn't require specifying the number of clusters.

Core Principle

Clusters are formed by connecting samples that are "density-reachable" from core objects (samples in high-density regions).

Key advantages:

  • • No need to specify number of clusters
  • • Can find clusters of arbitrary shape
  • • Automatically identifies noise/outliers
  • • Robust to outliers

Parameters

  • ε (epsilon): Neighborhood radius
  • MinPts: Minimum points for core object

Output

  • • Cluster assignments
  • • Noise point identification
  • • Number of clusters (discovered)

Core Concepts

Understanding these concepts is essential for DBSCAN:

Core Object

A sample x is a core object if its ε-neighborhood contains at least MinPts samples (including itself).

|N_ε(x)| >= MinPts

Core objects are in high-density regions and form the "backbone" of clusters.

Density-Reachable

Sample xⱼ is density-reachable from core object xᵢ if there exists a chain of core objects connecting them.

xⱼ is density-reachable from xᵢ if there exists a sequence p₁, p₂, ..., pₙ where p₁ = xᵢ, pₙ = xⱼ, and each subsequent point in the sequence is in the ε-neighborhood of the previous core object.

Density-Connected

Two samples xᵢ and xⱼ are density-connected if there exists a core object xₖ such that both xᵢ and xⱼ are density-reachable from xₖ.

Density-connected samples belong to the same cluster.

Parameter Selection

Choosing appropriate ε and MinPts is crucial for DBSCAN performance:

ε (Epsilon) Selection

Use the k-distance graph method:

  1. 1. For each sample, find distance to k-th nearest neighbor
  2. 2. Plot sorted k-distances
  3. 3. Choose ε at the "knee" (sharp increase point)

MinPts Selection

Common choices:

  • MinPts = 4: For 2D data
  • MinPts = dim + 1: For higher dimensions
  • MinPts = 5: General default