
Clustering Basics

Master fundamental clustering algorithms: k-means, hierarchical clustering, and DBSCAN

K-Means Clustering

Partitioning into K Clusters

Algorithm

  1. Initialize: Choose initial centroids (randomly or with k-means++)
  2. Assignment: Assign each point to its nearest centroid:
     c_i = \arg\min_{k} \|x_i - \mu_k\|^2
  3. Update: Recalculate each centroid as the mean of its assigned points:
     \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
  4. Repeat: Alternate steps 2-3 until convergence (centroids no longer change)
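
As a concrete illustration, here is a minimal sketch of this loop using scikit-learn's KMeans; the library, the synthetic make_blobs data, and K=3 are choices made for this example, not part of the original material.

```python
# Minimal k-means sketch (illustrative; K=3 and the synthetic blobs are example choices).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)          # assignment/update steps run to convergence
centroids = km.cluster_centers_     # final centroids (mu_k)
wcss = km.inertia_                  # within-cluster sum of squares (J)

print("Cluster sizes:", np.bincount(labels))
print(f"WCSS (inertia): {wcss:.2f}")
```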

Advantages

  • Simple and fast
  • Scales well to large datasets
  • Guaranteed to converge
  • Works well for spherical clusters

Limitations

  • Must specify K beforehand
  • Sensitive to initialization
  • Assumes spherical, equal-sized clusters
  • Sensitive to outliers
  • Only works with numerical data

K-Means++ Initialization

Better initialization strategy to avoid poor local minima:

  1. Choose the first centroid uniformly at random
  2. For each remaining point, compute the distance D(x) to its nearest chosen centroid
  3. Choose the next centroid with probability proportional to D(x)² (farther points are more likely)
  4. Repeat steps 2-3 until K centroids are chosen
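
A small sketch, assuming scikit-learn and synthetic data, that compares random seeding against k-means++ by the final WCSS of single runs; the dataset and K=5 are arbitrary example choices.

```python
# Sketch: random vs. k-means++ initialization, compared by final WCSS.
# n_init=1 so the effect of the seeding itself is visible.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for init in ("random", "k-means++"):
    inertias = [
        KMeans(n_clusters=5, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(10)
    ]
    print(f"{init:10s}  mean WCSS = {np.mean(inertias):.1f}  worst = {np.max(inertias):.1f}")
```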

Hierarchical Clustering

Building a Dendrogram

Agglomerative (Bottom-Up) Algorithm

  1. Start: Each point is its own cluster (n clusters)
  2. Merge: Find the two closest clusters and merge them
  3. Repeat: Until only one cluster remains
  4. Dendrogram: The tree records the merge history. Cut it at the desired height to get clusters
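
A minimal sketch of this bottom-up procedure using SciPy's hierarchy module; Ward linkage, the synthetic blobs, and the cut into 3 clusters are arbitrary example choices.

```python
# Sketch: agglomerative clustering with SciPy — build the merge history
# (linkage matrix), then cut the dendrogram to obtain flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                      # (n-1) x 4 merge history
labels = fcluster(Z, t=3, criterion="maxclust")    # cut to get 3 clusters
print(np.bincount(labels)[1:])                     # cluster sizes (labels start at 1)

# dendrogram(Z)  # with matplotlib available, this draws the merge tree
```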

Linkage Methods (Distance Between Clusters)

Single Linkage (Min)

d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)

✗ Chain effect (long, thin clusters)

Complete Linkage (Max)

d(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)

✓ Compact clusters, less sensitive to outliers

Average Linkage

d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)

✓ Balanced, popular choice

Centroid/Ward Linkage

Centroid linkage uses the distance between cluster centroids; Ward linkage merges the pair that minimizes the increase in within-cluster variance

✓ Often best performance
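
For comparison, a sketch that runs the four linkage options through scikit-learn's AgglomerativeClustering and scores each with the silhouette coefficient; the data and n_clusters=3 are illustrative assumptions, and silhouette is just one possible comparison criterion.

```python
# Sketch: comparing linkage methods by silhouette score.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=7)

for method in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    print(f"{method:9s} silhouette = {silhouette_score(X, labels):.3f}")
```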

Advantages

  • No need to specify K beforehand
  • Dendrogram provides a full hierarchy
  • Deterministic (no random initialization)
  • Can use any distance metric

Limitations

  • O(n² log n) to O(n³) time complexity
  • Doesn't scale to very large datasets
  • Merge decisions are irreversible
  • Sensitive to noise/outliers

DBSCAN

Density-Based Spatial Clustering

Key Concepts

Core Point

Has ≥ MinPts neighbors within radius ε

Border Point

Has < MinPts neighbors but lies within ε of a core point

Noise Point

Neither core nor border

Algorithm

  1. For each unvisited point p:
     • If p is a core point: start a new cluster and expand it by adding all density-reachable points
     • Else: mark p as noise (it may become a border point later)
  2. Two core points within each other's ε-neighborhood belong to the same cluster
  3. Border points are assigned to the nearest core point's cluster
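
A minimal usage sketch with scikit-learn's DBSCAN; the moons dataset and the eps/min_samples values are example choices. Noise points come back labeled -1.

```python
# Sketch: DBSCAN with scikit-learn (eps and min_samples are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, core points: {len(db.core_sample_indices_)}, "
      f"noise points: {np.sum(labels == -1)}")
```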

Parameters

ε (eps)

Neighborhood radius. Use a k-distance plot to choose it: plot the sorted distance from each point to its k-th nearest neighbor (with k = MinPts) and look for the 'elbow'

MinPts

Minimum points to form dense region. Rule of thumb: MinPts ≥ dimensionality + 1, often 5-10 works well
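
A sketch of the k-distance heuristic, assuming scikit-learn's NearestNeighbors and the same kind of example data as above; a reasonable eps is whatever value sits near the elbow of the sorted curve.

```python
# Sketch: k-distance plot for choosing eps (k = MinPts).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)
min_pts = 5

# +1 because the query point is returned as its own nearest neighbor (distance 0)
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])      # distance to the min_pts-th true neighbor

# With matplotlib: plt.plot(k_dist); the y-value at the elbow suggests eps.
print("90th percentile of k-distances:", np.quantile(k_dist, 0.90))
```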

Advantages

  • No need to specify the number of clusters K
  • Finds arbitrarily shaped clusters
  • Robust to outliers (marks them as noise)
  • Only 2 parameters (often intuitive)

Limitations

  • Struggles with clusters of varying density
  • Sensitive to the choice of ε and MinPts
  • O(n²) worst case (can be improved with spatial indexing)
  • Challenging for high-dimensional data

Evaluating Clusters

Internal Metrics (No Labels)

Silhouette Coefficient

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

a(i) = average distance to points in the same cluster, b(i) = average distance to the nearest other cluster. Range [-1, 1]; higher is better.

Davies-Bouldin Index

Ratio of within-cluster to between-cluster distances. Lower is better.

Calinski-Harabasz Index

Ratio of between-cluster variance to within-cluster variance. Higher is better.
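
A sketch computing all three internal metrics with scikit-learn on an example k-means labeling; the data, K, and random seeds are arbitrary choices for illustration.

```python
# Sketch: internal clustering metrics from sklearn.metrics.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=400, centers=4, random_state=3)
labels = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)

print("Silhouette        :", round(silhouette_score(X, labels), 3))        # higher is better
print("Davies-Bouldin    :", round(davies_bouldin_score(X, labels), 3))    # lower is better
print("Calinski-Harabasz :", round(calinski_harabasz_score(X, labels), 1)) # higher is better
```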

External Metrics (With Labels)

Adjusted Rand Index (ARI)

Similarity between true and predicted clustering. Range [-1, 1], 1 = perfect match, 0 = random.

Normalized Mutual Information (NMI)

Mutual information between cluster and class labels, normalized. Range [0, 1], 1 = perfect.

Purity

Fraction of correctly clustered points (assuming majority label per cluster). Range [0, 1].
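
A sketch of the external metrics, assuming ground-truth labels are available. ARI and NMI come from scikit-learn; purity is a small hand-rolled helper (the function name is ours, not a library API).

```python
# Sketch: external clustering metrics against known labels.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

X, y_true = make_blobs(n_samples=400, centers=4, random_state=3)
y_pred = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)

def purity(y_true, y_pred):
    # For each cluster, count the majority true label, then divide by n.
    cm = contingency_matrix(y_true, y_pred)
    return cm.max(axis=0).sum() / cm.sum()

print("ARI   :", round(adjusted_rand_score(y_true, y_pred), 3))
print("NMI   :", round(normalized_mutual_info_score(y_true, y_pred), 3))
print("Purity:", round(purity(y_true, y_pred), 3))
```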

Example: Customer Segmentation

Clustering 1000 Customers by Purchase Behavior

Features (Standardized)

  • Annual spending
  • Purchase frequency
  • Average order value
  • Days since last purchase

Method Comparison

Method                 Clusters Found   Silhouette Score   Insights
K-means (k=4)          4                0.52               VIP, Regular, Occasional, Inactive
Hierarchical (Ward)    5                0.48               More granular segments
DBSCAN                 3 + noise        0.55               Identified 50 outliers (fraud?)

Business Value

DBSCAN's ability to detect outliers revealed potentially fraudulent accounts. K-means provided clean segments for targeted marketing campaigns. Each method offers different insights!

K-Means Convergence Proof

Lloyd's Algorithm Always Converges

Objective Function (WCSS)

Within-Cluster Sum of Squares:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

K-means minimizes J, but finding global optimum is NP-hard.

Convergence Theorem

Theorem: K-means algorithm converges in finite iterations.

Proof:

Step 1 (Assignment step decreases J):

Assigning x_i to nearest centroid μ_k minimizes ||x_i - μ_k||². Therefore J(new assignments) ≤ J(old).

Step 2 (Update step decreases J):

New centroid μ_k = (1/|C_k|)Σ_(x∈C_k) x minimizes Σ||x - μ_k||² (sample mean minimizes squared distances).

Step 3 (Monotonic decrease):

J decreases at each step (or stays same). J is bounded below by 0.

Step 4 (Finite partitions):

Only finitely many ways to partition n points into K clusters. Since J strictly decreases unless converged, cannot revisit same partition. Must converge in finite steps. ∎
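
To see the monotone decrease concretely, here is a bare-bones NumPy sketch of Lloyd's iteration that records J after every pass; the data generation, K=3, and the stopping tolerance are example choices.

```python
# Sketch: Lloyd's algorithm with J (WCSS) tracked each pass — J never increases.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]   # random initialization

history = []
for _ in range(50):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: centroids become cluster means (keep old centroid if cluster empties)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
    history.append(float(((X - new_centroids[labels]) ** 2).sum()))   # J after this pass
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))  # monotone decrease
print(f"converged in {len(history)} passes, final J = {history[-1]:.2f}")
```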

Time Complexity Analysis

Per iteration:

  • Assignment: For each of n points, compute distance to K centroids → O(nKd)
  • Update: Sum d-dimensional points in each cluster → O(nd)
  • Per iteration total: O(nKd)

Total complexity:

O(I \cdot nKd)

where I = number of iterations (typically <100)

K-Means++ Initialization Analysis

Probabilistic Seeding for Better Solutions

Algorithm

  1. Choose first center μ_1 uniformly at random
  2. For each point x, compute D(x) = distance to nearest chosen center
  3. Choose next center μ_i with probability proportional to D(x)²
  4. Repeat 2-3 until K centers chosen
  5. Run standard K-means
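
A NumPy sketch of the D(x)² seeding step; the function name kmeans_pp_init is ours for illustration, and in practice libraries such as scikit-learn expose this via init="k-means++".

```python
# Sketch: k-means++ seeding — sample each new center with probability
# proportional to the squared distance to the nearest chosen center.
import numpy as np

def kmeans_pp_init(X, K, rng=np.random.default_rng(0)):
    n = len(X)
    centers = [X[rng.integers(n)]]                      # step 1: uniform first center
    for _ in range(K - 1):
        # D(x)^2: squared distance to the nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                           # proportional to D(x)^2
        centers.append(X[rng.choice(n, p=probs)])       # step 3: weighted draw
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))
print(kmeans_pp_init(X, K=4))                           # feed these to standard k-means
```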

Theoretical Guarantee

Theorem (Arthur & Vassilvitskii, 2007):

\mathbb{E}[J_{\text{K-means++}}] \leq 8(\ln K + 2) \cdot J_{\text{opt}}

Expected solution is O(log K)-competitive with optimal! Much better than random initialization.

Why D(x)² Weighting Works

Intuition: Points far from existing centers have higher chance of being selected.

  • Spreads out the initial centers
  • D(x)² gives preference to outliers (avoids crowding)
  • Balances exploration (far points) vs. exploitation (dense regions)

DBSCAN: Formal Definitions

Density-Based Spatial Clustering

Core Definitions

ε-Neighborhood:

N_\epsilon(p) = \{q \in D : \text{dist}(p, q) \leq \epsilon\}

Set of points within distance ε of p

Core Point:

Point p is a core point if |N_ε(p)| ≥ MinPts

Directly Density-Reachable:

q is directly density-reachable from p if:

  1. q ∈ N_ε(p)
  2. p is a core point

Density-Reachable (transitive):

q is density-reachable from p if there exists chain p_1,...,p_n where:

p_1 = p, p_n = q, and p_(i+1) is directly density-reachable from p_i

Density-Connected:

p and q are density-connected if ∃ core point o such that both p and q are density-reachable from o

Cluster Definition

Formal Definition: A cluster C w.r.t. ε and MinPts satisfies:

  1. Maximality: ∀p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
  2. Connectivity: ∀p, q ∈ C: p and q are density-connected

Time Complexity

Naive: For each point, find ε-neighborhood → O(n²)

With spatial index (R*-tree, KD-tree): O(n log n)

Algorithm steps:

  1. Find all ε-neighborhoods: O(n log n) with index
  2. Identify core points: O(n)
  3. Connect core points: O(n) with union-find
  4. Assign border points: O(n)
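
A sketch of steps 1-2 with a KD-tree index via scikit-learn's NearestNeighbors; eps, MinPts, and the moons data are example choices. Connecting the core points and attaching border points would complete the algorithm.

```python
# Sketch: index-accelerated eps-neighborhood queries and core-point detection.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.07, random_state=0)
eps, min_pts = 0.2, 5

nn = NearestNeighbors(radius=eps, algorithm="kd_tree").fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)   # N_eps(p) for every p

is_core = np.array([len(nbrs) >= min_pts for nbrs in neighborhoods])
print(f"core points: {is_core.sum()} / {len(X)}")
# Connecting core points (e.g. with union-find over shared neighborhoods)
# and assigning border points would yield the final clusters.
```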

Silhouette Coefficient Derivation

Measuring Clustering Quality

Definition for Single Point

For point i in cluster C_I:

a(i) - Average distance to points in same cluster:

a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I, j \neq i} d(i, j)

b(i) - Average distance to nearest other cluster:

b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

Silhouette coefficient:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Interpretation

  • s(i) ≈ 1:  a(i) << b(i), well-clustered
  • s(i) ≈ 0:  a(i) ≈ b(i), on the border between clusters
  • s(i) ≈ -1: a(i) > b(i), likely mis-clustered

Average Silhouette Score

S = \frac{1}{n} \sum_{i=1}^{n} s(i)

Used to evaluate clustering quality and choose optimal K. Higher is better.

Complexity:

O(n²) - need to compute all pairwise distances
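
A sketch that evaluates these formulas directly from a pairwise-distance matrix (hence the O(n²) cost) and cross-checks the average against scikit-learn's silhouette_score; the data and K are arbitrary example choices.

```python
# Sketch: silhouette coefficient computed from the definitions above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=5)
labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)

D = squareform(pdist(X))                       # n x n pairwise distances
s = np.zeros(len(X))
for i in range(len(X)):
    same = (labels == labels[i])
    a = D[i, same].sum() / (same.sum() - 1)    # mean distance within own cluster (excl. self)
    b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
    s[i] = (b - a) / max(a, b)

print(round(s.mean(), 6), round(silhouette_score(X, labels), 6))  # the two should agree
```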

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main goal of clustering?
  2. In k-means, what does the 'assignment step' do?
  3. K-means minimizes which objective?
  4. What is a limitation of k-means?
  5. In hierarchical agglomerative clustering, which linkage method computes distance between cluster centroids?
  6. What does DBSCAN stand for?
  7. In DBSCAN, a point is a 'core point' if:
  8. Which metric measures clustering quality using cohesion and separation?
  9. The elbow method for choosing K in k-means looks for:
  10. Which clustering algorithm naturally handles noise and outliers?