
K-Means Clustering

Master the most popular clustering algorithm. Learn how k-means groups data into k clusters by iteratively updating cluster centers to minimize within-cluster variance.

Module 5 of 9
Intermediate Level
80-100 min

What is K-Means Clustering?

K-means is the most widely used clustering algorithm. It partitions data into k clusters by minimizing the sum of squared distances between samples and their assigned cluster centers. The algorithm uses the mean vector of each cluster as the prototype.

Objective Function

\text{Minimize: } E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2

Where:

  • k = number of clusters
  • C_i = set of samples assigned to cluster i
  • \mu_i = mean vector (centroid) of cluster i
  • \|x - \mu_i\|^2 = squared Euclidean distance between sample x and its centroid
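As a quick illustration, here is a minimal NumPy sketch of how this objective could be evaluated, assuming X holds the samples, labels their cluster assignments, and centers the k mean vectors (the names are illustrative, not part of any library API):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared Euclidean distances from each sample to its assigned centroid."""
    return sum(((X[labels == i] - mu) ** 2).sum() for i, mu in enumerate(centers))
```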

Advantages

  • Simple and easy to understand
  • Fast and efficient (O(nkd) per iteration)
  • Works well for spherical clusters
  • Scales to large datasets

Limitations

  • Requires specifying k (number of clusters)
  • Sensitive to initial centers
  • Assumes spherical clusters
  • May converge to a local optimum

K-Means Algorithm Steps

The k-means algorithm follows a simple iterative process:

Step 1

Initialize Cluster Centers

Randomly select k samples from the dataset as initial cluster centers (prototypes).

Details: Each center represents one cluster. Common initialization: random selection or k-means++ method.
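A minimal sketch of this step, assuming the data is already a NumPy array X of shape (n_samples, n_features); the function name is just for illustration:

```python
import numpy as np

def init_centers(X, k, seed=0):
    """Randomly pick k distinct samples as the initial cluster centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()
```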

Step 2

Assign Samples to Clusters

For each sample, calculate distance to all k cluster centers and assign to the nearest center.

Details: This creates the current cluster partition. Each sample belongs to exactly one cluster (hard clustering).
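In code, the assignment step could look like the following sketch, where X is the data matrix and centers holds the current k centroids:

```python
import numpy as np

def assign_clusters(X, centers):
    """Assign every sample to the nearest cluster center (hard assignment)."""
    # dists[i, j] = Euclidean distance from sample i to center j
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```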

Step 3

Update Cluster Centers

For each cluster, calculate the mean vector of all samples in that cluster. This becomes the new cluster center.

Details: The mean vector minimizes the sum of squared distances within the cluster (centroid property).
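A corresponding sketch of the update step (it assumes no cluster ends up empty; a robust implementation would re-seed empty clusters):

```python
import numpy as np

def update_centers(X, labels, k):
    """Recompute each center as the mean of the samples currently assigned to it."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```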

Step 4

Check Convergence

If cluster centers haven't changed (or changed less than threshold), stop. Otherwise, return to step 2.

Details: Convergence is guaranteed (algorithm will eventually stop), but may reach local optimum.
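Putting the four steps together, a minimal end-to-end sketch of the loop could look like this (illustrative only; it uses random initialization and assumes no cluster goes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means: alternate assignment and update until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: initialize
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers have (almost) stopped moving
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```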

Initialization Methods

The choice of initial cluster centers significantly affects k-means results. Poor initialization can lead to local optima or slow convergence.

Random Initialization

Randomly select k samples from the dataset as initial centers. Simple but can lead to poor results.

Problems:

  • Centers may be too close together
  • May miss important clusters
  • Requires multiple runs to find a good solution (see the sketch below)
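One common mitigation is to run the algorithm several times from different random seeds and keep the partition with the lowest WCSS. A sketch, reusing the kmeans function outlined in the step-by-step section above:

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run k-means n_runs times from different seeds; keep the lowest-WCSS result."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(X, k, seed=seed)   # kmeans as sketched earlier
        wcss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
        if best is None or wcss < best[0]:
            best = (wcss, centers, labels)
    return best   # (wcss, centers, labels) of the best run
```

scikit-learn's KMeans exposes the same idea through its n_init parameter.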

K-Means++ Initialization

Recommended method. Selects initial centers to be far apart, improving convergence and final cluster quality.

Algorithm:

  1. Randomly select the first center from the data
  2. For each remaining center (i = 2 to k):
     • Calculate the distance from each sample to its nearest existing center
     • Select the next center with probability proportional to distance²

Advantage: Centers are well-spread, reducing chance of poor local optima. Typically requires fewer iterations to converge.
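A minimal NumPy sketch of the k-means++ selection rule (X is the data matrix; the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centers that tend to be spread far apart (k-means++ rule)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]                  # first center: uniformly at random
    for _ in range(1, k):
        # squared distance from each sample to its nearest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                       # probability proportional to distance^2
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

scikit-learn's KMeans uses this scheme by default (init='k-means++').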

Selecting the Number of Clusters (k)

K-means requires specifying k (number of clusters) beforehand. The elbow method is a common technique for choosing k.

Elbow Method

Plot the within-cluster sum of squares (WCSS) or inertia against k. Look for the "elbow" - the point where increasing k no longer significantly reduces WCSS.

\text{WCSS}(k) = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2

As k increases, WCSS decreases. The elbow is where the rate of decrease slows significantly.

Example: Elbow Plot Interpretation

  • k = 1: WCSS = 10,000 (all samples in one cluster)
  • k = 2: WCSS = 3,500 (large decrease of 6,500)
  • k = 3: WCSS = 1,200 (large decrease of 2,300) - elbow point
  • k = 4: WCSS = 800 (small decrease of 400)
  • k = 5: WCSS = 600 (small decrease of 200)

In this example, k = 3 is the natural choice: the large reductions in WCSS stop after k = 3, so increasing k further provides only diminishing returns.
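In practice the elbow curve is produced by fitting k-means for a range of k values and recording the WCSS (exposed as inertia_ in scikit-learn). A sketch on synthetic data, not the numbers from the example above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a few well-separated groups, purely for illustration
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k = {k}: WCSS = {km.inertia_:.1f}")   # plot these values and look for the elbow
```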

Customer Segmentation Example

Apply k-means to segment 200 e-commerce customers based on their behavior (a sample of 8 customers is shown below):

Dataset: Customer Behavior

ID   Age   Income     Spending   Visits/Month   Cluster
1    28    $45,000    $3,200     12             Cluster 1
2    45    $85,000    $8,500     8              Cluster 2
3    22    $28,000    $1,200     15             Cluster 1
4    52    $120,000   $12,000    6              Cluster 2
5    35    $65,000    $4,800     10             Cluster 1
6    19    $22,000    $800       18             Cluster 1
7    48    $95,000    $9,800     7              Cluster 2
8    31    $55,000    $3,800     11             Cluster 1

After k-means clustering (k=2), customers are grouped into two segments based on their purchasing behavior.

Cluster 1: Budget-Conscious

  • Lower income ($22k-$65k)
  • Lower spending ($800-$4,800)
  • More frequent visits (10-18/month)
  • Younger age (19-35 years)

Cluster 2: Affluent

  • Higher income ($85k-$120k)
  • Higher spending ($8,500-$12,000)
  • Fewer visits (6-8/month)
  • Older age (45-52 years)
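A sketch of how this segmentation could be reproduced with scikit-learn on the eight sample customers from the table. Feature scaling is assumed here, since income would otherwise dominate the Euclidean distance, and the numeric cluster labels may come out swapped relative to the table:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# [age, annual income, spending, visits/month] for the 8 customers shown above
X = np.array([
    [28,  45_000,  3_200, 12],
    [45,  85_000,  8_500,  8],
    [22,  28_000,  1_200, 15],
    [52, 120_000, 12_000,  6],
    [35,  65_000,  4_800, 10],
    [19,  22_000,    800, 18],
    [48,  95_000,  9_800,  7],
    [31,  55_000,  3_800, 11],
])

X_scaled = StandardScaler().fit_transform(X)    # put all features on the same scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)                               # segment label for each customer
```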

K-Medoids Variant

K-medoids (also called PAM - Partitioning Around Medoids) is a variant of k-means that uses actual data points (medoids) instead of mean vectors as cluster centers. This makes it more robust to outliers.

K-Means (Mean)

  • Uses mean vector (may not be a real sample)
  • Sensitive to outliers
  • Faster computation
  • Works well for continuous data

K-Medoids (Actual Sample)

  • Uses actual data point (medoid)
  • Robust to outliers
  • Slower computation (O(n²k))
  • Works for any distance metric
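For illustration, here is a simplified k-medoids sketch that alternates assignment and medoid updates. It is a lightweight variant, not the full PAM swap search, and it assumes no cluster goes empty:

```python
import numpy as np

def kmedoids(X, k, max_iter=100, seed=0):
    """Simplified k-medoids: the cluster centers are always actual samples (medoids)."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    # Precompute pairwise Euclidean distances; any metric could be plugged in here.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)    # assign each sample to nearest medoid
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # New medoid: the member minimizing total distance to the rest of its cluster.
            new_idx[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):     # medoids unchanged: converged
            break
        medoid_idx = new_idx
    return X[medoid_idx], labels
```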