
K-Means Clustering

Master the most popular clustering algorithm. Learn how k-means groups data into k clusters by iteratively updating cluster centers to minimize within-cluster variance.

Module 5 of 9
Intermediate Level
80-100 min

What is K-Means Clustering?

K-means is the most widely used clustering algorithm. It partitions data into k clusters by minimizing the sum of squared distances between samples and their assigned cluster centers. The algorithm uses the mean vector of each cluster as the prototype.

Objective Function

\text{Minimize: } E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2

Where:

  • k = number of clusters
  • C_i = set of samples assigned to cluster i
  • \mu_i = mean vector (centroid) of cluster i
  • \|x - \mu_i\|^2 = squared Euclidean distance between sample x and its centroid
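As a quick illustration, here is a minimal NumPy sketch of how this objective could be evaluated, assuming X holds the samples, labels their cluster assignments, and centers the k mean vectors (the names are illustrative, not part of any library API):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared Euclidean distances from each sample to its assigned centroid."""
    return sum(((X[labels == i] - mu) ** 2).sum() for i, mu in enumerate(centers))
```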

Advantages

  • Simple and easy to understand
  • Fast and efficient (O(nkd) per iteration)
  • Works well for spherical clusters
  • Scales to large datasets

Limitations

  • Requires specifying k (number of clusters)
  • Sensitive to initial centers
  • Assumes spherical clusters
  • May converge to a local optimum

K-Means Algorithm Steps

The k-means algorithm follows a simple iterative process:

Step 1

Initialize Cluster Centers

Randomly select k samples from the dataset as initial cluster centers (prototypes).

Details: Each center represents one cluster. Common initialization: random selection or k-means++ method.
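A minimal sketch of this step, assuming the data is already a NumPy array X of shape (n_samples, n_features); the function name is just for illustration:

```python
import numpy as np

def init_centers(X, k, seed=0):
    """Randomly pick k distinct samples as the initial cluster centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()
```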

Step 2

Assign Samples to Clusters

For each sample, calculate distance to all k cluster centers and assign to the nearest center.

Details: This creates the current cluster partition. Each sample belongs to exactly one cluster (hard clustering).
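In code, the assignment step could look like the following sketch, where X is the data matrix and centers holds the current k centroids:

```python
import numpy as np

def assign_clusters(X, centers):
    """Assign every sample to the nearest cluster center (hard assignment)."""
    # dists[i, j] = Euclidean distance from sample i to center j
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```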

Step 3

Update Cluster Centers

For each cluster, calculate the mean vector of all samples in that cluster. This becomes the new cluster center.

Details: The mean vector minimizes the sum of squared distances within the cluster (centroid property).
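A corresponding sketch of the update step (it assumes no cluster ends up empty; a robust implementation would re-seed empty clusters):

```python
import numpy as np

def update_centers(X, labels, k):
    """Recompute each center as the mean of the samples currently assigned to it."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```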

Step 4

Check Convergence

If cluster centers haven't changed (or changed less than threshold), stop. Otherwise, return to step 2.

Details: Convergence is guaranteed (algorithm will eventually stop), but may reach local optimum.
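Putting the four steps together, a minimal end-to-end sketch of the loop could look like this (illustrative only; it uses random initialization and assumes no cluster goes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means: alternate assignment and update until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: initialize
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers have (almost) stopped moving
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```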

Initialization Methods

The choice of initial cluster centers significantly affects k-means results. Poor initialization can lead to local optima or slow convergence.

Random Initialization

Randomly select k samples from the dataset as initial centers. Simple but can lead to poor results.

Problems:

  • Centers may be too close together
  • May miss important clusters
  • Requires multiple runs to find a good solution (see the sketch below)
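One common mitigation is to run the algorithm several times from different random seeds and keep the partition with the lowest WCSS. A sketch, reusing the kmeans function outlined in the step-by-step section above:

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run k-means n_runs times from different seeds; keep the lowest-WCSS result."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(X, k, seed=seed)   # kmeans as sketched earlier
        wcss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
        if best is None or wcss < best[0]:
            best = (wcss, centers, labels)
    return best   # (wcss, centers, labels) of the best run
```

scikit-learn's KMeans exposes the same idea through its n_init parameter.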

K-Means++ Initialization

Recommended method. Selects initial centers to be far apart, improving convergence and final cluster quality.

Algorithm:

  1. Randomly select the first center from the data
  2. For each remaining center (i = 2 to k):
     • Calculate the distance from each sample to its nearest existing center
     • Select the next center with probability proportional to distance²

Advantage: Centers are well-spread, reducing chance of poor local optima. Typically requires fewer iterations to converge.
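A minimal NumPy sketch of the k-means++ selection rule (X is the data matrix; the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centers that tend to be spread far apart (k-means++ rule)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]                  # first center: uniformly at random
    for _ in range(1, k):
        # squared distance from each sample to its nearest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                       # probability proportional to distance^2
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

scikit-learn's KMeans uses this scheme by default (init='k-means++').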

Selecting the Number of Clusters (k)

K-means requires specifying k (number of clusters) beforehand. The elbow method is a common technique for choosing k.

Elbow Method

Plot the within-cluster sum of squares (WCSS) or inertia against k. Look for the "elbow" - the point where increasing k no longer significantly reduces WCSS.

\text{WCSS}(k) = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2

As k increases, WCSS decreases. The elbow is where the rate of decrease slows significantly.

Example: Elbow Plot Interpretation

  • k = 1: WCSS = 10,000 (all samples in one cluster)
  • k = 2: WCSS = 3,500 (large decrease of 6,500)
  • k = 3: WCSS = 1,200 (large decrease of 2,300) - elbow point
  • k = 4: WCSS = 800 (small decrease of 400)
  • k = 5: WCSS = 600 (small decrease of 200)

In this example, k = 3 is the natural choice: the large reductions in WCSS stop after k = 3, so increasing k further provides only diminishing returns.
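In practice the elbow curve is produced by fitting k-means for a range of k values and recording the WCSS (exposed as inertia_ in scikit-learn). A sketch on synthetic data, not the numbers from the example above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a few well-separated groups, purely for illustration
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k = {k}: WCSS = {km.inertia_:.1f}")   # plot these values and look for the elbow
```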

Customer Segmentation Example

Apply k-means to segment 200 e-commerce customers based on their behavior (a sample of 8 customers is shown below):

Dataset: Customer Behavior

ID   Age   Income     Spending   Visits/Month   Cluster
1    28    $45,000    $3,200     12             Cluster 1
2    45    $85,000    $8,500     8              Cluster 2
3    22    $28,000    $1,200     15             Cluster 1
4    52    $120,000   $12,000    6              Cluster 2
5    35    $65,000    $4,800     10             Cluster 1
6    19    $22,000    $800       18             Cluster 1
7    48    $95,000    $9,800     7              Cluster 2
8    31    $55,000    $3,800     11             Cluster 1

After k-means clustering (k=2), customers are grouped into two segments based on their purchasing behavior.

Cluster 1: Budget-Conscious

  • Lower income ($22k-$65k)
  • Lower spending ($800-$4,800)
  • More frequent visits (10-18/month)
  • Younger age (19-35 years)

Cluster 2: Affluent

  • Higher income ($85k-$120k)
  • Higher spending ($8,500-$12,000)
  • Fewer visits (6-8/month)
  • Older age (45-52 years)
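A sketch of how this segmentation could be reproduced with scikit-learn on the eight sample customers from the table. Feature scaling is assumed here, since income would otherwise dominate the Euclidean distance, and the numeric cluster labels may come out swapped relative to the table:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# [age, annual income, spending, visits/month] for the 8 customers shown above
X = np.array([
    [28,  45_000,  3_200, 12],
    [45,  85_000,  8_500,  8],
    [22,  28_000,  1_200, 15],
    [52, 120_000, 12_000,  6],
    [35,  65_000,  4_800, 10],
    [19,  22_000,    800, 18],
    [48,  95_000,  9_800,  7],
    [31,  55_000,  3_800, 11],
])

X_scaled = StandardScaler().fit_transform(X)    # put all features on the same scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)                               # segment label for each customer
```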

K-Medoids Variant

K-medoids (also called PAM - Partitioning Around Medoids) is a variant of k-means that uses actual data points (medoids) instead of mean vectors as cluster centers. This makes it more robust to outliers.

K-Means (Mean)

  • Uses mean vector (may not be a real sample)
  • Sensitive to outliers
  • Faster computation
  • Works well for continuous data

K-Medoids (Actual Sample)

  • Uses actual data point (medoid)
  • Robust to outliers
  • Slower computation (O(n²k))
  • Works for any distance metric
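For illustration, here is a simplified k-medoids sketch that alternates assignment and medoid updates. It is a lightweight variant, not the full PAM swap search, and it assumes no cluster goes empty:

```python
import numpy as np

def kmedoids(X, k, max_iter=100, seed=0):
    """Simplified k-medoids: the cluster centers are always actual samples (medoids)."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    # Precompute pairwise Euclidean distances; any metric could be plugged in here.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)    # assign each sample to nearest medoid
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # New medoid: the member minimizing total distance to the rest of its cluster.
            new_idx[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):     # medoids unchanged: converged
            break
        medoid_idx = new_idx
    return X[medoid_idx], labels
```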