Master the most popular clustering algorithm. Learn how k-means groups data into k clusters by iteratively updating cluster centers to minimize within-cluster variance.
K-means is the most widely used clustering algorithm. It partitions data into k clusters by minimizing the sum of squared distances between samples and their assigned cluster centers. The algorithm uses the mean vector of each cluster as the prototype.
Formally, k-means minimizes the within-cluster sum of squared errors:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Where:
- $C_i$ is the set of samples assigned to cluster $i$
- $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the mean vector (prototype) of cluster $i$
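As a quick illustration, here is a minimal NumPy sketch that evaluates this objective for a given assignment (the helper name `wcss` is ours, not from the text):

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squared distances (the k-means objective E)."""
    return sum(
        np.sum((X[labels == i] - centers[i]) ** 2)
        for i in range(len(centers))
    )
```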
The k-means algorithm follows a simple iterative process (a minimal code sketch follows the steps):

1. Initialization: randomly select k samples from the dataset as the initial cluster centers (prototypes).
   Details: each center represents one cluster. Common initializations are random selection and the k-means++ method.
2. Assignment: for each sample, calculate its distance to all k cluster centers and assign it to the nearest one.
   Details: this creates the current cluster partition. Each sample belongs to exactly one cluster (hard clustering).
3. Update: for each cluster, calculate the mean vector of all samples assigned to it. This mean becomes the new cluster center.
   Details: the mean vector minimizes the sum of squared distances within the cluster (the centroid property).
4. Convergence check: if the cluster centers haven't changed (or have changed by less than a threshold), stop. Otherwise, return to step 2.
   Details: convergence is guaranteed (the algorithm always terminates), but it may stop at a local optimum.
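To make the loop concrete, here is a minimal NumPy sketch of the four steps (the function name and defaults such as `max_iter` and `tol` are our own choices, not from the text):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # Step 1: randomly pick k samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned samples
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop once the centers stop moving
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels
```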
The choice of initial cluster centers significantly affects k-means results. Poor initialization can lead to local optima or slow convergence.
Random initialization: randomly select k samples from the dataset as the initial centers. This is simple but can produce poor results.

Problems:
- The outcome depends heavily on the random draw; different runs can give very different clusterings.
- Two or more initial centers may land close together or inside the same natural cluster, leading to a poor local optimum.
- A bad starting configuration can slow convergence.
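One common way to soften these problems (a sketch reusing the hypothetical `kmeans` and `wcss` helpers defined above) is to run the algorithm several times with different seeds and keep the lowest-WCSS result:

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run k-means repeatedly with different random initializations
    and keep the clustering with the lowest WCSS."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(X, k, rng=seed)
        score = wcss(X, labels, centers)
        if best is None or score < best[0]:
            best = (score, centers, labels)
    return best
```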
K-means++ is the recommended initialization method. It chooses initial centers that are far apart, which improves both convergence and final cluster quality.

Algorithm:
1. Choose the first center uniformly at random from the dataset.
2. For every sample x, compute D(x), the distance from x to the nearest center chosen so far.
3. Choose the next center from the dataset with probability proportional to D(x)².
4. Repeat steps 2-3 until k centers have been chosen, then run standard k-means.

Advantage: the centers are well spread out, reducing the chance of poor local optima, and k-means++ typically requires fewer iterations to converge.
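A minimal NumPy sketch of this seeding rule (the function name is ours; scikit-learn applies the same idea by default via init='k-means++'):

```python
import numpy as np

def kmeans_plus_plus_init(X, k, rng=None):
    """Pick k initial centers using the k-means++ seeding rule."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from every sample to its nearest chosen center
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Sample the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```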
K-means requires specifying k (number of clusters) beforehand. The elbow method is a common technique for choosing k.
Plot the within-cluster sum of squares (WCSS) or inertia against k. Look for the "elbow" - the point where increasing k no longer significantly reduces WCSS.
As k increases, WCSS decreases. The elbow is where the rate of decrease slows significantly.
| k | WCSS | Reduction vs. previous k | Interpretation |
|---|---|---|---|
| 1 | 10,000 | n/a | All samples in one cluster |
| 2 | 3,500 | 6,500 | Large decrease |
| 3 | 1,200 | 2,300 | Large decrease |
| 4 | 800 | 400 | Small decrease: elbow point, optimal k |
| 5 | 600 | 200 | Small decrease |
In this example, k=4 is the optimal choice as increasing k beyond 4 provides diminishing returns.
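The same curve is easy to produce with scikit-learn, which exposes the WCSS as `inertia_`. In this sketch the `make_blobs` data merely stands in for a real dataset:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 natural clusters, standing in for real data
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

ks = range(1, 11)
wcss_values = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

plt.plot(list(ks), wcss_values, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```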
Apply k-means to segment 200 e-commerce customers based on their behavior. A sample of the data:
| ID | Age | Income | Spending | Visits | Cluster |
|---|---|---|---|---|---|
| 1 | 28 | $45,000 | $3,200 | 12 | Cluster 1 |
| 2 | 45 | $85,000 | $8,500 | 8 | Cluster 2 |
| 3 | 22 | $28,000 | $1,200 | 15 | Cluster 1 |
| 4 | 52 | $120,000 | $12,000 | 6 | Cluster 2 |
| 5 | 35 | $65,000 | $4,800 | 10 | Cluster 1 |
| 6 | 19 | $22,000 | $800 | 18 | Cluster 1 |
| 7 | 48 | $95,000 | $9,800 | 7 | Cluster 2 |
| 8 | 31 | $55,000 | $3,800 | 11 | Cluster 1 |
After k-means clustering (k=2), the customers fall into two segments based on their purchasing behavior: Cluster 1 contains younger customers with lower income and spending but more frequent visits, while Cluster 2 contains older, higher-income customers who spend more but visit less often.
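A sketch of how such a segmentation might be produced with scikit-learn, using the eight customers shown above (the feature values are taken from the table; the 0/1 labels returned by KMeans will not necessarily match the table's Cluster 1/2 numbering):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# [age, income, spending, visits] for the eight customers in the table
customers = np.array([
    [28,  45_000,  3_200, 12],
    [45,  85_000,  8_500,  8],
    [22,  28_000,  1_200, 15],
    [52, 120_000, 12_000,  6],
    [35,  65_000,  4_800, 10],
    [19,  22_000,    800, 18],
    [48,  95_000,  9_800,  7],
    [31,  55_000,  3_800, 11],
])

# Standardize so that income does not dominate the distance calculation
X = StandardScaler().fit_transform(customers)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # two segments: the low-spend and high-spend customers
```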
K-medoids (also called PAM - Partitioning Around Medoids) is a variant of k-means that uses actual data points (medoids) instead of mean vectors as cluster centers. This makes it more robust to outliers.
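A minimal sketch of just the medoid update step (not the full PAM swap procedure), assuming the same NumPy conventions as the earlier sketches:

```python
import numpy as np

def update_medoids(X, labels, k):
    """For each cluster, pick the member that minimizes the total distance
    to the other members of that cluster (a simplified medoid update)."""
    medoids = []
    for i in range(k):
        members = X[labels == i]
        # Pairwise distances within the cluster
        d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
        medoids.append(members[d.sum(axis=1).argmin()])
    return np.array(medoids)
```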