Master distance calculation methods including Minkowski distance, VDM for discrete attributes, and MinkovDM for mixed data types
Clustering is fundamentally about "grouping by distance". The choice of distance metric directly determines which samples are considered similar and how clusters are formed. A poor distance metric will lead to poor clustering results, regardless of the algorithm used.
Samples with small distances are grouped into the same cluster, while samples with large distances are placed in different clusters.
Small dist(xᵢ, xⱼ) → Same cluster
Large dist(xᵢ, xⱼ) → Different clusters
Different distance metrics can produce completely different clustering results on the same dataset
Continuous attributes, discrete attributes, and mixed data require different distance calculation methods
A valid distance metric must satisfy four fundamental properties. These properties ensure that distance behaves intuitively and mathematically consistently.
Mathematical Definition (Non-negativity): dist(xᵢ, xⱼ) ≥ 0
Distance between any two samples is always non-negative (zero or positive)
Example: Distance from New York to Boston is always >= 0
Mathematical Definition (Identity of Indiscernibles): dist(xᵢ, xⱼ) = 0 if and only if xᵢ = xⱼ
Distance is zero if and only if the two samples are identical
Example: A sample's distance to itself is always 0
Mathematical Definition (Symmetry): dist(xᵢ, xⱼ) = dist(xⱼ, xᵢ)
Distance from xᵢ to xⱼ equals distance from xⱼ to xᵢ
Example: Distance from A to B equals distance from B to A
Mathematical Definition (Triangle Inequality): dist(xᵢ, xⱼ) ≤ dist(xᵢ, xₖ) + dist(xₖ, xⱼ)
Direct distance is never longer than going through an intermediate point
Example: Direct route is shortest; detours always add distance
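These axioms can be spot-checked numerically. Below is a minimal Python sketch that verifies each property for Euclidean distance on random points; it is an illustrative spot-check, not a proof, and the helper name dist is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-dimensional samples

def dist(a, b):
    """Euclidean distance between two numeric vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

assert dist(x, y) >= 0                        # non-negativity
assert dist(x, x) == 0                        # identity: zero iff identical
assert np.isclose(dist(x, y), dist(y, x))     # symmetry
assert dist(x, y) <= dist(x, z) + dist(z, y)  # triangle inequality
print("All four axioms hold for this sample.")
```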
The Minkowski distance is a generalized distance metric for continuous attributes. It's a parameterized family of distances that includes Euclidean and Manhattan distances as special cases.
Formula: dist_mk(xᵢ, xⱼ) = (Σᵤ₌₁ⁿ |xᵢᵤ - xⱼᵤ|^p)^(1/p)
Where: n is the number of attributes, xᵢᵤ is the value of sample xᵢ on the u-th attribute, and p ≥ 1 is the order parameter.
Euclidean Distance (p = 2)
Formula: dist_ed(xᵢ, xⱼ) = √[Σᵤ₌₁ⁿ (xᵢᵤ - xⱼᵤ)²]
The most commonly used distance metric. Measures straight-line distance in n-dimensional space.
Best for: continuous data on comparable scales, where straight-line geometric distance is meaningful.
Manhattan Distance (p = 1)
Formula: dist_man(xᵢ, xⱼ) = Σᵤ₌₁ⁿ |xᵢᵤ - xⱼᵤ|
Also called L1 distance or taxicab distance. Sum of absolute differences along each dimension.
Best for: high-dimensional data or data with outliers, since a large difference on one attribute is not squared and therefore dominates less.
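Both special cases fall out of a single implementation. Here is a minimal Python sketch of the general Minkowski distance (the function name minkowski is an illustrative choice):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, p=2))  # 5.0: straight-line (Euclidean) distance
print(minkowski(a, b, p=1))  # 7.0: sum of legs (Manhattan) distance
```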
Consider two customers with features [Age, Income, Spending]:
Customer 1:
[28, 45000, 3200]
Customer 2:
[35, 65000, 4800]
Differences:
[7, 20000, 1600]
dist = √[(28-35)² + (45000-65000)² + (3200-4800)²]
dist = √[49 + 400,000,000 + 2,560,000]
dist = √402,560,049 ≈ 20,064
Note: In practice, features should be normalized to similar scales before calculating distance, otherwise large-scale features (like income) will dominate the distance calculation.
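The note above can be made concrete. The sketch below first reproduces the raw distance of about 20,064, then repeats the calculation after min-max scaling each feature to [0, 1]; the feature ranges used for scaling are illustrative assumptions, not part of the original example:

```python
import numpy as np

c1 = np.array([28, 45_000, 3_200], dtype=float)  # Customer 1
c2 = np.array([35, 65_000, 4_800], dtype=float)  # Customer 2

# Raw Euclidean distance: dominated almost entirely by income
raw = np.sqrt(np.sum((c1 - c2) ** 2))
print(f"{raw:.0f}")  # 20064

# Min-max scale each feature to [0, 1] using assumed feature ranges
lo = np.array([18, 20_000, 0], dtype=float)
hi = np.array([70, 120_000, 10_000], dtype=float)
s1, s2 = (c1 - lo) / (hi - lo), (c2 - lo) / (hi - lo)

# After scaling, all three features contribute comparably
scaled = np.sqrt(np.sum((s1 - s2) ** 2))
print(f"{scaled:.3f}")
```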
The VDM (Value Difference Metric) is designed for unordered discrete attributes (categorical attributes without inherent ordering), such as color, category, or brand name.
For discrete attributes like "color" (Red, Blue, Green), we can't use simple subtraction. Instead, VDM measures distance based on how differently the attribute values are distributed across clusters.
Key Insight:
Two attribute values are similar if they appear in similar proportions across different clusters. If "Red" and "Blue" both appear 30% in Cluster 1 and 70% in Cluster 2, they're considered similar.
Formula: VDM_p(a, b) = Σᵢ₌₁ᵏ |m_{u,a,i}/m_{u,a} - m_{u,b,i}/m_{u,b}|^p
Where: m_{u,a} is the number of samples that take value a on attribute u, m_{u,a,i} is the number of those samples that fall in cluster i, and k is the number of clusters.
Consider a "Color" attribute with values "Red" and "Blue" across 3 clusters:
Distribution:
| Value | Cluster 1 | Cluster 2 | Cluster 3 | Total |
|---|---|---|---|---|
| Red | 20 | 10 | 10 | 40 |
| Blue | 5 | 15 | 20 | 40 |
VDM_1(Red, Blue) = |20/40 - 5/40| + |10/40 - 15/40| + |10/40 - 20/40|
= |0.5 - 0.125| + |0.25 - 0.375| + |0.25 - 0.5|
= 0.375 + 0.125 + 0.25 = 0.75
Lower VDM values indicate more similar attribute values (similar cluster distributions).
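The hand calculation translates directly into code. A minimal sketch, assuming the per-cluster counts for each value are already known (the helper name vdm is illustrative):

```python
def vdm(p, counts_a, counts_b):
    """VDM_p between two attribute values, given per-cluster counts.

    counts_a[i] is the number of samples with value a in cluster i."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return sum(abs(ca / total_a - cb / total_b) ** p
               for ca, cb in zip(counts_a, counts_b))

red = [20, 10, 10]   # 40 "Red" samples across clusters 1-3
blue = [5, 15, 20]   # 40 "Blue" samples across clusters 1-3
print(vdm(1, red, blue))  # 0.75, matching the hand calculation
```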
The MinkovDM distance combines Minkowski distance for continuous attributes and VDM distance for discrete attributes, enabling distance calculation for mixed attribute data.
Formula: MinkovDM_p(xᵢ, xⱼ) = [Σ_{u=1}^{n_c} |xᵢᵤ - xⱼᵤ|^p + Σ_{u=n_c+1}^{n} VDM_p(xᵢᵤ, xⱼᵤ)]^(1/p)
Where: the attributes are ordered so that the first n_c are continuous and the remaining n - n_c are unordered discrete; the Minkowski term covers the continuous attributes and the VDM term covers the discrete ones.
Consider customer data with both continuous and discrete attributes:
| ID | Age (cont.) | Income (cont.) | Color (disc.) | Spending (cont.) |
|---|---|---|---|---|
| 1 | 28 | $45,000 | Blue | $3,200 |
| 2 | 35 | $65,000 | Red | $4,800 |
| 3 | 22 | $28,000 | Blue | $1,200 |
Distance Calculation:
For Customer 1 vs Customer 2 (p=2, Euclidean):
Continuous part: (28-35)² + (45000-65000)² + (3200-4800)²
Discrete part: VDM_2(Blue, Red)
MinkovDM = [Continuous part + Discrete part]^(1/2)
Note that the VDM term enters the sum directly, without squaring: VDM_p already sums p-th powers, so it sits at the same power level as the continuous differences before the final root is taken.
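Putting the pieces together, here is a minimal Python sketch of the mixed-attribute calculation; the min-max scaling ranges are the same illustrative assumptions used earlier, and the helper name minkov_dm is an assumption:

```python
def minkov_dm(p, cont_i, cont_j, vdm_terms):
    """MinkovDM_p: continuous attribute differences plus precomputed
    VDM_p terms for each discrete attribute, combined under one root."""
    cont_part = sum(abs(a - b) ** p for a, b in zip(cont_i, cont_j))
    return (cont_part + sum(vdm_terms)) ** (1 / p)

# VDM_2(Blue, Red) from the per-cluster counts in the VDM example
red, blue = [20, 10, 10], [5, 15, 20]
vdm2 = sum(abs(r / 40 - b / 40) ** 2 for r, b in zip(red, blue))

# Continuous features of Customers 1 and 2, min-max scaled to [0, 1]
# using the same assumed feature ranges as the earlier sketch
c1 = [(28 - 18) / 52, (45_000 - 20_000) / 100_000, 3_200 / 10_000]
c2 = [(35 - 18) / 52, (65_000 - 20_000) / 100_000, 4_800 / 10_000]

print(minkov_dm(2, c1, c2, [vdm2]))  # mixed-attribute distance
```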
Continuous (ordered) attributes → Use Minkowski distance
Unordered discrete attributes → Use VDM distance
Mixed continuous and discrete attributes → Use MinkovDM