Master distance calculation methods including Minkowski distance, VDM for discrete attributes, and MinkovDM for mixed data types
Clustering is fundamentally about "grouping by distance". The choice of distance metric directly determines which samples are considered similar and how clusters are formed. A poor distance metric will lead to poor clustering results, regardless of the algorithm used.
Samples with small distances are grouped into the same cluster, while samples with large distances are placed in different clusters.
Small dist(xᵢ, xⱼ) → Same cluster
Large dist(xᵢ, xⱼ) → Different clusters
Different distance metrics can produce completely different clustering results on the same dataset
Continuous attributes, discrete attributes, and mixed data require different distance calculation methods
A valid distance metric must satisfy four fundamental properties. These properties ensure that distance behaves intuitively and mathematically consistently.
Mathematical Definition (Non-negativity): dist(xᵢ, xⱼ) ≥ 0
Distance between any two samples is always non-negative (zero or positive)
Example: Distance from New York to Boston is always >= 0
Mathematical Definition (Identity of Indiscernibles): dist(xᵢ, xⱼ) = 0 if and only if xᵢ = xⱼ
Distance is zero if and only if the two samples are identical
Example: A sample's distance to itself is always 0
Mathematical Definition (Symmetry): dist(xᵢ, xⱼ) = dist(xⱼ, xᵢ)
Distance from xᵢ to xⱼ equals distance from xⱼ to xᵢ
Example: Distance from A to B equals distance from B to A
Mathematical Definition (Triangle Inequality): dist(xᵢ, xⱼ) ≤ dist(xᵢ, xₖ) + dist(xₖ, xⱼ)
Direct distance is never longer than going through an intermediate point
Example: Direct route is shortest; detours always add distance
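These axioms can be spot-checked numerically. Below is a minimal Python sketch that verifies each property for Euclidean distance on random points; it is an illustrative spot-check, not a proof, and the helper name dist is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-dimensional samples

def dist(a, b):
    """Euclidean distance between two numeric vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

assert dist(x, y) >= 0                        # non-negativity
assert dist(x, x) == 0                        # identity: zero iff identical
assert np.isclose(dist(x, y), dist(y, x))     # symmetry
assert dist(x, y) <= dist(x, z) + dist(z, y)  # triangle inequality
print("All four axioms hold for this sample.")
```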
The Minkowski distance is a generalized distance metric for continuous attributes. It's a parameterized family of distances that includes Euclidean and Manhattan distances as special cases.
Formula: dist_mk(xᵢ, xⱼ) = (Σᵤ₌₁ⁿ |xᵢᵤ - xⱼᵤ|^p)^(1/p)
Where: n is the number of attributes, xᵢᵤ is the value of sample xᵢ on the u-th attribute, and p ≥ 1 is the order parameter.
Euclidean Distance (p = 2)
Formula: dist_ed(xᵢ, xⱼ) = √[Σᵤ₌₁ⁿ (xᵢᵤ - xⱼᵤ)²]
The most commonly used distance metric. Measures straight-line distance in n-dimensional space.
Best for: continuous data on comparable scales, where straight-line geometric distance is meaningful.
Manhattan Distance (p = 1)
Formula: dist_man(xᵢ, xⱼ) = Σᵤ₌₁ⁿ |xᵢᵤ - xⱼᵤ|
Also called L1 distance or taxicab distance. Sum of absolute differences along each dimension.
Best for: high-dimensional data or data with outliers, since a large difference on one attribute is not squared and therefore dominates less.
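Both special cases fall out of a single implementation. Here is a minimal Python sketch of the general Minkowski distance (the function name minkowski is an illustrative choice):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, p=2))  # 5.0: straight-line (Euclidean) distance
print(minkowski(a, b, p=1))  # 7.0: sum of legs (Manhattan) distance
```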
Consider two customers with features [Age, Income, Spending]:
Customer 1:
[28, 45000, 3200]
Customer 2:
[35, 65000, 4800]
Differences:
[7, 20000, 1600]
dist = √[(28-35)² + (45000-65000)² + (3200-4800)²]
dist = √[49 + 400,000,000 + 2,560,000]
dist = √402,560,049 ≈ 20,064
Note: In practice, features should be normalized to similar scales before calculating distance, otherwise large-scale features (like income) will dominate the distance calculation.
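The note above can be made concrete. The sketch below first reproduces the raw distance of about 20,064, then repeats the calculation after min-max scaling each feature to [0, 1]; the feature ranges used for scaling are illustrative assumptions, not part of the original example:

```python
import numpy as np

c1 = np.array([28, 45_000, 3_200], dtype=float)  # Customer 1
c2 = np.array([35, 65_000, 4_800], dtype=float)  # Customer 2

# Raw Euclidean distance: dominated almost entirely by income
raw = np.sqrt(np.sum((c1 - c2) ** 2))
print(f"{raw:.0f}")  # 20064

# Min-max scale each feature to [0, 1] using assumed feature ranges
lo = np.array([18, 20_000, 0], dtype=float)
hi = np.array([70, 120_000, 10_000], dtype=float)
s1, s2 = (c1 - lo) / (hi - lo), (c2 - lo) / (hi - lo)

# After scaling, all three features contribute comparably
scaled = np.sqrt(np.sum((s1 - s2) ** 2))
print(f"{scaled:.3f}")
```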
The VDM (Value Difference Metric) is designed for unordered discrete attributes (categorical attributes without inherent ordering), such as color, category, or brand name.
For discrete attributes like "color" (Red, Blue, Green), we can't use simple subtraction. Instead, VDM measures distance based on how differently the attribute values are distributed across clusters.
Key Insight:
Two attribute values are similar if they appear in similar proportions across different clusters. If "Red" and "Blue" both appear 30% in Cluster 1 and 70% in Cluster 2, they're considered similar.
Formula: VDM_p(a, b) = Σᵢ₌₁ᵏ |m_{u,a,i}/m_{u,a} - m_{u,b,i}/m_{u,b}|^p
Where: m_{u,a} is the number of samples that take value a on attribute u, m_{u,a,i} is the number of those samples that fall in cluster i, and k is the number of clusters.
Consider a "Color" attribute with values "Red" and "Blue" across 3 clusters:
Distribution:
| Value | Cluster 1 | Cluster 2 | Cluster 3 | Total |
|---|---|---|---|---|
| Red | 20 | 10 | 10 | 40 |
| Blue | 5 | 15 | 20 | 40 |
VDM_1(Red, Blue) = |20/40 - 5/40| + |10/40 - 15/40| + |10/40 - 20/40|
= |0.5 - 0.125| + |0.25 - 0.375| + |0.25 - 0.5|
= 0.375 + 0.125 + 0.25 = 0.75
Lower VDM values indicate more similar attribute values (similar cluster distributions).
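The hand calculation translates directly into code. A minimal sketch, assuming the per-cluster counts for each value are already known (the helper name vdm is illustrative):

```python
def vdm(p, counts_a, counts_b):
    """VDM_p between two attribute values, given per-cluster counts.

    counts_a[i] is the number of samples with value a in cluster i."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return sum(abs(ca / total_a - cb / total_b) ** p
               for ca, cb in zip(counts_a, counts_b))

red = [20, 10, 10]   # 40 "Red" samples across clusters 1-3
blue = [5, 15, 20]   # 40 "Blue" samples across clusters 1-3
print(vdm(1, red, blue))  # 0.75, matching the hand calculation
```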
The MinkovDM distance combines Minkowski distance for continuous attributes and VDM distance for discrete attributes, enabling distance calculation for mixed attribute data.
Formula: MinkovDM_p(xᵢ, xⱼ) = [Σ_{u=1}^{n_c} |xᵢᵤ - xⱼᵤ|^p + Σ_{u=n_c+1}^{n} VDM_p(xᵢᵤ, xⱼᵤ)]^(1/p)
Where: the attributes are ordered so that the first n_c are continuous and the remaining n - n_c are unordered discrete; the Minkowski term covers the continuous attributes and the VDM term covers the discrete ones.
Consider customer data with both continuous and discrete attributes:
| ID | Age (cont.) | Income (cont.) | Color (disc.) | Spending (cont.) |
|---|---|---|---|---|
| 1 | 28 | $45,000 | Blue | $3,200 |
| 2 | 35 | $65,000 | Red | $4,800 |
| 3 | 22 | $28,000 | Blue | $1,200 |
Distance Calculation:
For Customer 1 vs Customer 2 (p=2, Euclidean):
Continuous part: (28-35)² + (45000-65000)² + (3200-4800)²
Discrete part: VDM_2(Blue, Red)
MinkovDM = [Continuous part + Discrete part]^(1/2)
Note that the VDM term enters the sum directly, without squaring: VDM_p already sums p-th powers, so it sits at the same power level as the continuous differences before the final root is taken.
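Putting the pieces together, here is a minimal Python sketch of the mixed-attribute calculation; the min-max scaling ranges are the same illustrative assumptions used earlier, and the helper name minkov_dm is an assumption:

```python
def minkov_dm(p, cont_i, cont_j, vdm_terms):
    """MinkovDM_p: continuous attribute differences plus precomputed
    VDM_p terms for each discrete attribute, combined under one root."""
    cont_part = sum(abs(a - b) ** p for a, b in zip(cont_i, cont_j))
    return (cont_part + sum(vdm_terms)) ** (1 / p)

# VDM_2(Blue, Red) from the per-cluster counts in the VDM example
red, blue = [20, 10, 10], [5, 15, 20]
vdm2 = sum(abs(r / 40 - b / 40) ** 2 for r, b in zip(red, blue))

# Continuous features of Customers 1 and 2, min-max scaled to [0, 1]
# using the same assumed feature ranges as the earlier sketch
c1 = [(28 - 18) / 52, (45_000 - 20_000) / 100_000, 3_200 / 10_000]
c2 = [(35 - 18) / 52, (65_000 - 20_000) / 100_000, 4_800 / 10_000]

print(minkov_dm(2, c1, c2, [vdm2]))  # mixed-attribute distance
```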
Continuous (ordered) attributes → Use Minkowski distance
Unordered discrete attributes → Use VDM distance
Mixed continuous and discrete attributes → Use MinkovDM