
Metric Learning: Customizing a Precise Distance Ruler for Your Data

How to learn task-specific distance metrics that make kNN smarter

2026-01-23
Metric Learning
Mahalanobis Distance
kNN
Distance Metrics
NCA

The Hook: When the Standard Ruler Gets It Wrong

Imagine you walk into a furniture store looking for a desk. The salesperson pulls out a "universal ruler" with only one scale: centimeters. It measures everything the same way. You need a desk that "fits a laptop," but this ruler only tells you "80cm long, 60cm wide" — it won't tell you that for your specific need, width matters way more than length (since laptops take up space mainly sideways).

The result? Going by "the longest desk wins," the salesperson recommends a narrow desk that's "100cm long, 40cm wide." Meanwhile, the desk you actually need, "70cm long, 70cm wide" (wide enough for the laptop), gets overlooked. Why? Because the "universal ruler" treats length and width as equally important, but for your task, they're not.

That's exactly the problem Metric Learning solves.

In machine learning, algorithms often rely on "calculating distance and finding neighbors" to classify samples (e.g., kNN). But off-the-shelf distance metrics (like Euclidean distance) are like that "universal ruler" — they treat all dimensions (features) as equally important, often getting the measurement wrong: samples that should be close end up far apart, and vice versa.

The core of metric learning: Customize a precise "distance ruler" for your specific task, giving important dimensions high weights (amplifying differences) and unimportant dimensions low weights (shrinking differences), so the calculated distance truly reflects sample similarity.

The Core Problem: Why Universal Distance Metrics "Miss the Mark"

Let's start with a concrete scenario: Determining if two fruits are the same type of apple.

You have these features:

  • Color: Red = 1, Green = 0
  • Size: Diameter (cm)
  • Weight: Grams

Now consider two apples:

  • Apple A: Red (1), 8cm diameter, 150g
  • Apple B: Green (0), 8cm diameter, 150g

Using Euclidean distance:

d = \sqrt{(1-0)^2 + (8-8)^2 + (150-150)^2} = 1

Distance is only 1 — they seem "close," like they're "the same type."

The Euclidean distance problem: it treats all dimensions equally, so color differs by 1, size by 0, weight by 0, and the total distance is just 1. But for the task "determining apple variety," color deserves a far larger weight than size or weight (say 10 versus 0.1 each), so the true distance should be:

d_{\text{true}} = \sqrt{10 \times (1-0)^2 + 0.1 \times (8-8)^2 + 0.1 \times (150-150)^2} \approx 3.16

Metric learning's goal: Automatically learn these weights (color × 10, size × 0.1, weight × 0.1) so the calculated distance reflects the task's true requirements.
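Here is a quick numeric check of both calculations, as a minimal numpy sketch (the weights 10 / 0.1 / 0.1 are the hand-picked illustrative ones from above, not learned):

```python
import numpy as np

# Feature vectors: [color (red=1, green=0), diameter (cm), weight (g)]
apple_a = np.array([1.0, 8.0, 150.0])   # red apple
apple_b = np.array([0.0, 8.0, 150.0])   # green apple

diff = apple_a - apple_b

# Plain Euclidean distance: every dimension counts equally.
d_euclidean = np.sqrt(np.sum(diff ** 2))

# Weighted distance with hand-picked weights: color matters a lot,
# size and weight barely matter (illustrative weights, not learned).
weights = np.array([10.0, 0.1, 0.1])
d_weighted = np.sqrt(np.sum(weights * diff ** 2))

print(d_euclidean)  # 1.0
print(d_weighted)   # ~3.16
```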

Metric Learning's Core Tool: Mahalanobis Distance + Metric Matrix M

Metric learning doesn't invent distance from scratch. It builds on Mahalanobis Distance, using a "metric matrix M" to weight different dimensions.

What is Mahalanobis Distance?

The Mahalanobis distance formula is:

d_{\text{Mah}}(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}
  • x_i, x_j: Two samples (e.g., feature vectors of two apples).
  • (x_i - x_j): Their difference across all dimensions.
  • M: The metric matrix, our "weight adjuster."

Core Logic

  • (x_i - x_j)^T M (x_i - x_j) effectively multiplies each dimension's difference by its corresponding weight in M.
  • Large values in M amplify differences in those dimensions (e.g., color).
  • Small values in M shrink differences in those dimensions (e.g., size, weight).
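In code, the formula is a one-liner. A minimal numpy sketch (the function name is just for illustration), assuming M is supplied as a symmetric positive semi-definite matrix:

```python
import numpy as np

def mahalanobis_distance(x_i, x_j, M):
    """sqrt((x_i - x_j)^T M (x_i - x_j)) for a symmetric PSD metric matrix M."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sqrt(diff @ M @ diff))
```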

How Does M "Adjust Weights"?

Simple example: For 3 dimensions (color, size, weight), M might look like:

M = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.1 \end{bmatrix}

This M means:

  • Color dimension: Weight = 10 (differences amplified 10×).
  • Size dimension: Weight = 0.1 (differences shrunk).
  • Weight dimension: Weight = 0.1 (differences shrunk).

The resulting distance "emphasizes color, de-emphasizes size and weight," matching the need for "determining apple variety."
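Plugging this diagonal M and the two apples from earlier into the formula (a toy check with the same hand-picked weights):

```python
import numpy as np

M = np.diag([10.0, 0.1, 0.1])           # color weighted up, size/weight down

apple_a = np.array([1.0, 8.0, 150.0])   # red apple
apple_b = np.array([0.0, 8.0, 150.0])   # green apple
diff = apple_a - apple_b

print(np.sqrt(diff @ M @ diff))         # ~3.16, vs. 1.0 under plain Euclidean
```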

How to Learn M? Making Distance "Serve the Task"

Metric learning's core question: How to automatically learn M to fit the current task's needs?

Two common approaches:

Method 1: Constraint-Based Learning — "Must-Link / Cannot-Link"

If you have prior knowledge, such as:

  • Must-Link constraints: Samples A and B are definitely the same class (e.g., two red apples).
  • Cannot-Link constraints: Samples C and D are definitely different classes (e.g., red apple and green apple).

Then M's learning objective is:

  • Minimize distance for must-link pairs: d_{\text{Mah}}(A, B) \to 0
  • Maximize distance for cannot-link pairs: d_{\text{Mah}}(C, D) \to \infty

The optimization objective might look like:

\min_M \sum_{\text{must-link}} d_{\text{Mah}}(x_i, x_j)^2 - \sum_{\text{cannot-link}} d_{\text{Mah}}(x_i, x_j)^2
  • Left term: Sum of squared distances for must-link pairs (smaller is better).
  • Right term: Sum of squared distances for cannot-link pairs (larger is better, hence the minus sign).

By optimizing this objective, M is learned automatically — for instance, if "color dimension distinguishes must-link from cannot-link," it gets high weight.
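Below is a toy gradient-descent sketch of this objective, under two simplifying assumptions: M is parameterized as M = AᵀA so it stays positive semi-definite, and A is renormalized every step so the cannot-link term cannot push the objective to minus infinity. Real constraint-based methods use more careful constraints, but the pull-together / push-apart mechanics are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: dimension 0 separates the classes, dimension 1 is noise.
must_link = [(np.array([1.0, 0.3]), np.array([1.1, 0.9])),
             (np.array([0.0, 0.8]), np.array([0.1, 0.2]))]
cannot_link = [(np.array([1.0, 0.5]), np.array([0.0, 0.5])),
               (np.array([1.1, 0.1]), np.array([0.1, 0.2]))]

def scatter(pairs):
    """Sum of outer products (x_i - x_j)(x_i - x_j)^T over a set of pairs."""
    return sum(np.outer(a - b, a - b) for a, b in pairs)

C = scatter(must_link) - scatter(cannot_link)   # must-link minus cannot-link

A = rng.normal(size=(2, 2))       # parameterize M = A^T A (always PSD)
lr = 0.1
for _ in range(200):
    grad = 2.0 * A @ C            # gradient of trace(A C A^T) w.r.t. A
    A -= lr * grad                # pull must-links together, push cannot-links apart
    A /= np.linalg.norm(A)        # crude normalization keeps the objective bounded

M = A.T @ A
print(np.round(M, 2))             # most of the weight lands on dimension 0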

Method 2: Task-Based Learning — NCA (Neighborhood Component Analysis)

A more direct approach: Directly optimize for downstream task performance.

For example, NCA aims to learn M that maximizes kNN classification accuracy.

The logic:

  1. Use the learned distance (Mahalanobis) for kNN.
  2. Calculate a probability: Each sample x_i is "voted for" by its neighbor x_j with probability depending on their distance — closer = higher probability.
  3. Optimization goal: Maximize the probability that each sample is voted for by "same-class neighbors" (so kNN voting is less likely to be wrong).

Mathematically, NCA maximizes:

\max_M \sum_i \sum_{j: y_j = y_i} p_{ij}
  • p_{ij}: Probability that sample i is "voted for" by sample j, inversely related to their Mahalanobis distance.
  • y_j = y_i: Only consider same-class samples.

Result: The learned M makes same-class samples close and different-class samples far, so kNN naturally achieves higher classification accuracy using this distance.
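scikit-learn ships an NCA implementation (NeighborhoodComponentsAnalysis) that learns a linear transformation L, which corresponds to a metric matrix M = LᵀL. A minimal sketch of putting it in front of kNN, with the Iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN on raw features (plain Euclidean distance).
knn_plain = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# kNN after NCA: Euclidean distance on the transformed features equals
# the Mahalanobis distance with M = L^T L in the original space.
knn_nca = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
]).fit(X_train, y_train)

print("plain kNN:", knn_plain.score(X_test, y_test))
print("NCA + kNN:", knn_nca.score(X_test, y_test))
```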

Formula Intuition: M's "Behavior"

Back to the Mahalanobis distance formula:

d_{\text{Mah}}(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}

What is M actually doing?

Think of M as a "dimension magnifying glass":

  • If M is a diagonal matrix (only diagonal entries are non-zero), it assigns independent weights to each dimension — e.g., M = \text{diag}([10, 0.1, 0.1]) means "color × 10, size × 0.1, weight × 0.1."
  • If M is a general matrix (non-diagonal entries also present), it not only adjusts individual dimension weights but also captures "correlations between dimensions" — e.g., "combination of color and size" might be more important than color alone.

Special Cases

  • If M = I (identity matrix), Mahalanobis distance degenerates to Euclidean distance — all dimensions equally weighted.
  • If M has low rank (rank(M) is small), it effectively performs "dimensionality reduction" — setting unimportant dimension weights to 0, keeping only important ones.
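A quick numeric check of both special cases (made-up vectors): with M = I the result matches plain Euclidean distance, and a low-rank M = LᵀL built from a projection L equals Euclidean distance computed after projecting.

```python
import numpy as np

x_i = np.array([1.0, 8.0, 150.0])
x_j = np.array([0.0, 7.0, 140.0])
diff = x_i - x_j

# Special case 1: M = I reduces to Euclidean distance.
M_identity = np.eye(3)
print(np.sqrt(diff @ M_identity @ diff), np.linalg.norm(diff))      # same value

# Special case 2: low-rank M = L^T L, with L projecting 3-D down to 1-D,
# equals Euclidean distance computed after the projection.
L = np.array([[1.0, 0.0, 0.0]])         # keep only the first dimension
M_low_rank = L.T @ L                    # rank-1 metric matrix
print(np.sqrt(diff @ M_low_rank @ diff), np.linalg.norm(L @ diff))  # same value
```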

Metric Learning vs. Dimensionality Reduction: Different Optimizations

At this point, you might wonder: how does metric learning differ from dimensionality reduction methods like PCA or manifold learning? The short answer: dimensionality reduction changes the feature space itself (fewer dimensions), while metric learning keeps the features and changes how distance is measured between them.

Example:

  • PCA: Reduce 100-dimensional data to 10 dimensions, then use Euclidean distance in 10D space.
  • Metric learning: Still use 100-dimensional data, but learn an M that makes the distance reflect task requirements (e.g., high weights for important dimensions).

They can be combined:

  • First use PCA for dimensionality reduction (reduce computation).
  • Then use metric learning in the low-dimensional space to learn a precise distance (improve downstream task performance).
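A sketch of that combination, again leaning on scikit-learn, with PCA as the reduction step and NCA standing in for the metric-learning step; the dataset and component counts are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)   # 64-dimensional pixel features

pipeline = Pipeline([
    ("pca", PCA(n_components=30, random_state=0)),     # cut computation first
    ("nca", NeighborhoodComponentsAnalysis(n_components=10, random_state=0)),  # learn the metric
    ("knn", KNeighborsClassifier(n_neighbors=3)),       # distance-based classifier
])

print(cross_val_score(pipeline, X, y, cv=5).mean())
```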

Why Metric Learning Matters: Helping kNN Find the Right Neighbors

The ultimate goal of metric learning is to serve downstream algorithms that rely on distance (e.g., kNN, clustering).

The Problem:

  • kNN's core is "finding the k nearest neighbors," but using universal distance (Euclidean) often "finds the wrong neighbors" — e.g., treating different-class samples as neighbors (because they happen to be close on unimportant dimensions), leading to misclassification.

Metric Learning's Solution:

  • Learn a task-specific M that makes same-class samples close and different-class samples far.
  • When kNN searches for neighbors using this distance, it's less likely to be misled by "unimportant dimensions," naturally improving classification accuracy.
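To see the effect, here is a toy sketch with made-up 2-D points and a hand-picked "learned" metric: the class-relevant dimension is small in scale, the irrelevant one is large, and down-weighting the irrelevant one changes which neighbors kNN finds.

```python
import numpy as np

# Dimension 0 (e.g., color) decides the class; dimension 1 (e.g., weight in
# grams) is irrelevant but numerically large, so it dominates Euclidean distance.
X = np.array([[1.0, 200.0],   # class A
              [1.1, 320.0],   # class A
              [0.0, 255.0],   # class B
              [0.1, 248.0]])  # class B
labels = np.array(["A", "A", "B", "B"])
query = np.array([1.0, 250.0])          # truly a class-A point

def k_nearest(query, X, M, k=2):
    """Indices of the k samples closest to `query` under metric matrix M."""
    d2 = [(x - query) @ M @ (x - query) for x in X]
    return np.argsort(d2)[:k]

M_euclidean = np.eye(2)
M_learned = np.diag([10.0, 1e-4])       # stand-in for a learned metric

print(labels[k_nearest(query, X, M_euclidean)])  # ['B' 'B']: wrong neighbors
print(labels[k_nearest(query, X, M_learned)])    # ['A' 'A']: right neighbors
```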

Classic Use Cases:

  • Face Recognition: The same face under different lighting or angles might have large Euclidean distance in pixel space, but metric learning can learn an M that makes "faces of the same person" close.
  • Recommendation Systems: Similarity between two users shouldn't be based only on raw browsing counts (what an unweighted Euclidean distance over the raw features measures), but on the types of items browsed (e.g., both like tech products). Metric learning can learn this weighting.

Key Takeaways

  1. The core problem metric learning solves: Off-the-shelf distances (Euclidean) treat all dimensions equally, often "measuring incorrectly" the true similarity in data. Metric learning learns a metric matrix M to weight dimensions differently, creating a "task-specific distance ruler."
  2. Core tool: Mahalanobis distance + metric matrix M: In the formula d = \sqrt{(x_i - x_j)^T M (x_i - x_j)}, M acts like a "dimension weight adjuster" — large values amplify differences, small values shrink them.
  3. Two approaches to learning M:
    • Constraint-based: Given must-link/cannot-link constraints, minimize distance for must-link pairs, maximize for cannot-link pairs.
    • Task-based (e.g., NCA): Directly optimize for kNN classification accuracy, learning M that makes same-class samples close and different-class samples far.
  4. Formula behavior: M is a "dimension magnifying glass." If M is diagonal, it independently weights each dimension. If M is general, it also captures inter-dimension correlations. When M = identity, Mahalanobis degenerates to Euclidean.
  5. Difference from dimensionality reduction: Dimensionality reduction "reduces dimension count," metric learning "optimizes distance calculation." They can be combined: first reduce dimensions, then use metric learning to improve distance precision.

One-Liner

Metric learning is like customizing a precise distance ruler for your data: instead of treating every dimension as equally important (the "universal ruler" of Euclidean distance), it gives key dimensions high weights and redundant ones low weights (the "custom ruler" of the learned Mahalanobis distance), so samples that should be close are close and those that should be far are far. When the ruler is accurate, kNN won't misjudge the neighbors.

Ready to master machine learning fundamentals?

Explore our comprehensive course on machine learning techniques, from distance metrics to advanced algorithms. Build a solid foundation in understanding how to measure similarity and improve model performance.
