Metric Learning

Elevate from dimensionality reduction to learning optimal distance metrics. Understand how to learn metrics that make similar samples closer and dissimilar samples farther apart.

Module 9 of 9
Advanced Level
100-120 min

What is Metric Learning?

Metric learning goes beyond dimensionality reduction by directly learning an optimal distance metric for a specific task. Instead of finding a low-dimensional space, metric learning learns a distance function that makes similar samples close and dissimilar samples far apart.

Key Insight

Every low-dimensional space corresponds to a distance metric. Metric learning skips the dimensionality reduction step and directly learns the optimal metric, which can incorporate supervised information and domain knowledge.

Evolution of Distance Metrics

Distance metrics have evolved from simple to sophisticated:

1. Euclidean Distance

dist_{ED}^2(x_i, x_j) = \|x_i - x_j\|_2^2 = \sum_{u=1}^d (x_{iu} - x_{ju})^2

All features weighted equally. Simple but ignores feature importance and correlations.

2. Weighted Euclidean Distance

dist_{WED}^2(x_i, x_j) = \sum_{u=1}^d w_u (x_{iu} - x_{ju})^2 = (x_i - x_j)^T W (x_i - x_j)

A diagonal weight matrix W = diag(w_1, \dots, w_d) allows features to have different importance, but still ignores correlations between features. A minimal sketch of both distances follows below.
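A minimal NumPy sketch of the two distances above; the vectors and weights are made up for illustration, not learned from data:

```python
import numpy as np

def euclidean_sq(x_i, x_j):
    """Squared Euclidean distance: every feature contributes equally."""
    diff = x_i - x_j
    return float(diff @ diff)

def weighted_euclidean_sq(x_i, x_j, w):
    """Squared weighted Euclidean distance with per-feature weights w,
    equivalent to using the diagonal metric matrix W = diag(w)."""
    diff = x_i - x_j
    return float(diff @ (w * diff))

# Illustrative vectors and weights
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 3.5])
w = np.array([0.5, 2.0, 1.0])              # second feature weighted most heavily

print(euclidean_sq(x_i, x_j))              # 1 + 4 + 0.25 = 5.25
print(weighted_euclidean_sq(x_i, x_j, w))  # 0.5*1 + 2*4 + 1*0.25 = 8.75
```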

3. Mahalanobis Distance

dist_{Mah}^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)

A general symmetric positive semi-definite matrix M (the metric matrix) captures both feature importance and correlations. This is what metric learning optimizes.

Constraint: M \succeq 0 (positive semi-definite) ensures the distance properties (non-negativity, symmetry, triangle inequality).
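A minimal NumPy sketch of the Mahalanobis distance; the metric is made up for illustration and built as A^T A so it is PSD by construction:

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
M = A.T @ A                      # A^T A is symmetric PSD by construction

x_i = rng.normal(size=3)
x_j = rng.normal(size=3)
d = mahalanobis_sq(x_i, x_j, M)
assert d >= 0                    # PSD M guarantees non-negative distances
```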

Neighborhood Component Analysis (NCA)

NCA is a supervised metric learning method that maximizes a smooth, probabilistic version of the leave-one-out accuracy of nearest-neighbor classification.

Objective Function

For a projection matrix P where M = P P^T, the probability that sample i is correctly classified by its neighbors is:

p_i = \sum_{j \in \Omega_i} \frac{\exp(-\|P^T x_i - P^T x_j\|_2^2)}{\sum_l \exp(-\|P^T x_i - P^T x_l\|_2^2)}

where \Omega_i is the set of samples with the same class as x_i.

NCA maximizes the sum of these probabilities:

\max_P \sum_{i=1}^m p_i
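Scikit-learn ships an NCA implementation; the sketch below (dataset and hyperparameters are purely illustrative) learns the projection on Iris and then runs kNN in the projected space, which corresponds to using the learned Mahalanobis metric in the original space:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# NCA learns the linear projection; kNN then classifies in the projected space
nca_knn = make_pipeline(
    NeighborhoodComponentsAnalysis(n_components=2, random_state=42),
    KNeighborsClassifier(n_neighbors=3),
)
nca_knn.fit(X_train, y_train)
print("test accuracy:", nca_knn.score(X_test, y_test))
```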

Constrained Metric Learning

Incorporate domain knowledge through constraints on similar and dissimilar pairs:

Optimization Problem

Given sets of similar pairs \mathcal{M} (must-link) and dissimilar pairs \mathcal{C} (cannot-link):

\min_M \sum_{(x_i, x_j) \in \mathcal{M}} \|x_i - x_j\|_M^2

Subject to:

\sum_{(x_i, x_k) \in \mathcal{C}} \|x_i - x_k\|_M^2 \geq 1, \quad M \succeq 0

This minimizes distances between similar pairs while keeping dissimilar pairs collectively far apart; the problem is convex and can be solved with standard convex optimization methods.
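A minimal sketch of this formulation under illustrative assumptions: the must-link/cannot-link pairs, step size, and penalty weight are made up, the cannot-link constraint is handled with a hinge penalty, and M is projected back onto the PSD cone after each step (a simple stand-in for a full convex solver):

```python
import numpy as np

def outer_sum(pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over a list of sample pairs."""
    return sum(np.outer(xi - xj, xi - xj) for xi, xj in pairs)

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    return vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T

def learn_metric(must_link, cannot_link, dim, lr=0.05, penalty=10.0, iters=200):
    """Hypothetical helper: subgradient descent on the must-link objective with a
    hinge penalty for the cannot-link constraint, projecting onto M >= 0 each step."""
    M = np.eye(dim)
    S = outer_sum(must_link)    # gradient of the sum of must-link squared distances
    D = outer_sum(cannot_link)  # gradient of the sum of cannot-link squared distances
    for _ in range(iters):
        grad = S.copy()
        if np.trace(D @ M) < 1.0:          # cannot-link constraint violated
            grad -= penalty * D
        M = project_psd(M - lr * grad)
    return M

# Toy usage: one must-link and one cannot-link pair in 2-D
ml = [(np.array([0.0, 0.0]), np.array([0.1, 1.0]))]
cl = [(np.array([0.0, 0.0]), np.array([1.0, 0.1]))]
M = learn_metric(ml, cl, dim=2)
```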

Advantages of Metric Learning

Task-Specific Optimization

Learns metrics optimized for specific learning tasks (e.g., kNN classification accuracy) rather than generic variance preservation.

Supervised Information

Can incorporate class labels, pairwise constraints, or domain knowledge to guide metric learning.

No Explicit Dimensionality Reduction

Metric learning works directly in the original space, but if the learned M has rank d' < d, it implicitly performs dimensionality reduction into a d'-dimensional subspace, as sketched below.
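A minimal sketch of this connection, using a made-up rank-2 metric on 4-dimensional data: decomposing M = P P^T via its eigendecomposition shows that the Mahalanobis distance equals the Euclidean distance after projecting with P^T.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(2, 4))
M = B.T @ B                                  # rank-2 PSD metric on 4-D data

vals, vecs = np.linalg.eigh(M)
keep = vals > 1e-10                          # keep only non-zero eigenvalues
P = vecs[:, keep] * np.sqrt(vals[keep])      # M = P P^T, P has d' columns

x_i, x_j = rng.normal(size=4), rng.normal(size=4)
d_mah = (x_i - x_j) @ M @ (x_i - x_j)
d_proj = np.sum((P.T @ x_i - P.T @ x_j) ** 2)
assert np.isclose(d_mah, d_proj)             # same distance in the d'-D space
```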