Metric Learning

Elevate from dimensionality reduction to learning optimal distance metrics. Understand how to learn metrics that make similar samples closer and dissimilar samples farther apart.

Module 9 of 9
Advanced Level
100-120 min

What is Metric Learning?

Metric learning goes beyond dimensionality reduction by directly learning an optimal distance metric for a specific task. Instead of finding a low-dimensional space, metric learning learns a distance function that makes similar samples close and dissimilar samples far apart.

Key Insight

Every low-dimensional space corresponds to a distance metric. Metric learning skips the dimensionality reduction step and directly learns the optimal metric, which can incorporate supervised information and domain knowledge.

Evolution of Distance Metrics

Distance metrics have evolved from simple to sophisticated:

1. Euclidean Distance

dist_{ED}^2(x_i, x_j) = \|x_i - x_j\|_2^2 = \sum_{u=1}^d (x_{iu} - x_{ju})^2

All features weighted equally. Simple but ignores feature importance and correlations.

2. Weighted Euclidean Distance

dist_{WED}^2(x_i, x_j) = \sum_{u=1}^d w_u (x_{iu} - x_{ju})^2 = (x_i - x_j)^T W (x_i - x_j)

A diagonal weight matrix W = diag(w_1, \dots, w_d) allows features to have different importance, but still ignores correlations between features. A minimal sketch of both distances follows below.
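A minimal NumPy sketch of the two distances above; the vectors and weights are made up for illustration, not learned from data:

```python
import numpy as np

def euclidean_sq(x_i, x_j):
    """Squared Euclidean distance: every feature contributes equally."""
    diff = x_i - x_j
    return float(diff @ diff)

def weighted_euclidean_sq(x_i, x_j, w):
    """Squared weighted Euclidean distance with per-feature weights w,
    equivalent to using the diagonal metric matrix W = diag(w)."""
    diff = x_i - x_j
    return float(diff @ (w * diff))

# Illustrative vectors and weights
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 3.5])
w = np.array([0.5, 2.0, 1.0])              # second feature weighted most heavily

print(euclidean_sq(x_i, x_j))              # 1 + 4 + 0.25 = 5.25
print(weighted_euclidean_sq(x_i, x_j, w))  # 0.5*1 + 2*4 + 1*0.25 = 8.75
```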

3. Mahalanobis Distance

dist_{Mah}^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)

A general symmetric positive semi-definite matrix M (the metric matrix) captures both feature importance and correlations. This is what metric learning optimizes.

Constraint: M \succeq 0 (positive semi-definite) ensures the distance properties (non-negativity, symmetry, triangle inequality).
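A minimal NumPy sketch of the Mahalanobis distance; the metric is made up for illustration and built as A^T A so it is PSD by construction:

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
M = A.T @ A                      # A^T A is symmetric PSD by construction

x_i = rng.normal(size=3)
x_j = rng.normal(size=3)
d = mahalanobis_sq(x_i, x_j, M)
assert d >= 0                    # PSD M guarantees non-negative distances
```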

Neighborhood Component Analysis (NCA)

NCA is a supervised metric learning method that maximizes a smooth, probabilistic version of the leave-one-out accuracy of nearest-neighbor classification.

Objective Function

For a projection matrix P where M = P P^T, the probability that sample i is correctly classified by its neighbors is:

p_i = \sum_{j \in \Omega_i} \frac{\exp(-\|P^T x_i - P^T x_j\|_2^2)}{\sum_l \exp(-\|P^T x_i - P^T x_l\|_2^2)}

where \Omega_i is the set of samples with the same class as x_i.

NCA maximizes the sum of these probabilities:

\max_P \sum_{i=1}^m p_i
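Scikit-learn ships an NCA implementation; the sketch below (dataset and hyperparameters are purely illustrative) learns the projection on Iris and then runs kNN in the projected space, which corresponds to using the learned Mahalanobis metric in the original space:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# NCA learns the linear projection; kNN then classifies in the projected space
nca_knn = make_pipeline(
    NeighborhoodComponentsAnalysis(n_components=2, random_state=42),
    KNeighborsClassifier(n_neighbors=3),
)
nca_knn.fit(X_train, y_train)
print("test accuracy:", nca_knn.score(X_test, y_test))
```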

Constrained Metric Learning

Incorporate domain knowledge through constraints on similar and dissimilar pairs:

Optimization Problem

Given sets of similar pairs \mathcal{M} (must-link) and dissimilar pairs \mathcal{C} (cannot-link):

\min_M \sum_{(x_i, x_j) \in \mathcal{M}} \|x_i - x_j\|_M^2

Subject to:

\sum_{(x_i, x_k) \in \mathcal{C}} \|x_i - x_k\|_M^2 \geq 1, \quad M \succeq 0

This minimizes distances between similar pairs while keeping dissimilar pairs collectively far apart; the problem is convex and can be solved with standard convex optimization methods.
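A minimal sketch of this formulation under illustrative assumptions: the must-link/cannot-link pairs, step size, and penalty weight are made up, the cannot-link constraint is handled with a hinge penalty, and M is projected back onto the PSD cone after each step (a simple stand-in for a full convex solver):

```python
import numpy as np

def outer_sum(pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over a list of sample pairs."""
    return sum(np.outer(xi - xj, xi - xj) for xi, xj in pairs)

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    return vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T

def learn_metric(must_link, cannot_link, dim, lr=0.05, penalty=10.0, iters=200):
    """Hypothetical helper: subgradient descent on the must-link objective with a
    hinge penalty for the cannot-link constraint, projecting onto M >= 0 each step."""
    M = np.eye(dim)
    S = outer_sum(must_link)    # gradient of the sum of must-link squared distances
    D = outer_sum(cannot_link)  # gradient of the sum of cannot-link squared distances
    for _ in range(iters):
        grad = S.copy()
        if np.trace(D @ M) < 1.0:          # cannot-link constraint violated
            grad -= penalty * D
        M = project_psd(M - lr * grad)
    return M

# Toy usage: one must-link and one cannot-link pair in 2-D
ml = [(np.array([0.0, 0.0]), np.array([0.1, 1.0]))]
cl = [(np.array([0.0, 0.0]), np.array([1.0, 0.1]))]
M = learn_metric(ml, cl, dim=2)
```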

Advantages of Metric Learning

Task-Specific Optimization

Learns metrics optimized for specific learning tasks (e.g., kNN classification accuracy) rather than generic variance preservation.

Supervised Information

Can incorporate class labels, pairwise constraints, or domain knowledge to guide metric learning.

No Explicit Dimensionality Reduction

Metric learning works directly in the original space, but if the learned M has rank d' < d, it implicitly performs dimensionality reduction into a d'-dimensional subspace, as sketched below.
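A minimal sketch of this connection, using a made-up rank-2 metric on 4-dimensional data: decomposing M = P P^T via its eigendecomposition shows that the Mahalanobis distance equals the Euclidean distance after projecting with P^T.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(2, 4))
M = B.T @ B                                  # rank-2 PSD metric on 4-D data

vals, vecs = np.linalg.eigh(M)
keep = vals > 1e-10                          # keep only non-zero eigenvalues
P = vecs[:, keep] * np.sqrt(vals[keep])      # M = P P^T, P has d' columns

x_i, x_j = rng.normal(size=4), rng.normal(size=4)
d_mah = (x_i - x_j) @ M @ (x_i - x_j)
d_proj = np.sum((P.T @ x_i - P.T @ x_j) ** 2)
assert np.isclose(d_mah, d_proj)             # same distance in the d'-D space
```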