
Kernel PCA

Extend PCA to handle nonlinear data through the kernel trick. Learn how to perform PCA in high-dimensional feature spaces without explicit computation.

Module 5 of 9
Advanced Level
90-120 min

Why Kernel PCA?

Standard PCA is a linear dimensionality reduction method. When data lies on a nonlinear manifold (e.g., a curved surface embedded in a high-dimensional space), linear PCA cannot capture the structure. Kernel PCA addresses this by mapping the data into a high-dimensional feature space in which the structure becomes linear, so that ordinary PCA can be applied there.

Core Idea

Map samples to a high-dimensional feature space via a nonlinear mapping $\phi(x)$, then perform PCA in that space:

$$x \in \mathbb{R}^d \xrightarrow{\phi} \phi(x) \in \mathbb{R}^{d'} \xrightarrow{\text{PCA}} z \in \mathbb{R}^{d''}$$

The kernel trick allows us to compute PCA in feature space without explicitly computing $\phi(x)$.

The Kernel Trick

Instead of explicitly computing $\phi(x_i)$ (which may be infinite-dimensional), we use a kernel function that computes inner products in feature space:

Kernel Function

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
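To make this concrete, here is a small worked example (an illustration added here, not part of the module's derivation): for the degree-2 polynomial kernel $\kappa(x, y) = (x^T y)^2$ on $\mathbb{R}^2$, the corresponding feature map can be written out explicitly:

$$\kappa(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2 = \phi(x)^T \phi(y), \qquad \phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)^T$$

Evaluating $\kappa$ costs one inner product in the original space, whereas $\phi$ would have to be constructed coordinate by coordinate; for kernels such as the RBF kernel below, $\phi$ is infinite-dimensional and the kernel is the only practical route.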

Common kernel functions:

Polynomial Kernel:

$$\kappa(x_i, x_j) = (x_i^T x_j + 1)^d$$

RBF (Gaussian) Kernel:

$$\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
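A minimal NumPy sketch of these two kernels (the function names and the example data below are mine, not part of this module):

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2):
    """Polynomial kernel: kappa(x_i, y_j) = (x_i^T y_j + 1)^degree, for all pairs."""
    return (X @ Y.T + 1.0) ** degree

def rbf_kernel(X, Y, sigma=1.0):
    """RBF (Gaussian) kernel: kappa(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Example: m = 5 samples in R^3
X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)   # kernel matrix, shape (5, 5), symmetric, K_ii = 1
```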

Key Derivation

In feature space (assuming the mapped samples $\phi(x_i)$ are centered), the covariance matrix is:

$$\Sigma' = \frac{1}{m}\sum_{i=1}^m \phi(x_i)\phi(x_i)^T$$

Every eigenvector $w$ of $\Sigma'$ lies in the span of the mapped samples, since $\Sigma' w = \lambda w$ implies $w = \frac{1}{m\lambda}\sum_{i=1}^m \phi(x_i)\left(\phi(x_i)^T w\right)$. The projection matrix $W$ in feature space can therefore be expressed as:

$$W = \sum_{i=1}^m \phi(x_i)\alpha_i$$

Kernel Matrix

Define the kernel matrix $K \in \mathbb{R}^{m \times m}$ with $K_{ij} = \kappa(x_i, x_j)$. Substituting $W = \sum_i \phi(x_i)\alpha_i$ into the feature-space eigenvalue equation $\Sigma' W = \lambda W$ and left-multiplying by $\phi(x_j)^T$ for each $j$ turns it into an eigenvalue problem on $K$:

$$K A = m\lambda A$$

where $A = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T$ contains the coefficients.
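A minimal NumPy sketch of solving this eigenvalue problem, assuming a precomputed kernel matrix such as the `rbf_kernel` output above. Centering of $K$ (the kernel-matrix counterpart of centering the $\phi(x_i)$) is included even though the module does not spell it out:

```python
import numpy as np

def fit_kernel_pca(K, n_components):
    """Solve the kernel PCA eigenproblem for an m x m kernel matrix K."""
    m = K.shape[0]

    # Center K in feature space (equivalent to subtracting the mean of the phi(x_i)).
    one_m = np.full((m, m), 1.0 / m)
    K_c = K - one_m @ K - K @ one_m + one_m @ K @ one_m

    # eigh returns eigenvalues in ascending order; flip to descending.
    eigvals, eigvecs = np.linalg.eigh(K_c)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # eigvals correspond to m * lambda in K A = m lambda A above.
    # Rescale each coefficient vector so the feature-space direction w_j has unit norm:
    # w_j^T w_j = alpha_j^T K_c alpha_j = eigval_j * ||alpha_j||^2 = 1
    # (assumes the leading eigenvalues are strictly positive).
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return alphas, eigvals[:n_components]
```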

Projecting New Samples

For a new sample $x$, its projection onto the $j$-th principal component is:

$$z_j = w_j^T \phi(x) = \sum_{i=1}^m \alpha_i^j \kappa(x_i, x)$$

where $\alpha_i^j$ is the coefficient for the $j$-th principal component and sample $i$. This allows projection without explicitly computing the feature mapping.
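Continuing the sketch, projection needs only kernel evaluations against the training samples. The helpers `rbf_kernel` and `fit_kernel_pca` from the earlier sketches are assumed, and the new kernel rows are centered consistently with the fit:

```python
import numpy as np

def transform_kernel_pca(K_new, K_train, alphas):
    """Project new samples: z_j = sum_i alpha_i^j * kappa(x_i, x).

    K_new:   (n_new, m) kernel values between new samples and training samples.
    K_train: (m, m) training kernel matrix (uncentered).
    alphas:  (m, n_components) coefficients from fit_kernel_pca.
    """
    m = K_train.shape[0]

    # Apply the same feature-space centering that was used during fitting.
    one_m = np.full((m, m), 1.0 / m)
    one_new = np.full((K_new.shape[0], m), 1.0 / m)
    K_new_c = K_new - one_new @ K_train - K_new @ one_m + one_new @ K_train @ one_m

    return K_new_c @ alphas   # shape (n_new, n_components)

# Illustrative usage with the earlier helpers:
# K_train = rbf_kernel(X_train, X_train)
# alphas, _ = fit_kernel_pca(K_train, n_components=2)
# Z_new = transform_kernel_pca(rbf_kernel(X_new, X_train), K_train, alphas)
```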

Advantages and Limitations

Advantages

• Handles nonlinear data structures
• No explicit high-dimensional computation
• Flexible kernel selection
• Preserves kernel-based similarity

Limitations

• Kernel selection is critical
• O(m²) memory for the kernel matrix
• Slow for large datasets
• No explicit feature interpretation
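As a practical note, scikit-learn ships a ready-made implementation in sklearn.decomposition.KernelPCA; the parameter values below are illustrative only, not a recommendation:

```python
from sklearn.decomposition import KernelPCA

# In scikit-learn's RBF kernel, gamma plays the role of 1 / (2 * sigma^2) above.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)   # X: (m, d) data matrix; Z: (m, 2) nonlinear embedding
```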