
Kernel PCA

Extend PCA to handle nonlinear data through the kernel trick. Learn how to perform PCA in high-dimensional feature spaces without explicit computation.

Module 5 of 9
Advanced Level
90-120 min

Why Kernel PCA?

Standard PCA is a linear dimensionality reduction method. When data lies on a nonlinear manifold (e.g., a curved surface embedded in a high-dimensional space), linear PCA cannot capture the structure. Kernel PCA addresses this by mapping the data into a high-dimensional feature space in which the structure becomes linear, so that ordinary PCA can be applied there.

Core Idea

Map samples to a high-dimensional feature space via a nonlinear mapping $\phi(x)$, then perform PCA in that space:

$$x \in \mathbb{R}^d \xrightarrow{\phi} \phi(x) \in \mathbb{R}^{d'} \xrightarrow{\text{PCA}} z \in \mathbb{R}^{d''}$$

The kernel trick allows us to compute PCA in feature space without explicitly computing $\phi(x)$.

The Kernel Trick

Instead of explicitly computing $\phi(x_i)$ (which may be infinite-dimensional), we use a kernel function that computes inner products in feature space:

Kernel Function

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
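To make this concrete, here is a small worked example (an illustration added here, not part of the module's derivation): for the degree-2 polynomial kernel $\kappa(x, y) = (x^T y)^2$ on $\mathbb{R}^2$, the corresponding feature map can be written out explicitly:

$$\kappa(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2 = \phi(x)^T \phi(y), \qquad \phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)^T$$

Evaluating $\kappa$ costs one inner product in the original space, whereas $\phi$ would have to be constructed coordinate by coordinate; for kernels such as the RBF kernel below, $\phi$ is infinite-dimensional and the kernel is the only practical route.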

Common kernel functions:

Polynomial Kernel:

$$\kappa(x_i, x_j) = (x_i^T x_j + 1)^d$$

RBF (Gaussian) Kernel:

$$\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
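A minimal NumPy sketch of these two kernels (the function names and the example data below are mine, not part of this module):

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2):
    """Polynomial kernel: kappa(x_i, y_j) = (x_i^T y_j + 1)^degree, for all pairs."""
    return (X @ Y.T + 1.0) ** degree

def rbf_kernel(X, Y, sigma=1.0):
    """RBF (Gaussian) kernel: kappa(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Example: m = 5 samples in R^3
X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)   # kernel matrix, shape (5, 5), symmetric, K_ii = 1
```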

Key Derivation

In feature space (assuming the mapped samples $\phi(x_i)$ are centered), the covariance matrix is:

$$\Sigma' = \frac{1}{m}\sum_{i=1}^m \phi(x_i)\phi(x_i)^T$$

Every eigenvector $w$ of $\Sigma'$ lies in the span of the mapped samples, since $\Sigma' w = \lambda w$ implies $w = \frac{1}{m\lambda}\sum_{i=1}^m \phi(x_i)\left(\phi(x_i)^T w\right)$. The projection matrix $W$ in feature space can therefore be expressed as:

$$W = \sum_{i=1}^m \phi(x_i)\alpha_i$$

Kernel Matrix

Define the kernel matrix $K \in \mathbb{R}^{m \times m}$ with $K_{ij} = \kappa(x_i, x_j)$. Substituting $W = \sum_i \phi(x_i)\alpha_i$ into the feature-space eigenvalue equation $\Sigma' W = \lambda W$ and left-multiplying by $\phi(x_j)^T$ for each $j$ turns it into an eigenvalue problem on $K$:

$$K A = m\lambda A$$

where $A = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T$ contains the coefficients.
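A minimal NumPy sketch of solving this eigenvalue problem, assuming a precomputed kernel matrix such as the `rbf_kernel` output above. Centering of $K$ (the kernel-matrix counterpart of centering the $\phi(x_i)$) is included even though the module does not spell it out:

```python
import numpy as np

def fit_kernel_pca(K, n_components):
    """Solve the kernel PCA eigenproblem for an m x m kernel matrix K."""
    m = K.shape[0]

    # Center K in feature space (equivalent to subtracting the mean of the phi(x_i)).
    one_m = np.full((m, m), 1.0 / m)
    K_c = K - one_m @ K - K @ one_m + one_m @ K @ one_m

    # eigh returns eigenvalues in ascending order; flip to descending.
    eigvals, eigvecs = np.linalg.eigh(K_c)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # eigvals correspond to m * lambda in K A = m lambda A above.
    # Rescale each coefficient vector so the feature-space direction w_j has unit norm:
    # w_j^T w_j = alpha_j^T K_c alpha_j = eigval_j * ||alpha_j||^2 = 1
    # (assumes the leading eigenvalues are strictly positive).
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return alphas, eigvals[:n_components]
```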

Projecting New Samples

For a new sample $x$, its projection onto the $j$-th principal component is:

$$z_j = w_j^T \phi(x) = \sum_{i=1}^m \alpha_i^j \kappa(x_i, x)$$

where $\alpha_i^j$ is the coefficient for the $j$-th principal component and sample $i$. This allows projection without explicitly computing the feature mapping.
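Continuing the sketch, projection needs only kernel evaluations against the training samples. The helpers `rbf_kernel` and `fit_kernel_pca` from the earlier sketches are assumed, and the new kernel rows are centered consistently with the fit:

```python
import numpy as np

def transform_kernel_pca(K_new, K_train, alphas):
    """Project new samples: z_j = sum_i alpha_i^j * kappa(x_i, x).

    K_new:   (n_new, m) kernel values between new samples and training samples.
    K_train: (m, m) training kernel matrix (uncentered).
    alphas:  (m, n_components) coefficients from fit_kernel_pca.
    """
    m = K_train.shape[0]

    # Apply the same feature-space centering that was used during fitting.
    one_m = np.full((m, m), 1.0 / m)
    one_new = np.full((K_new.shape[0], m), 1.0 / m)
    K_new_c = K_new - one_new @ K_train - K_new @ one_m + one_new @ K_train @ one_m

    return K_new_c @ alphas   # shape (n_new, n_components)

# Illustrative usage with the earlier helpers:
# K_train = rbf_kernel(X_train, X_train)
# alphas, _ = fit_kernel_pca(K_train, n_components=2)
# Z_new = transform_kernel_pca(rbf_kernel(X_new, X_train), K_train, alphas)
```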

Advantages and Limitations

Advantages

• Handles nonlinear data structures
• No explicit high-dimensional computation
• Flexible kernel selection
• Preserves kernel-based similarity

Limitations

• Kernel selection is critical
• O(m²) memory for the kernel matrix
• Slow for large datasets
• No explicit feature interpretation
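As a practical note, scikit-learn ships a ready-made implementation in sklearn.decomposition.KernelPCA; the parameter values below are illustrative only, not a recommendation:

```python
from sklearn.decomposition import KernelPCA

# In scikit-learn's RBF kernel, gamma plays the role of 1 / (2 * sigma^2) above.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)   # X: (m, d) data matrix; Z: (m, 2) nonlinear embedding
```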