Principal Component Analysis (PCA)

Master the most classic linear dimensionality reduction method. Learn PCA's dual criteria, optimization derivation, algorithm steps, and methods for selecting optimal dimensionality.

Module 4 of 9
Intermediate to Advanced
100-120 min

What is Principal Component Analysis?

Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It finds a linear projection that maximizes the variance of the projected data, or equivalently, minimizes the reconstruction error.

Key Insight

PCA finds the directions (principal components) in which the data varies the most. These directions capture the most information about the data structure.

Advantages

  • Simple and interpretable
  • No hyperparameters
  • Fast computation
  • Removes correlation
  • Preserves maximum variance

Limitations

  • Linear only
  • Assumes Gaussian-like data
  • Sensitive to feature scaling
  • Components lose feature-level interpretability
  • May miss nonlinear structure

PCA's Dual Criteria

PCA can be derived from two equivalent perspectives:

1. Minimum Reconstruction Error

Find the low-dimensional hyperplane such that the sum of squared distances from sample points to the hyperplane is minimized.

$$\min_W \sum_{i=1}^m \|x_i - \tilde{x}_i\|^2$$

Where $\tilde{x}_i$ is the reconstruction of $x_i$ from its projection.

2. Maximum Separability

Find the projection direction that maximizes the variance of projected samples, making them as spread out as possible.

$$\max_W \text{Var}(W^T X)$$

Maximizing variance ensures maximum information retention in the projection.

Equivalence Proof

These two criteria are mathematically equivalent. For a unit-norm projection, each sample's squared norm splits (by the Pythagorean theorem) into the squared norm of its projection plus its squared reconstruction error; since the total variance is fixed, minimizing reconstruction error is the same as maximizing the variance of the projections. This dual interpretation provides both intuitive understanding and computational convenience.
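A minimal numerical check of this equivalence (NumPy, synthetic two-dimensional data; all variable names are illustrative): for every unit projection direction, the projected variance and the mean squared reconstruction error sum to the same total variance, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # correlated 2-D samples
X = X - X.mean(axis=0)                                              # center the data

total_var = np.mean(np.sum(X**2, axis=1))   # total variance (trace of the covariance matrix)

for angle in np.linspace(0, np.pi, 5):
    w = np.array([np.cos(angle), np.sin(angle)])         # unit projection direction
    z = X @ w                                             # 1-D projections
    recon = np.outer(z, w)                                # reconstructions z_i * w
    proj_var = np.mean(z**2)                              # variance of the projected data
    recon_err = np.mean(np.sum((X - recon)**2, axis=1))   # mean squared reconstruction error
    # proj_var + recon_err equals total_var for every direction
    print(f"{proj_var:.3f} + {recon_err:.3f} = {proj_var + recon_err:.3f} (total {total_var:.3f})")
```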

Optimization Derivation

Starting from the maximum separability perspective:

Step 1: Center the Data

First, center all samples by subtracting the mean:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^m x_i, \quad x_i \leftarrow x_i - \bar{x}$$

Centering ensures the mean is zero, simplifying subsequent calculations.
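A minimal NumPy sketch of this step, using the column-per-sample convention $X \in \mathbb{R}^{d \times m}$ introduced in the next step (the data here is random and purely illustrative):

```python
import numpy as np

# X has shape (d, m): one sample per column, matching X in R^{d x m} below
X = np.random.default_rng(0).normal(size=(3, 100))

x_bar = X.mean(axis=1, keepdims=True)   # feature-wise mean, shape (d, 1)
X_centered = X - x_bar                  # subtract the mean from every sample
```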

Step 2: Covariance Matrix

Compute the sample covariance matrix:

$$\Sigma = \frac{1}{m}XX^T$$

Where $X \in \mathbb{R}^{d \times m}$ is the centered data matrix.

The covariance matrix captures how features vary together.
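Continuing the centering sketch above, the covariance matrix is a single matrix product (the `np.cov` call is only a sanity check against NumPy's built-in estimator):

```python
# X_centered has shape (d, m) from the centering sketch above
Sigma = (X_centered @ X_centered.T) / X_centered.shape[1]   # Sigma = (1/m) X X^T, shape (d, d)

# bias=True makes np.cov use the same 1/m normalization
assert np.allclose(Sigma, np.cov(X_centered, bias=True))
```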

Step 3: Optimization Objective

The total variance of the projected data (summed over the projected coordinates) is:

$$\text{Var}(W^T X) = \frac{1}{m}\sum_{i=1}^m \|W^T x_i\|^2 = \text{tr}(W^T \Sigma W)$$

We maximize this subject to $W^T W = I$ (orthonormal constraint):

$$\max_W \text{tr}(W^T \Sigma W) \quad \text{s.t. } W^T W = I$$

Step 4: Solution via Eigenvalue Decomposition

Using Lagrange multipliers, the optimal $W$ consists of the eigenvectors of $\Sigma$ corresponding to the largest eigenvalues:

$$\Sigma w_i = \lambda_i w_i, \quad i = 1, \ldots, d'$$
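For completeness, a brief sketch of the Lagrange-multiplier step behind this result (with the multiplier matrix $\Lambda$ taken diagonal):

$$L(W, \Lambda) = \text{tr}(W^T \Sigma W) + \text{tr}\big(\Lambda (I - W^T W)\big), \qquad \frac{\partial L}{\partial W} = 2\Sigma W - 2W\Lambda = 0 \;\Rightarrow\; \Sigma W = W\Lambda$$

Column by column this is exactly $\Sigma w_i = \lambda_i w_i$, and at such a stationary point the objective equals $\text{tr}(W^T \Sigma W) = \sum_{i=1}^{d'} \lambda_i$, which is largest when the $d'$ largest eigenvalues are chosen.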

Select the top $d'$ eigenvectors (principal components) ordered by decreasing eigenvalues.
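In NumPy, continuing the sketch from Step 2, this looks like the following (`d_prime` stands for $d'$; note that `np.linalg.eigh` returns eigenvalues in ascending order, so they must be re-sorted):

```python
import numpy as np

d_prime = 2   # target dimensionality d' (chosen by one of the methods in the last section)

# Sigma is the (d, d) covariance matrix computed in Step 2
eigvals, eigvecs = np.linalg.eigh(Sigma)     # symmetric matrix -> eigh is appropriate
order = np.argsort(eigvals)[::-1]            # re-sort into descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :d_prime]                     # top d' eigenvectors as columns of W
```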

PCA Algorithm Steps

The complete PCA algorithm:

Step 1

Center the Data

Compute the mean $\bar{x} = \frac{1}{m}\sum_{i=1}^m x_i$ and center each sample: $x_i \leftarrow x_i - \bar{x}$.

Step 2

Compute Covariance Matrix

Calculate $\Sigma = \frac{1}{m}XX^T$, where $X$ is the centered data matrix.

Step 3

Eigenvalue Decomposition

Decompose $\Sigma = V\Lambda V^T$ and sort the eigenvalues in descending order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$.

Step 4

Select Principal Components

Take the top $d'$ eigenvectors $w_1, w_2, \ldots, w_{d'}$ to form the projection matrix $W = (w_1, w_2, \ldots, w_{d'})$.

Step 5

Project to Low Dimensions

For a new sample $x$, compute its low-dimensional representation $z = W^T(x - \bar{x})$.
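Putting the five steps together, a compact reference implementation in NumPy (a sketch, not a drop-in replacement for a library PCA such as scikit-learn's; samples are passed as rows here for convenience):

```python
import numpy as np

class SimplePCA:
    """Minimal PCA: fit the top eigenvectors of the covariance matrix, project new samples."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # X has shape (m, d): one sample per row
        self.mean_ = X.mean(axis=0)                        # Step 1: compute the mean
        Xc = X - self.mean_                                # ... and center the data
        Sigma = (Xc.T @ Xc) / X.shape[0]                   # Step 2: covariance matrix, shape (d, d)
        eigvals, eigvecs = np.linalg.eigh(Sigma)           # Step 3: eigenvalue decomposition
        order = np.argsort(eigvals)[::-1]                  # sort eigenvalues in descending order
        self.eigenvalues_ = eigvals[order]
        self.components_ = eigvecs[:, order][:, :self.n_components]   # Step 4: top d' eigenvectors
        return self

    def transform(self, X):
        # Step 5: z = W^T (x - x_bar), vectorized over the rows of X
        return (X - self.mean_) @ self.components_


# usage on random data (illustrative only)
X = np.random.default_rng(0).normal(size=(100, 5))
Z = SimplePCA(n_components=2).fit(X).transform(X)   # Z has shape (100, 2)
```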

Selecting the Optimal Dimensionality d'

Choosing the right number of dimensions is crucial. Here are three common methods:

Method 1: User Specification

Specify $d'$ based on downstream task requirements (e.g., visualization needs 2D or 3D, or computational constraints limit dimensions).
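For example, projecting to two dimensions for plotting with scikit-learn's PCA (a typical usage sketch; the random matrix stands in for a real feature matrix with samples as rows):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 10))   # stand-in for a real feature matrix
X_2d = PCA(n_components=2).fit_transform(X)           # fixed d' = 2, chosen for visualization
```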

Method 2: Cross-Validation

Train a downstream model (e.g., a kNN classifier) with different $d'$ values and select the one with the best performance.

This method optimizes for the specific learning task.
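A sketch of this approach with scikit-learn, grid-searching the number of components in a PCA + kNN pipeline (the dataset and candidate grid are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([("pca", PCA()), ("knn", KNeighborsClassifier(n_neighbors=5))])
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 30, 40]}, cv=5)
search.fit(X, y)

print(search.best_params_)   # the d' with the best cross-validated accuracy
```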

Method 3: Variance Threshold (Scree Plot)

Select the smallest $d'$ such that the cumulative variance ratio exceeds a threshold $t$ (typically 0.85 or 0.9):

$$\frac{\sum_{i=1}^{d'} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq t$$

This ensures we retain at least $t \times 100\%$ of the total variance in the data.
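With the descending-sorted eigenvalues `eigvals` from the earlier sketch, this rule is a short computation (`t` is the chosen threshold):

```python
import numpy as np

t = 0.90                                      # retain at least 90% of the total variance
ratio = np.cumsum(eigvals) / np.sum(eigvals)  # cumulative variance ratio for each candidate d'
d_prime = int(np.searchsorted(ratio, t) + 1)  # smallest d' whose cumulative ratio reaches t
```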