Principal Component Analysis (PCA)

Master the most classic linear dimensionality reduction method. Learn PCA's dual criteria, optimization derivation, algorithm steps, and methods for selecting optimal dimensionality.

Module 4 of 9
Intermediate to Advanced
100-120 min

What is Principal Component Analysis?

Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It finds a linear projection that maximizes the variance of the projected data, or equivalently, minimizes the reconstruction error.

Key Insight

PCA finds the directions (principal components) in which the data varies the most. These directions capture the most information about the data structure.

Advantages

  • Simple and interpretable
  • No hyperparameters
  • Fast computation
  • Removes correlation
  • Preserves maximum variance

Limitations

  • Linear only
  • Assumes Gaussian-like data
  • Sensitive to feature scaling
  • Components lose feature-level interpretability
  • May miss nonlinear structure

PCA's Dual Criteria

PCA can be derived from two equivalent perspectives:

1. Minimum Reconstruction Error

Find the low-dimensional hyperplane such that the sum of squared distances from sample points to the hyperplane is minimized.

$$\min_W \sum_{i=1}^m \|x_i - \tilde{x}_i\|^2$$

Where $\tilde{x}_i$ is the reconstruction of $x_i$ from its projection.

2. Maximum Separability

Find the projection direction that maximizes the variance of projected samples, making them as spread out as possible.

$$\max_W \text{Var}(W^T X)$$

Maximizing variance ensures maximum information retention in the projection.

Equivalence Proof

These two criteria are mathematically equivalent. For a unit-norm projection, each sample's squared norm splits (by the Pythagorean theorem) into the squared norm of its projection plus its squared reconstruction error; since the total variance is fixed, minimizing reconstruction error is the same as maximizing the variance of the projections. This dual interpretation provides both intuitive understanding and computational convenience.
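A minimal numerical check of this equivalence (NumPy, synthetic two-dimensional data; all variable names are illustrative): for every unit projection direction, the projected variance and the mean squared reconstruction error sum to the same total variance, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # correlated 2-D samples
X = X - X.mean(axis=0)                                              # center the data

total_var = np.mean(np.sum(X**2, axis=1))   # total variance (trace of the covariance matrix)

for angle in np.linspace(0, np.pi, 5):
    w = np.array([np.cos(angle), np.sin(angle)])         # unit projection direction
    z = X @ w                                             # 1-D projections
    recon = np.outer(z, w)                                # reconstructions z_i * w
    proj_var = np.mean(z**2)                              # variance of the projected data
    recon_err = np.mean(np.sum((X - recon)**2, axis=1))   # mean squared reconstruction error
    # proj_var + recon_err equals total_var for every direction
    print(f"{proj_var:.3f} + {recon_err:.3f} = {proj_var + recon_err:.3f} (total {total_var:.3f})")
```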

Optimization Derivation

Starting from the maximum separability perspective:

Step 1: Center the Data

First, center all samples by subtracting the mean:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^m x_i, \quad x_i \leftarrow x_i - \bar{x}$$

Centering ensures the mean is zero, simplifying subsequent calculations.
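A minimal NumPy sketch of this step, using the column-per-sample convention $X \in \mathbb{R}^{d \times m}$ introduced in the next step (the data here is random and purely illustrative):

```python
import numpy as np

# X has shape (d, m): one sample per column, matching X in R^{d x m} below
X = np.random.default_rng(0).normal(size=(3, 100))

x_bar = X.mean(axis=1, keepdims=True)   # feature-wise mean, shape (d, 1)
X_centered = X - x_bar                  # subtract the mean from every sample
```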

Step 2: Covariance Matrix

Compute the sample covariance matrix:

$$\Sigma = \frac{1}{m}XX^T$$

Where $X \in \mathbb{R}^{d \times m}$ is the centered data matrix.

The covariance matrix captures how features vary together.
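Continuing the centering sketch above, the covariance matrix is a single matrix product (the `np.cov` call is only a sanity check against NumPy's built-in estimator):

```python
# X_centered has shape (d, m) from the centering sketch above
Sigma = (X_centered @ X_centered.T) / X_centered.shape[1]   # Sigma = (1/m) X X^T, shape (d, d)

# bias=True makes np.cov use the same 1/m normalization
assert np.allclose(Sigma, np.cov(X_centered, bias=True))
```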

Step 3: Optimization Objective

The total variance of the projected data (summed over the projected coordinates) is:

$$\text{Var}(W^T X) = \frac{1}{m}\sum_{i=1}^m \|W^T x_i\|^2 = \text{tr}(W^T \Sigma W)$$

We maximize this subject to $W^T W = I$ (orthonormal constraint):

$$\max_W \text{tr}(W^T \Sigma W) \quad \text{s.t. } W^T W = I$$

Step 4: Solution via Eigenvalue Decomposition

Using Lagrange multipliers, the optimal $W$ consists of the eigenvectors of $\Sigma$ corresponding to the largest eigenvalues:

$$\Sigma w_i = \lambda_i w_i, \quad i = 1, \ldots, d'$$
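For completeness, a brief sketch of the Lagrange-multiplier step behind this result (with the multiplier matrix $\Lambda$ taken diagonal):

$$L(W, \Lambda) = \text{tr}(W^T \Sigma W) + \text{tr}\big(\Lambda (I - W^T W)\big), \qquad \frac{\partial L}{\partial W} = 2\Sigma W - 2W\Lambda = 0 \;\Rightarrow\; \Sigma W = W\Lambda$$

Column by column this is exactly $\Sigma w_i = \lambda_i w_i$, and at such a stationary point the objective equals $\text{tr}(W^T \Sigma W) = \sum_{i=1}^{d'} \lambda_i$, which is largest when the $d'$ largest eigenvalues are chosen.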

Select the top $d'$ eigenvectors (principal components) ordered by decreasing eigenvalues.
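In NumPy, continuing the sketch from Step 2, this looks like the following (`d_prime` stands for $d'$; note that `np.linalg.eigh` returns eigenvalues in ascending order, so they must be re-sorted):

```python
import numpy as np

d_prime = 2   # target dimensionality d' (chosen by one of the methods in the last section)

# Sigma is the (d, d) covariance matrix computed in Step 2
eigvals, eigvecs = np.linalg.eigh(Sigma)     # symmetric matrix -> eigh is appropriate
order = np.argsort(eigvals)[::-1]            # re-sort into descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :d_prime]                     # top d' eigenvectors as columns of W
```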

PCA Algorithm Steps

The complete PCA algorithm:

Step 1

Center the Data

Compute the mean $\bar{x} = \frac{1}{m}\sum_{i=1}^m x_i$ and center each sample: $x_i \leftarrow x_i - \bar{x}$.

Step 2

Compute Covariance Matrix

Calculate $\Sigma = \frac{1}{m}XX^T$, where $X$ is the centered data matrix.

Step 3

Eigenvalue Decomposition

Decompose $\Sigma = V\Lambda V^T$ and sort the eigenvalues in descending order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$.

Step 4

Select Principal Components

Take the top $d'$ eigenvectors $w_1, w_2, \ldots, w_{d'}$ to form the projection matrix $W = (w_1, w_2, \ldots, w_{d'})$.

Step 5

Project to Low Dimensions

For a new sample $x$, compute its low-dimensional representation $z = W^T(x - \bar{x})$.
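Putting the five steps together, a compact reference implementation in NumPy (a sketch, not a drop-in replacement for a library PCA such as scikit-learn's; samples are passed as rows here for convenience):

```python
import numpy as np

class SimplePCA:
    """Minimal PCA: fit the top eigenvectors of the covariance matrix, project new samples."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # X has shape (m, d): one sample per row
        self.mean_ = X.mean(axis=0)                        # Step 1: compute the mean
        Xc = X - self.mean_                                # ... and center the data
        Sigma = (Xc.T @ Xc) / X.shape[0]                   # Step 2: covariance matrix, shape (d, d)
        eigvals, eigvecs = np.linalg.eigh(Sigma)           # Step 3: eigenvalue decomposition
        order = np.argsort(eigvals)[::-1]                  # sort eigenvalues in descending order
        self.eigenvalues_ = eigvals[order]
        self.components_ = eigvecs[:, order][:, :self.n_components]   # Step 4: top d' eigenvectors
        return self

    def transform(self, X):
        # Step 5: z = W^T (x - x_bar), vectorized over the rows of X
        return (X - self.mean_) @ self.components_


# usage on random data (illustrative only)
X = np.random.default_rng(0).normal(size=(100, 5))
Z = SimplePCA(n_components=2).fit(X).transform(X)   # Z has shape (100, 2)
```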

Selecting the Optimal Dimensionality d'

Choosing the right number of dimensions is crucial. Here are three common methods:

Method 1: User Specification

Specify $d'$ based on downstream task requirements (e.g., visualization needs 2D or 3D, or computational constraints limit dimensions).
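For example, projecting to two dimensions for plotting with scikit-learn's PCA (a typical usage sketch; the random matrix stands in for a real feature matrix with samples as rows):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 10))   # stand-in for a real feature matrix
X_2d = PCA(n_components=2).fit_transform(X)           # fixed d' = 2, chosen for visualization
```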

Method 2: Cross-Validation

Train a downstream model (e.g., a kNN classifier) with different $d'$ values and select the one with the best performance.

This method optimizes for the specific learning task.
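A sketch of this approach with scikit-learn, grid-searching the number of components in a PCA + kNN pipeline (the dataset and candidate grid are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([("pca", PCA()), ("knn", KNeighborsClassifier(n_neighbors=5))])
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 30, 40]}, cv=5)
search.fit(X, y)

print(search.best_params_)   # the d' with the best cross-validated accuracy
```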

Method 3: Variance Threshold (Scree Plot)

Select the smallest $d'$ such that the cumulative variance ratio exceeds a threshold $t$ (typically 0.85 or 0.9):

$$\frac{\sum_{i=1}^{d'} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq t$$

This ensures we retain at least $t \times 100\%$ of the total variance in the data.
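With the descending-sorted eigenvalues `eigvals` from the earlier sketch, this rule is a short computation (`t` is the chosen threshold):

```python
import numpy as np

t = 0.90                                      # retain at least 90% of the total variance
ratio = np.cumsum(eigvals) / np.sum(eigvals)  # cumulative variance ratio for each candidate d'
d_prime = int(np.searchsorted(ratio, t) + 1)  # smallest d' whose cumulative ratio reaches t
```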