Master the most classic linear dimensionality reduction method. Learn PCA's dual criteria, optimization derivation, algorithm steps, and methods for selecting optimal dimensionality.
Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It finds a linear projection that maximizes the variance of the projected data, or equivalently, minimizes the reconstruction error.
PCA finds the directions (principal components) in which the data varies the most. These directions capture the most information about the data structure.
PCA can be derived from two equivalent perspectives:
Find the low-dimensional hyperplane such that the sum of squared distances from sample points to the hyperplane is minimized:

$$\min_{W} \sum_{i=1}^{m} \left\| \hat{x}_i - x_i \right\|^2$$

Where $\hat{x}_i = W z_i$ is the reconstruction of $x_i$ from its projection $z_i = W^\top x_i$.
Find the projection direction that maximizes the variance of projected samples, making them as spread out as possible.
Maximizing variance ensures maximum information retention in the projection.
These two criteria are mathematically equivalent: minimizing the reconstruction error is the same as maximizing the projected variance. This dual interpretation provides both intuitive understanding and computational convenience.
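For a single unit direction $w$ and centered samples, the equivalence follows from a short decomposition (a sketch of the standard argument, using the reconstruction $\hat{x}_i = w w^\top x_i$):

$$\sum_{i=1}^{m} \left\| x_i - w w^\top x_i \right\|^2 = \sum_{i=1}^{m} \|x_i\|^2 - \sum_{i=1}^{m} \left( w^\top x_i \right)^2$$

The first term is a constant of the data, so minimizing the reconstruction error on the left is the same as maximizing the projected-variance term on the right.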
Starting from the maximum separability perspective:
First, center all samples by subtracting the mean:

$$x_i \leftarrow x_i - \bar{x}, \qquad \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$
Centering ensures the mean is zero, simplifying subsequent calculations.
Compute the sample covariance matrix:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x_i x_i^\top = \frac{1}{m} X X^\top$$

Where $X = (x_1, x_2, \ldots, x_m)$ is the centered data matrix.
The covariance matrix captures how features vary together.
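As a quick sanity check of this formula, the covariance of centered data can be computed directly in NumPy (a minimal sketch; the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))            # d x m: columns are samples
X = X - X.mean(axis=1, keepdims=True)    # center each feature

m = X.shape[1]
Sigma = (X @ X.T) / m                    # (1/m) * X X^T

# Matches NumPy's covariance with bias=True (divide by m, not m-1)
assert np.allclose(Sigma, np.cov(X, bias=True))
```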
The variance of the projected data is:

$$\frac{1}{m} \sum_{i=1}^{m} \left( W^\top x_i \right)^\top \left( W^\top x_i \right) = \operatorname{tr}\!\left( W^\top \Sigma W \right)$$
We maximize this subject to $W^\top W = I$ (the orthonormality constraint):

$$\max_{W} \; \operatorname{tr}\!\left( W^\top \Sigma W \right) \quad \text{s.t.} \quad W^\top W = I$$
Using Lagrange multipliers, the optimal $W$ consists of the eigenvectors of $\Sigma$ corresponding to the largest eigenvalues:

$$\Sigma w_i = \lambda_i w_i, \qquad i = 1, \ldots, d'$$
Select the top $d'$ eigenvectors (principal components) ordered by decreasing eigenvalues.
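For a single direction $w$, the Lagrangian step behind this result looks as follows (a brief sketch of the standard argument):

$$L(w, \lambda) = w^\top \Sigma w + \lambda \left( 1 - w^\top w \right), \qquad \frac{\partial L}{\partial w} = 2 \Sigma w - 2 \lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w$$

The objective then equals $w^\top \Sigma w = \lambda$, so it is maximized by the eigenvector with the largest eigenvalue; repeating the argument under orthogonality to the directions already chosen yields the remaining components.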
The complete PCA algorithm:
Compute the mean $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$ and center each sample: $x_i \leftarrow x_i - \bar{x}$.
Calculate $\Sigma = \frac{1}{m} X X^\top$, where $X$ is the centered data matrix.
Decompose $\Sigma$ and sort the eigenvalues in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$.
Take the top $d'$ eigenvectors to form the projection matrix $W = (w_1, w_2, \ldots, w_{d'})$.
For a new sample $x$, compute its low-dimensional representation: $z = W^\top (x - \bar{x})$.
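The steps above translate directly into a few lines of NumPy. The sketch below is illustrative (the function names are not from any particular library); it takes samples as rows, the usual convention in code, so the covariance is computed as $\frac{1}{m} X^\top X$:

```python
import numpy as np

def pca_fit(X, n_components):
    """X: (m, d) array, one sample per row. Returns (mean, W)."""
    # Step 1: compute the mean and center each sample
    mean = X.mean(axis=0)
    Xc = X - mean

    # Step 2: sample covariance matrix (d x d)
    m = X.shape[0]
    Sigma = (Xc.T @ Xc) / m

    # Step 3: eigendecomposition (eigh, since Sigma is symmetric),
    # then sort eigenvalues in descending order
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]

    # Step 4: top-d' eigenvectors form the projection matrix W (d x d')
    W = eigvecs[:, order[:n_components]]
    return mean, W

def pca_transform(X, mean, W):
    """Step 5: z = W^T (x - mean) for each sample (row) of X."""
    return (X - mean) @ W

# Illustrative usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
mean, W = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, W)
print(Z.shape)  # (200, 2)
```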
Choosing the right number of dimensions is crucial. Here are three common methods:
Specify $d'$ based on downstream task requirements (e.g., visualization needs 2D or 3D, or computational constraints limit dimensions).
Train a downstream model (e.g., a kNN classifier) with different $d'$ values and select the one with the best performance.
This method optimizes for the specific learning task.
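One way to implement this is to put PCA and the classifier in a scikit-learn pipeline and compare cross-validated scores across candidate values of $d'$ (a sketch assuming scikit-learn is available; the digits dataset and the candidate list are placeholders):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

best_d, best_score = None, -np.inf
for d in (2, 5, 10, 20, 30, 40):
    pipe = Pipeline([
        ("pca", PCA(n_components=d)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    score = cross_val_score(pipe, X, y, cv=5).mean()   # 5-fold CV accuracy
    if score > best_score:
        best_d, best_score = d, score

print(f"best d' = {best_d}, CV accuracy = {best_score:.3f}")
```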
Select the smallest $d'$ such that the cumulative variance ratio exceeds a threshold $t$ (typically 0.85 or 0.9):

$$\frac{\sum_{i=1}^{d'} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge t$$
This ensures we retain at least a fraction $t$ of the total variance in the data.
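Given the sorted eigenvalues from the algorithm above, the threshold rule takes a few lines of NumPy (a sketch; the eigenvalues and threshold here are illustrative):

```python
import numpy as np

def choose_dim(eigvals_sorted, t=0.9):
    """Smallest d' whose cumulative variance ratio reaches the threshold t."""
    ratios = np.cumsum(eigvals_sorted) / np.sum(eigvals_sorted)
    return int(np.searchsorted(ratios, t)) + 1

# Eigenvalues sorted in descending order (illustrative values)
eigvals_sorted = np.array([5.0, 2.5, 1.0, 0.3, 0.2])
print(choose_dim(eigvals_sorted, t=0.9))  # -> 3, since (5 + 2.5 + 1) / 9 ≈ 0.94
```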