Transform correlated variables into uncorrelated principal components for dimensionality reduction and data visualization
Principal Component Analysis (PCA) transforms correlated variables into a set of uncorrelated variables called principal components.
Dimensionality Reduction
Reduce p variables to k components (k < p) while retaining most variance
Data Visualization
Project high-dimensional data onto 2D or 3D for visualization
$V$: Eigenvector Matrix
Columns are the orthonormal eigenvectors of the covariance matrix $\Sigma$, so that $\Sigma = V \Lambda V^T$
$\Lambda$: Eigenvalue Matrix
Diagonal with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$
i-th Principal Component
$Z_i = v_i^T X$, where $v_i$ is the i-th column of $V$
Variance of PC
$\mathrm{Var}(Z_i) = \lambda_i$
Covariance between PCs
$\mathrm{Cov}(Z_i, Z_j) = 0$ for $i \ne j$
Total Variance
$\sum_{i=1}^{p} \lambda_i = \mathrm{tr}(\Sigma)$
Proportion by PC
$\lambda_i \big/ \sum_{j=1}^{p} \lambda_j$
Cumulative Variance (first k components)
$\sum_{i=1}^{k} \lambda_i \big/ \sum_{j=1}^{p} \lambda_j$
Scree Plot
Plot eigenvalues vs component number. Look for the "elbow" where values level off.
Kaiser Criterion
Retain components with $\lambda_i > 1$ (for correlation matrix PCA)
Cumulative Variance
Retain enough components to explain 70-90% of total variance
Cross-Validation
Use prediction error to select optimal number of components
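A minimal sketch of the Kaiser and cumulative-variance rules with scikit-learn, applied to a synthetic standardized data matrix (all data here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data matrix: 200 observations, 10 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

# Correlation-matrix PCA: standardize first.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

eigenvalues = pca.explained_variance_            # sample eigenvalues
cum_var = np.cumsum(pca.explained_variance_ratio_)

k_kaiser = int(np.sum(eigenvalues > 1))          # Kaiser criterion (correlation PCA)
k_90 = int(np.argmax(cum_var >= 0.90)) + 1       # smallest k reaching 90% variance

print("Kaiser criterion keeps", k_kaiser, "components")
print("90% cumulative variance needs", k_90, "components")
```

Note that `PCA(n_components=0.90)` selects the 90%-variance cut directly if that is the only rule you need.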
Covariance Matrix PCA
Use when variables are on same scale. Results depend on measurement units.
Correlation Matrix PCA
Use when variables have different scales. Standardizes all variables first.
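A short illustration of the difference on synthetic data; scikit-learn's PCA works on the covariance of whatever it is given, so correlation-matrix PCA is obtained by standardizing first:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] *= 1000            # one variable measured in much larger units

pca_cov = PCA().fit(X)                                   # covariance-matrix PCA
pca_corr = PCA().fit(StandardScaler().fit_transform(X))  # correlation-matrix PCA

# Covariance PCA: PC1 is dominated by the large-scale variable.
print(pca_cov.explained_variance_ratio_)
# Correlation PCA: all variables contribute on an equal footing.
print(pca_corr.explained_variance_ratio_)
```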
Loadings measure the relationship between original variables and principal components:
Loading of variable j on PC i
$\ell_{ji} = v_{ji} \sqrt{\lambda_i}$
For Correlation PCA
Loadings equal correlations: $\ell_{ji} = \mathrm{Corr}(X_j, Z_i)$
Interpretation
High |loading| → variable contributes strongly to that PC
Communality
Variance of variable j explained by the first k components: $h_j^2 = \sum_{i=1}^{k} \ell_{ji}^2$
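A sketch of computing loadings and communalities from a fitted scikit-learn PCA on standardized synthetic data; `components_` holds the eigenvectors (as rows) and `explained_variance_` the eigenvalues, so with standardized inputs the loadings approximate the variable-PC correlations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)

# loadings[j, i] = eigenvector_ji * sqrt(lambda_i)   (variables x components)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

k = 2
communalities = (loadings[:, :k] ** 2).sum(axis=1)  # variance of each variable
                                                    # explained by the first k PCs
print(np.round(loadings[:, :k], 2))
print(np.round(communalities, 2))
```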
Scores are the values of PCs for each observation:
Score of observation j on PC i: $z_{ji} = v_i^T (x_j - \bar{x})$
Properties
Scores are centered at zero, have variance $\lambda_i$ on PC i, and are uncorrelated across different PCs
Uses
Plotting observations, detecting outliers, and serving as inputs to regression or clustering
PCA finds the principal axes of the data ellipsoid:
First PC Direction
The direction of maximum variance in the data cloud (longest axis of ellipsoid)
Subsequent PCs
Orthogonal to previous PCs, maximizing remaining variance
Rotation Interpretation
PCA rotates the coordinate system to align with the principal axes. The eigenvector matrix is an orthogonal rotation matrix.
PCA provides the optimal low-rank approximation to the data:
Reconstruction from k components
$\hat{x}_j = \bar{x} + \sum_{i=1}^{k} z_{ji} v_i$
Eckart-Young Theorem
The first k PCs give the best rank-k approximation to the data in terms of minimizing squared reconstruction error.
In practice, we estimate PCA from the sample covariance matrix $S$:
Sample Covariance
$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$
Sample Correlation
$R = D^{-1/2} S D^{-1/2}$
where $D = \mathrm{diag}(s_{11}, \dots, s_{pp})$ contains the sample variances
Sample Eigenvalues & Eigenvectors
Eigendecompose $S$ (or $R$) to get sample eigenvalues $\hat{\lambda}_i$ and eigenvectors $\hat{v}_i$
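A minimal NumPy sketch of the sample computation on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]

Xc = X - X.mean(axis=0)                # center the data
S = Xc.T @ Xc / (n - 1)                # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                  # PC scores
print(eigvals)                         # sample eigenvalues = PC variances
print(np.round(np.cov(scores.T), 3))   # ~diagonal: PCs are uncorrelated
```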
For large n, sample eigenvalues are consistent estimators:
Consistency
$\hat{\lambda}_i \xrightarrow{p} \lambda_i$ as $n \to \infty$
Asymptotic Variance
For normal data with distinct eigenvalues: $\sqrt{n}\,(\hat{\lambda}_i - \lambda_i) \xrightarrow{d} N(0,\ 2\lambda_i^2)$
Caution: Close Eigenvalues
When population eigenvalues are close, sample eigenvectors can be unstable. Consider bootstrapping for inference.
Data Visualization
Project high-dimensional data to 2D/3D for exploratory analysis
Noise Reduction
Reconstruct data using only top components to filter noise
Feature Extraction
Use PC scores as input features for machine learning
Multicollinearity
Address multicollinearity in regression using PC scores
A biplot displays both observations (scores) and variables (loadings) in the same plot:
Observation Points
Plot PC scores for each observation
Variable Arrows
Draw loadings as arrows from origin
Interpretation
Small angles between arrows indicate positively correlated variables; arrow length shows how well a variable is represented in the plotted PCs; observations lying far along an arrow's direction score high on that variable
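One possible matplotlib sketch of a basic biplot (synthetic data; the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
names = ["x1", "x2", "x3", "x4"]

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)   # observation points
scale = np.abs(scores).max()                               # stretch arrows for visibility
for j, name in enumerate(names):
    plt.arrow(0, 0, scale * loadings[j, 0], scale * loadings[j, 1],
              color="red", head_width=0.05 * scale)        # variable arrows
    plt.text(scale * loadings[j, 0], scale * loadings[j, 1], name, color="red")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```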
The scree plot helps determine how many components to retain:
What to Plot
Eigenvalues (y-axis) vs component number (x-axis)
Elbow Rule
Retain components before the "elbow" where decline levels off
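A short matplotlib sketch of a scree plot on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

pca = PCA().fit(X)
eigenvalues = pca.explained_variance_

plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot: look for the elbow")
plt.show()
```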
Standard PCA is sensitive to outliers. Robust alternatives include:
Robust Covariance
Use MCD (Minimum Covariance Determinant) instead of sample covariance
Projection Pursuit
Find directions maximizing robust scale measures
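A sketch of the robust-covariance route using scikit-learn's MinCovDet, shown on synthetic data with a few planted outliers; this is one option among several:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
X[:5] += 50                       # a few gross outliers

mcd = MinCovDet(random_state=0).fit(X)
robust_cov = mcd.covariance_      # MCD estimate, resistant to the outliers

eigvals, eigvecs = np.linalg.eigh(robust_cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                                 # robust eigenvalues
print(np.linalg.eigvalsh(np.cov(X.T))[::-1])   # classical eigenvalues, inflated by outliers
```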
For nonlinear dimensionality reduction:
Eigendecompose the (centered) kernel matrix instead of the covariance matrix. Common kernels: RBF (Gaussian), polynomial.
For interpretable loadings, add sparsity constraints:
L1 penalty encourages many loadings to be exactly zero, making components easier to interpret.
Given a correlation matrix for three variables:
Eigenvalues
$\hat{\lambda}_1 = 1.95$, $\hat{\lambda}_2 = 0.75$, $\hat{\lambda}_3 = 0.30$ (implied by the proportions below, since the eigenvalues of a $3 \times 3$ correlation matrix sum to 3)
Variance Explained
PC1: 65%, PC2: 25%, PC3: 10%
Decision
Retain 2 components (90% cumulative variance). The Kaiser criterion ($\lambda_i > 1$, applicable only to correlation-matrix PCA) would keep just PC1 here, a reminder that the selection rules can disagree.
When to Standardize
Variables are measured in different units or have very different variances; otherwise the highest-variance variable dominates PC1
When to Use Raw Data
Variables share a common scale and the differences in their variances are themselves meaningful (e.g., pixel intensities)
Reconstruct original data from k components:
$\hat{x}_j = \bar{x} + \sum_{i=1}^{k} z_{ji} v_i$
Reconstruction Error
The average squared reconstruction error equals the sum of the discarded eigenvalues, $\sum_{i=k+1}^{p} \lambda_i$
Best Approximation
PCA gives optimal rank-k approximation (Eckart-Young theorem)
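A sketch of rank-k reconstruction with scikit-learn's inverse_transform on synthetic data, checking the error against the sum of discarded eigenvalues:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

k = 2
pca = PCA(n_components=k).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))   # rank-k reconstruction

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # mean squared reconstruction error

full = PCA().fit(X)
discarded = full.explained_variance_[k:].sum()    # sum of discarded eigenvalues
print(mse, discarded)                             # agree up to an (n-1)/n factor
```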
PCA is closely related to Singular Value Decomposition:
Relationship
If the centered data matrix has SVD $X_c = U D V^T$, the columns of $V$ are the PC directions, the eigenvalues are $\lambda_i = d_i^2/(n-1)$, and the PC scores are $U D$
Advantage of SVD
Numerically stable, works for n < p
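A NumPy sketch verifying the correspondence between the two routes on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 5))
n = X.shape[0]
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the sample covariance
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]

# Route 2: SVD of the centered data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = d ** 2 / (n - 1)             # lambda_i = d_i^2 / (n - 1)

print(np.allclose(eigvals, eigvals_svd))   # True: both give the same eigenvalues
# Rows of Vt are the PC directions; U * d gives the PC scores.
```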
High Positive Loading
Variable increases as PC score increases
High Negative Loading
Variable decreases as PC score increases
Near-Zero Loading
Variable contributes little to that PC
Naming Components
Label based on high-loading variables' common theme
Rule of thumb: Focus on loadings with absolute value > 0.3 or 0.4 for interpretation.
High PC Score
Observation is above average on variables with positive loadings
Low PC Score
Observation is below average on variables with positive loadings
Outlier Detection
Extreme scores may indicate unusual observations. Check last PCs for orthogonal outliers.
Linear Only
Cannot capture nonlinear relationships
Sensitive to Outliers
Outliers can distort principal components
Scale Dependent
Results change with variable scaling
Unsupervised
Doesn't consider response variable
Nonlinear Data → Kernel PCA
Maps to higher-dimensional space via kernel
Latent Factors → Factor Analysis
When you believe latent constructs cause correlations
Supervised → PLS
Partial Least Squares considers response
Non-negative Data → NMF
Non-negative Matrix Factorization for counts/images
PCA vs Factor Analysis
PCA is descriptive (maximize variance); FA is model-based (latent factors)
PCA vs ICA
ICA seeks statistically independent components, not just uncorrelated
PCA vs t-SNE
t-SNE preserves local structure for visualization; PCA preserves global variance
PCA vs MDS
MDS preserves distances; classical MDS with Euclidean distance equals PCA
Use PCA When
Relationships among variables are roughly linear, the variables are continuous and correlated, and the goal is variance-preserving dimension reduction, decorrelation, or visualization
Consider Alternatives When
Relationships are nonlinear (kernel PCA, t-SNE), a response variable should guide the projection (PLS), you posit latent constructs (factor analysis), or the data are non-negative counts or images (NMF)
Image Processing
Face recognition (Eigenfaces), image compression, noise reduction
Finance
Portfolio risk analysis, yield curve modeling, factor investing
Genomics
Population structure analysis, gene expression studies
Signal Processing
EEG/MEG analysis, speech recognition preprocessing
Using PCA for image compression:
Result: Often 90%+ variance captured with 10-20% of components, enabling significant compression with minimal quality loss.
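A sketch of the idea on scikit-learn's built-in 8x8 digit images, treating each image as a row and each pixel as a variable; the component count k is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # 1797 images, 64 pixel variables each

k = 10                                      # keep ~16% of the 64 components
pca = PCA(n_components=k).fit(X)
X_compressed = pca.transform(X)             # store scores + loadings + mean
X_restored = pca.inverse_transform(X_compressed)

print("Variance captured:", pca.explained_variance_ratio_.sum())
print("Original values:", X.size,
      "vs compressed:", X_compressed.size + pca.components_.size + pca.mean_.size)
```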
Python (scikit-learn)
sklearn.decomposition.PCA
R
prcomp(), princomp()
MATLAB
pca(), pcacov()
Julia
MultivariateStats.fit(PCA, ...)
Note: Most implementations use SVD internally for numerical stability. Check documentation for centering/scaling options.
Kernel PCA captures nonlinear patterns by implicitly mapping data to high-dimensional space:
RBF Kernel
$k(x, y) = \exp\!\big(-\gamma \|x - y\|^2\big)$
Polynomial Kernel
$k(x, y) = (x^T y + c)^d$
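A scikit-learn sketch contrasting linear PCA with RBF kernel PCA on data that linear PCA cannot untangle; the gamma value is an illustrative choice:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: no linear projection separates them.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In the kernel-PCA coordinates the two circles become roughly linearly
# separable; plain PCA only rotates the original plane.
print(linear[:3])
print(rbf[:3])
```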
Sparse PCA adds L1 penalty to produce components with many zero loadings:
Advantage
Easier interpretation; automatic variable selection
Trade-off
May explain slightly less variance than standard PCA
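A scikit-learn sketch contrasting standard PCA loadings with SparsePCA loadings; the alpha value (L1 penalty strength) is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = StandardScaler().fit_transform(rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8)))

dense = PCA(n_components=3).fit(X)
sparse = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

# Larger alpha -> stronger L1 penalty -> more exactly-zero loadings.
print(np.round(dense.components_, 2))    # every variable loads on every component
print(np.round(sparse.components_, 2))   # many loadings driven to exactly zero
```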
PCA creates components that are linear combinations of all variables to maximize variance. Factor Analysis posits latent factors that cause observed correlations. PCA is more descriptive; FA is model-based.
No, PCA doesn't require normality—it's based on variances and covariances. However, normality helps for inference and outlier detection.
For correlation-matrix PCA, loadings are the correlations between the original variables and the PCs. High absolute loadings indicate strong contribution to that component.
Standard PCA requires complete data. Options include: listwise deletion, imputation before PCA, or specialized methods like probabilistic PCA.
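A sketch of the impute-then-PCA option using scikit-learn's SimpleImputer; mean imputation is the simplest choice and is shown only for illustration, on synthetic data with planted missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X[rng.random(X.shape) < 0.1] = np.nan        # ~10% missing values

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill gaps first
scores = PCA(n_components=2).fit_transform(X_imputed)        # then run PCA as usual
print(scores[:3])
```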