
Principal Component Analysis

Transform correlated variables into uncorrelated principal components for dimensionality reduction and data visualization

Learning Objectives
Understand the goals and geometric interpretation of PCA
Derive principal components from eigenvalue decomposition
Calculate and interpret variance explained
Apply selection criteria (scree plot, Kaiser)
Distinguish between covariance and correlation PCA
Interpret loadings and scores

What is PCA?

Goals of PCA

Principal Component Analysis (PCA) transforms correlated variables into a set of uncorrelated variables called principal components.

Dimensionality Reduction

Reduce p variables to k components (k < p) while retaining most variance

Data Visualization

Project high-dimensional data onto 2D or 3D for visualization

Mathematical Foundation

Eigenvalue Decomposition
$$\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}^T$$

$\mathbf{P}$: Eigenvector Matrix

Columns are orthonormal eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$

$\boldsymbol{\Lambda}$: Eigenvalue Matrix

Diagonal with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$

Principal Components

i-th Principal Component

$$Y_i = \mathbf{e}_i^T\mathbf{X} = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p$$

Variance of PC

$$\text{Var}(Y_i) = \lambda_i$$

Covariance between PCs

$$\text{Cov}(Y_i, Y_j) = 0 \quad \text{for } i \neq j$$

Variance Explained

Total Variance

$$\sum_{i=1}^p \lambda_i = \text{tr}(\boldsymbol{\Sigma})$$

Proportion by PC$_i$

$$\frac{\lambda_i}{\sum_{j=1}^p \lambda_j}$$

Cumulative Variance (first k components)

$$\frac{\sum_{i=1}^k \lambda_i}{\sum_{j=1}^p \lambda_j}$$
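
To make these quantities concrete, here is a minimal NumPy sketch (the data and its covariance structure are invented for illustration): it eigendecomposes the sample covariance matrix, sorts the eigenvalues in decreasing order, and reports the proportion and cumulative proportion of variance explained.

```python
import numpy as np

# Toy data: 100 observations on 4 correlated variables (illustrative only)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(
    mean=np.zeros(4),
    cov=[[4, 2, 0.5, 0], [2, 3, 0.3, 0], [0.5, 0.3, 1, 0.2], [0, 0, 0.2, 0.5]],
    size=100,
)

S = np.cov(X, rowvar=False)            # sample covariance matrix (p x p)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]      # sort so that lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

prop = eigvals / eigvals.sum()         # proportion of variance per component
cumulative = np.cumsum(prop)           # cumulative variance of the first k components
print(prop.round(3), cumulative.round(3))
```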

Selecting Components

Selection Criteria

Scree Plot

Plot eigenvalues vs component number. Look for the "elbow" where values level off.

Kaiser Criterion

Retain components with $\lambda > 1$ (for correlation matrix PCA)

Cumulative Variance

Retain enough components to explain 70-90% of total variance

Cross-Validation

Use prediction error to select optimal number of components
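
As a small illustration, the sketch below combines the cumulative-variance rule with the Kaiser count in one helper (the `choose_k` function and its arguments are illustrative, not from any library).

```python
import numpy as np

def choose_k(eigvals, target=0.80, correlation_pca=False):
    """Number of components by cumulative variance; Kaiser count if applicable."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k_variance = int(np.searchsorted(cumulative, target) + 1)  # first k reaching target
    k_kaiser = int((eigvals > 1).sum()) if correlation_pca else None
    return k_variance, k_kaiser

# Toy eigenvalues 5, 3, 2 from a covariance matrix:
print(choose_k([5, 3, 2], target=0.75))   # -> (2, None): the first two PCs reach 75%
```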

Covariance vs Correlation PCA

Covariance Matrix PCA

Use when variables are on the same scale. Results depend on measurement units.

Correlation Matrix PCA

Use when variables have different scales. Standardizes all variables first.
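
A quick sketch of why the choice matters (toy data; the rescaling of the first column is artificial): with the covariance matrix the rescaled variable dominates PC1, while the correlation matrix gives each standardized variable comparable weight.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 0] *= 100    # put variable 1 on a much larger scale than the others

cov_vals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
cor_vals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

print((cov_vals / cov_vals.sum()).round(3))  # PC1 share ~ 1.0: dominated by variable 1
print((cor_vals / cor_vals.sum()).round(3))  # roughly equal shares after standardization
```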

Loadings and Scores

Principal Component Loadings

Loadings measure the relationship between original variables and principal components:

Loading of variable j on PC i

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij} = \frac{\text{Cov}(X_j, Y_i)}{\sqrt{\text{Var}(Y_i)}}$$

For Correlation PCA

Loadings equal correlations: $l_{ij} = \text{Corr}(X_j, Y_i)$

Interpretation

High |loading| → variable contributes strongly to that PC

Communality

$$h_j^2 = \sum_{i=1}^k l_{ij}^2$$

Variance of variable j explained by the first k components
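
A minimal NumPy sketch of loadings and communalities for correlation-matrix PCA (toy data; in the code, `loadings[j, i]` corresponds to $l_{ij}$, variable $j$ on PC $i$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated toy data

R = np.corrcoef(X, rowvar=False)              # correlation-matrix PCA
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)         # loadings[j, i] = sqrt(lambda_i) * e_ij
k = 2
communalities = (loadings[:, :k] ** 2).sum(axis=1)  # variance of each variable
print(loadings.round(2))                            # explained by the first k PCs
print(communalities.round(2))
```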

Principal Component Scores

Scores are the values of PCs for each observation:

Score of observation j on PC i

$$y_{ji} = \mathbf{e}_i^T(\mathbf{x}_j - \bar{\mathbf{x}})$$

Properties

  • Mean of scores = 0
  • Variance of scores = $\lambda_i$
  • Scores are uncorrelated across PCs

Uses

  • Visualization (plot PC1 vs PC2)
  • Input for clustering/regression
  • Outlier detection
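
A short sketch (toy data) that computes scores and checks the properties above, zero means and variances equal to the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))   # correlated toy data

x_bar = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = (X - x_bar) @ eigvecs                 # column i holds y_ji = e_i^T (x_j - x_bar)
print(scores.mean(axis=0).round(8))            # ~0 for every PC
print(np.var(scores, axis=0, ddof=1).round(3)) # matches the eigenvalues
print(eigvals.round(3))
```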

Geometric Interpretation

Data Cloud and Principal Axes

PCA finds the principal axes of the data ellipsoid:

First PC Direction

The direction of maximum variance in the data cloud (longest axis of ellipsoid)

Subsequent PCs

Orthogonal to previous PCs, maximizing remaining variance

Rotation Interpretation

PCA rotates the coordinate system to align with the principal axes. The eigenvector matrix $\mathbf{P}$ is an orthogonal rotation matrix.

Optimal Projection

PCA provides the optimal low-rank approximation to the data:

Reconstruction from k components

$$\hat{\mathbf{X}} = \bar{\mathbf{x}} + \sum_{i=1}^k y_i \mathbf{e}_i$$

Eckart-Young Theorem

The first k PCs give the best rank-k approximation to the data in terms of minimizing squared reconstruction error.

Sample PCA

Estimation from Sample Data

In practice, we estimate PCA from the sample covariance matrix $\mathbf{S}$:

Sample Covariance

$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

Sample Correlation

$$\mathbf{R} = \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2}$$

where $\mathbf{D} = \text{diag}(s_{11}, \ldots, s_{pp})$

Sample Eigenvalues & Eigenvectors

Eigendecompose $\mathbf{S}$ (or $\mathbf{R}$) to get sample eigenvalues $\hat{\lambda}_i$ and eigenvectors $\hat{\mathbf{e}}_i$

Large Sample Properties

For large n, sample eigenvalues are consistent estimators:

Consistency

$\hat{\lambda}_i \xrightarrow{p} \lambda_i$ as $n \to \infty$

Asymptotic Variance

For normal data: $\text{Var}(\hat{\lambda}_i) \approx \frac{2\lambda_i^2}{n}$

Caution: Close Eigenvalues

When population eigenvalues are close, sample eigenvectors can be unstable. Consider bootstrapping for inference.
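
A simple way to gauge that variability is to bootstrap the sample eigenvalues; the sketch below (the `bootstrap_eigvals` helper is an illustrative name, not a library function) resamples rows with replacement and reports bootstrap standard errors.

```python
import numpy as np

def bootstrap_eigvals(X, n_boot=500, seed=0):
    """Bootstrap standard errors of the sample eigenvalues of the covariance matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample observations
        S_b = np.cov(X[idx], rowvar=False)
        boot[b] = np.sort(np.linalg.eigvalsh(S_b))[::-1]
    return boot.std(axis=0)                              # one SE per eigenvalue
```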

Applications

Common Applications

Data Visualization

Project high-dimensional data to 2D/3D for exploratory analysis

Noise Reduction

Reconstruct data using only top components to filter noise

Feature Extraction

Use PC scores as input features for machine learning

Multicollinearity

Address multicollinearity in regression using PC scores

Biplots and Visualization

Biplot Construction

A biplot displays both observations (scores) and variables (loadings) in the same plot:

Observation Points

Plot PC scores $(y_{i1}, y_{i2})$ for each observation

Variable Arrows

Draw loadings $(l_{j1}, l_{j2})$ as arrows from origin

Interpretation

  • Arrow length ≈ variable's contribution to displayed PCs
  • Arrow direction shows correlation with PCs
  • Angle between arrows ≈ correlation between variables
  • Project points onto arrows to estimate variable values
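
A minimal matplotlib sketch of a biplot (the `biplot` helper and its arrow `scale` factor are illustrative choices, not a standard API):

```python
import matplotlib.pyplot as plt

def biplot(scores, loadings, var_names, scale=3.0):
    """Observations as points, variables as arrows, on the first two PCs."""
    plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
    for j, name in enumerate(var_names):
        plt.arrow(0, 0, scale * loadings[j, 0], scale * loadings[j, 1],
                  color="red", head_width=0.05)
        plt.text(scale * loadings[j, 0], scale * loadings[j, 1], name, color="red")
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()
```
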
Scree Plot Details

The scree plot helps determine how many components to retain:

What to Plot

Eigenvalues (y-axis) vs component number (x-axis)

Elbow Rule

Retain components before the "elbow" where decline levels off
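
A minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

def scree_plot(eigvals):
    """Eigenvalues against component number; look for the elbow."""
    ks = range(1, len(eigvals) + 1)
    plt.plot(ks, sorted(eigvals, reverse=True), "o-")
    plt.xlabel("Component number")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()
```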

Extensions and Variations

Robust PCA

Standard PCA is sensitive to outliers. Robust alternatives include:

Robust Covariance

Use MCD (Minimum Covariance Determinant) instead of sample covariance

Projection Pursuit

Find directions maximizing robust scale measures

Kernel PCA

For nonlinear dimensionality reduction:

$$K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$$

Apply PCA to the kernel matrix instead of the covariance matrix. Common kernels: RBF (Gaussian), polynomial.

Sparse PCA

For interpretable loadings, add sparsity constraints:

$$\max_{\mathbf{a}} \; \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a} - \lambda\|\mathbf{a}\|_1 \quad \text{s.t.} \quad \|\mathbf{a}\|_2 = 1$$

L1 penalty encourages many loadings to be exactly zero, making components easier to interpret.

Worked Example

Example: 3-Variable PCA

Given correlation matrix:

$$\mathbf{R} = \begin{pmatrix} 1 & 0.8 & 0.2 \\ 0.8 & 1 & 0.3 \\ 0.2 & 0.3 & 1 \end{pmatrix}$$

Eigenvalues

$\lambda_1 \approx 1.93,\ \lambda_2 \approx 0.87,\ \lambda_3 \approx 0.19$

Variance Explained

PC1: ≈64.5%, PC2: ≈29.1%, PC3: ≈6.4%

Decision

Retain 2 components (≈94% cumulative variance). Note that the Kaiser criterion would keep only 1 component here (only $\lambda_1 > 1$), a reminder that different criteria can disagree.
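
You can verify these numbers directly:

```python
import numpy as np

R = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals.round(2))                    # approx [1.93, 0.87, 0.19]
print((eigvals / eigvals.sum()).round(3))  # approx [0.645, 0.291, 0.064]
```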

PCA Implementation Steps

Step-by-Step Algorithm
  1. Center the data: Subtract the mean from each variable: $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$
  2. Optionally standardize: Divide by the standard deviation for correlation-based PCA
  3. Compute the covariance/correlation matrix: $\mathbf{S}$ or $\mathbf{R}$
  4. Eigendecomposition: Find eigenvalues and eigenvectors
  5. Sort: Order by decreasing eigenvalue
  6. Select components: Choose k components based on the criteria above
  7. Project data: Compute scores $\mathbf{Y} = \tilde{\mathbf{X}}\mathbf{P}_k$ from the centered (or standardized) data
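
A minimal NumPy implementation of these steps (the `pca` function signature is an illustrative choice):

```python
import numpy as np

def pca(X, k, standardize=False):
    """Return scores, the k retained directions, and all eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # 1. center
    if standardize:                              # 2. optional standardization
        Xc = Xc / Xc.std(axis=0, ddof=1)
    S = np.cov(Xc, rowvar=False)                 # 3. covariance (correlation if standardized)
    eigvals, eigvecs = np.linalg.eigh(S)         # 4. eigendecomposition
    order = np.argsort(eigvals)[::-1]            # 5. sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P_k = eigvecs[:, :k]                         # 6. keep k components
    scores = Xc @ P_k                            # 7. project the centered data
    return scores, P_k, eigvals
```
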
Data Preprocessing Decisions

When to Standardize

  • Variables on different scales
  • Variance differences are artificial
  • Want equal variable contribution

When to Use Raw Data

  • Variables on same scale
  • Variance differences are meaningful
  • Want high-variance variables to dominate

Reconstruction and Approximation

Data Reconstruction

Reconstruct original data from k components:

$$\hat{\mathbf{X}} = \mathbf{Y}_k\mathbf{P}_k^T + \bar{\mathbf{X}}$$

Reconstruction Error

$$\frac{1}{n-1}\|\mathbf{X} - \hat{\mathbf{X}}\|^2 = \sum_{j=k+1}^p \hat{\lambda}_j$$

Best Approximation

PCA gives optimal rank-k approximation (Eckart-Young theorem)
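
A NumPy sketch (toy data) that reconstructs from k components and checks that the average squared reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
P_k = eigvecs[:, :k]
X_hat = Xc @ P_k @ P_k.T + x_bar                # reconstruction from k components

avg_sq_error = ((X - X_hat) ** 2).sum() / (X.shape[0] - 1)
print(round(avg_sq_error, 6), round(eigvals[k:].sum(), 6))   # the two values agree
```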

SVD Connection

PCA is closely related to Singular Value Decomposition:

$$\tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^T$$

Relationship

The columns of $\mathbf{V}$ are the PC directions (eigenvectors), and the singular values give the eigenvalues via $\lambda_i = d_i^2/(n-1)$ for centered data

Advantage of SVD

Numerically stable, works for n < p
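
A short numerical check of the connection (toy data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))   # correlated toy data

Xc = X - X.mean(axis=0)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

eig_from_svd = d ** 2 / (Xc.shape[0] - 1)                # lambda_i = d_i^2 / (n - 1)
eig_direct = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(eig_from_svd, eig_direct))             # True

scores = U * d                                           # equivalently Xc @ Vt.T
```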

Interpretation Guidelines

Reading Loading Matrices

High Positive Loading

Variable increases as PC score increases

High Negative Loading

Variable decreases as PC score increases

Near-Zero Loading

Variable contributes little to that PC

Naming Components

Label based on high-loading variables' common theme

Rule of thumb: Focus on loadings with absolute value > 0.3 or 0.4 for interpretation.

Score Interpretation

High PC Score

Observation is above average on variables with positive loadings

Low PC Score

Observation is below average on variables with positive loadings

Outlier Detection

Extreme scores may indicate unusual observations. Check last PCs for orthogonal outliers.

Limitations and Alternatives

PCA Limitations

Linear Only

Cannot capture nonlinear relationships

Sensitive to Outliers

Outliers can distort principal components

Scale Dependent

Results change with variable scaling

Unsupervised

Doesn't consider response variable

When to Use Alternatives

Nonlinear Data → Kernel PCA

Maps to higher-dimensional space via kernel

Latent Factors → Factor Analysis

When you believe latent constructs cause correlations

Supervised → PLS

Partial Least Squares considers response

Non-negative Data → NMF

Non-negative Matrix Factorization for counts/images

PCA vs Other Methods

Method Comparison

PCA vs Factor Analysis

PCA is descriptive (maximize variance); FA is model-based (latent factors)

PCA vs ICA

ICA seeks statistically independent components, not just uncorrelated

PCA vs t-SNE

t-SNE preserves local structure for visualization; PCA preserves global variance

PCA vs MDS

MDS preserves distances; classical MDS with Euclidean distance equals PCA

When to Choose PCA

Use PCA When

  • Linear relationships dominate
  • Need interpretable components
  • Want to reduce multicollinearity
  • Preprocessing for other methods

Consider Alternatives When

  • Data is manifold-structured (use UMAP)
  • Need independent signals (use ICA)
  • Seeking latent constructs (use FA)
  • Non-negative data (use NMF)

Real-World Applications

Application Domains

Image Processing

Face recognition (Eigenfaces), image compression, noise reduction

Finance

Portfolio risk analysis, yield curve modeling, factor investing

Genomics

Population structure analysis, gene expression studies

Signal Processing

EEG/MEG analysis, speech recognition preprocessing

Case Study: Image Compression

Using PCA for image compression:

  1. Treat each row (or patch) as observation vector
  2. Compute PCA on image data matrix
  3. Keep top k components to achieve target compression ratio
  4. Reconstruct image: $\hat{\mathbf{X}} \approx \bar{\mathbf{X}} + \mathbf{Y}_k\mathbf{P}_k^T$

Result: Often 90%+ variance captured with 10-20% of components, enabling significant compression with minimal quality loss.
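
A hedged sketch of the idea via SVD, with random numbers standing in for a real grayscale image (so the variance retained will differ from the figures quoted above):

```python
import numpy as np

rng = np.random.default_rng(6)
img = rng.random((256, 256))        # placeholder for a real grayscale image array

row_mean = img.mean(axis=0)
U, d, Vt = np.linalg.svd(img - row_mean, full_matrices=False)

k = 40                                                   # keep the top-k components
img_hat = (U[:, :k] * d[:k]) @ Vt[:k] + row_mean         # compressed reconstruction

kept = (d[:k] ** 2).sum() / (d ** 2).sum()
print(f"{kept:.1%} of variance kept with {k} of {len(d)} components")
```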

Software Implementation

Common Implementations

Python (scikit-learn)

sklearn.decomposition.PCA

R

prcomp(), princomp()

MATLAB

pca(), pcacov()

Julia

MultivariateStats.fit(PCA, ...)

Note: Most implementations use SVD internally for numerical stability. Check documentation for centering/scaling options.
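
A minimal scikit-learn example (toy data; standardizing first makes this correlation-matrix PCA):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

X_std = StandardScaler().fit_transform(X)   # standardize -> correlation-matrix PCA
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # PC scores for each observation

print(pca.explained_variance_ratio_)        # proportion of variance per component
print(pca.components_)                      # rows are the PC directions (eigenvectors)
```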

Kernel PCA

Nonlinear Extension

Kernel PCA captures nonlinear patterns by implicitly mapping data to high-dimensional space:

$$\phi: \mathbb{R}^p \to \mathcal{H}, \quad k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$$

RBF Kernel

$$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right)$$

Polynomial Kernel

$$k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + c)^d$$
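
A minimal scikit-learn sketch on the classic concentric-circles toy data, where linear PCA cannot separate the rings but an RBF kernel can (the `gamma=10` value is just an illustrative choice):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
Z = kpca.fit_transform(X)    # in the kernel PC space the two rings become separable
```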

Sparse PCA

Interpretable Components

Sparse PCA adds L1 penalty to produce components with many zero loadings:

$$\max_{\mathbf{w}} \; \mathbf{w}^T\mathbf{S}\mathbf{w} - \lambda\|\mathbf{w}\|_1 \quad \text{s.t.} \quad \|\mathbf{w}\|_2 = 1$$

Advantage

Easier interpretation; automatic variable selection

Trade-off

May explain slightly less variance than standard PCA
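
scikit-learn's SparsePCA implements a related L1-penalized formulation (a dictionary-learning variant rather than exactly the constrained problem above); a minimal sketch with toy data:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))   # correlated toy data
X = X - X.mean(axis=0)

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)  # alpha controls sparsity
spca.fit(X)
print(spca.components_)     # many loadings are exactly zero
```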

Practice Quiz

  1. The first principal component is the linear combination that:
  2. Principal components are derived from the eigenvalue decomposition of:
  3. If $\lambda_1 = 5, \lambda_2 = 3, \lambda_3 = 2$, the proportion of variance explained by PC1 is:
  4. The Kaiser criterion suggests retaining components with eigenvalues:
  5. Principal components are always:
  6. When should you use the correlation matrix instead of the covariance matrix for PCA?
  7. The total variance in PCA equals:
  8. A scree plot shows:
  9. PCA loadings represent:
  10. If we retain $k$ components from $p$ variables, the dimension reduction ratio is:

FAQ

What's the difference between PCA and Factor Analysis?

PCA creates components that are linear combinations of all variables to maximize variance. Factor Analysis posits latent factors that cause observed correlations. PCA is more descriptive; FA is model-based.

Does PCA require normality?

No, PCA doesn't require normality—it's based on variances and covariances. However, normality helps for inference and outlier detection.

How do I interpret loadings?

Loadings show the correlation between original variables and PCs. High absolute loadings indicate strong contribution to that component.

Can PCA handle missing data?

Standard PCA requires complete data. Options include: listwise deletion, imputation before PCA, or specialized methods like probabilistic PCA.
