
Dimensionality Reduction

Master PCA, t-SNE, LDA, and autoencoders for feature extraction and visualization

Main Dimensionality Reduction Methods

PCA (Principal Component Analysis)

Linear, unsupervised. Finds directions of maximum variance.

\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \bar{\mathbf{x}})

✓ Fast, generalizes to new data | ✗ Linear only

t-SNE

Non-linear, for visualization. Preserves local structure.

✓ Beautiful visualizations, reveals clusters | ✗ Slow, doesn't generalize

LDA (Linear Discriminant Analysis)

Supervised. Maximizes class separation.

✓ Uses labels, good for classification | ✗ Max k=C-1 components

Autoencoders

Deep learning. Encoder-decoder with bottleneck.

✓ Non-linear, flexible | ✗ Requires training, complexity

Curse of Dimensionality

Why High Dimensions Are Problematic

Volume of Unit Hypersphere

V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} r^d

As d → ∞, nearly all volume concentrates near the surface! Center becomes "empty".

Distance Concentration

For random points in d-dimensional unit cube:

\mathbb{E}[\|\mathbf{x} - \mathbf{y}\|^2] = \frac{d}{6}, \qquad \text{Var}(\|\mathbf{x} - \mathbf{y}\|^2) = O(d)

Relative variance: Var/E² = O(1/d) → 0. All distances become nearly equal!
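
A quick NumPy simulation makes the concentration effect concrete (a minimal sketch; the 1000 point pairs and the random seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    x = rng.random((1000, d))                    # uniform points in the unit cube
    y = rng.random((1000, d))
    sq_dist = np.sum((x - y) ** 2, axis=1)       # squared Euclidean distances
    mean, var = sq_dist.mean(), sq_dist.var()
    # Mean tracks d/6; relative spread Var/E^2 shrinks roughly like 1/d.
    print(f"d={d:4d}  E={mean:7.2f} (theory {d/6:.2f})  Var/E^2={var / mean**2:.4f}")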

Sample Complexity

To maintain same sample density:

n \propto \rho^d

For ρ=10 samples per dimension: d=2 needs 100 samples, d=10 needs 10 billion!

PCA: Complete Mathematical Derivation

Variance Maximization via Eigendecomposition

Problem Setup

Given: data matrix X ∈ R^(n×d) (rows are samples). Find a unit vector w ∈ R^d such that the projected data Xw has maximum variance:

\max_{\mathbf{w}} \text{Var}(\mathbf{X}\mathbf{w}) \quad \text{subject to } \|\mathbf{w}\| = 1

Derivation Step-by-Step

Step 1: Center data

X_centered = X - mean(X)

Step 2: Express variance

\text{Var}(\mathbf{X}\mathbf{w}) = \frac{1}{n}\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{w}^T\mathbf{C}\mathbf{w}

where C = (1/n)X^T X is the covariance matrix of the centered data

Step 3: Lagrangian with constraint ||w||=1

\mathcal{L} = \mathbf{w}^T\mathbf{C}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)

Step 4: Take derivative and set to zero

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{C}\mathbf{w} - 2\lambda\mathbf{w} = 0
\mathbf{C}\mathbf{w} = \lambda\mathbf{w}

So w is an eigenvector of C and λ is the corresponding eigenvalue!

Step 5: Variance equals eigenvalue

\text{Var}(\mathbf{X}\mathbf{w}) = \mathbf{w}^T\mathbf{C}\mathbf{w} = \mathbf{w}^T(\lambda\mathbf{w}) = \lambda

Conclusion: Choose eigenvectors with largest eigenvalues to maximize variance!
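
A minimal NumPy sketch of this derivation (toy Gaussian data; n = 200, d = 5, and k = 2 are illustrative choices): center the data, eigendecompose C, and check that each projected variance equals its eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy data: n=200 samples, d=5 features

X_centered = X - X.mean(axis=0)            # Step 1: center
C = X_centered.T @ X_centered / len(X)     # covariance matrix C = (1/n) X^T X

eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                         # top-k principal directions
Z = X_centered @ W                         # projected data, shape (n, k)

print(np.allclose(Z.var(axis=0), eigvals[:k]))   # Step 5: Var(Xw) = eigenvalue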

Reconstruction Error Equivalence

Theorem: Maximizing variance ⇔ Minimizing reconstruction error

\min_{\mathbf{W}} \sum_i \|\mathbf{x}_i - \mathbf{W}\mathbf{W}^T\mathbf{x}_i\|^2

PCA finds best k-dimensional linear subspace in least-squares sense.

SVD Connection

Singular Value Decomposition: X = UΣV^T

  • Right singular vectors (columns of V) are the principal components
  • Singular values σ_i give the eigenvalues: λ_i = σ_i²/n
  • No need to compute C explicitly; more numerically stable

In practice, use SVD for PCA computation
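
A minimal sketch of the SVD route on made-up data: the right singular vectors give the principal directions and σ_i²/n recovers the eigenvalues, without ever forming C.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                            # columns = top-2 principal components
Z = Xc @ W                              # projected data
eigvals_svd = S**2 / len(X)             # eigenvalues of C from singular values

C = Xc.T @ Xc / len(X)                  # cross-check against eigendecomposition
eigvals_c = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(eigvals_svd[:2], eigvals_c[:2]))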

t-SNE: Cost Function & Gradient

Probabilistic Neighborhood Preservation

High-Dimensional Similarities

Gaussian similarity (conditional probability):

p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}

Symmetric version: p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

Low-Dimensional Similarities (Student-t)

q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}

Why Student-t (heavy-tailed)? It allows moderate distances in high-D to map to larger distances in low-D, alleviating the crowding problem.

Cost Function (KL Divergence)

C = \text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Minimize difference between high-D and low-D probability distributions

Gradient for Optimization

\frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}

Optimized using gradient descent with momentum. Complexity: O(n²) per iteration.
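
A usage sketch with scikit-learn's TSNE; the synthetic blob data and the hyperparameters are illustrative, and note that sklearn defaults to the Barnes-Hut approximation (O(n log n)) rather than the exact O(n²) gradient above.

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=500, n_features=50, centers=5, random_state=0)

tsne = TSNE(n_components=2, perplexity=30,      # perplexity controls the sigma_i above
            init="pca", random_state=0)
Y = tsne.fit_transform(X)                       # (500, 2) embedding
print(Y.shape)                                  # no transform() for new points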

Kernel PCA: Non-linear Extension

PCA in Feature Space via Kernel Trick

Idea

Apply PCA in high-dimensional feature space φ(x) without explicit mapping:

\mathbf{x} \xrightarrow{\phi} \phi(\mathbf{x}) \xrightarrow{\text{PCA}} \text{low-dim}

Kernel Matrix Formulation

In feature space, the covariance is C = (1/n)Φ^T Φ, where Φ ∈ R^(n×D) has rows φ(x_1)^T, ..., φ(x_n)^T

Principal components: Cv = λv becomes:

\frac{1}{n}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{v} = \lambda\mathbf{v}

Multiplying on the left by Φ: (1/n)ΦΦ^T(Φv) = λ(Φv)

Kernel Matrix:

\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T, \quad K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)

Solve eigenvalue problem: Kα = nλα, then v = Φ^T α

Projection of New Point

z = \mathbf{v}^T\phi(\mathbf{x}_{\text{new}}) = \boldsymbol{\alpha}^T\boldsymbol{\Phi}\phi(\mathbf{x}_{\text{new}}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x}_{\text{new}})

Only kernel evaluations needed! Never compute φ explicitly.
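
A usage sketch with scikit-learn's KernelPCA, which solves the Kα = nλα problem internally; the concentric-circles data and the RBF gamma value are illustrative choices.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)          # non-linear structure becomes linearly separable

Z_new = kpca.transform(X[:5])      # new points via sum_i alpha_i k(x_i, x_new)
print(Z.shape, Z_new.shape)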

LDA: Fisher's Linear Discriminant

Maximizing Class Separability

Scatter Matrices

Within-class scatter:

S_W = \sum_{c=1}^{C} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^T

Measures spread within each class

Between-class scatter:

S_B = \sum_{c=1}^{C} n_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^T

Measures separation between class means

Fisher's Criterion

Find projection w that maximizes:

J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}

Ratio of between-class to within-class variance

Solution via Generalized Eigenvalue Problem

Taking derivative and setting to zero:

S_B \mathbf{w} = \lambda S_W \mathbf{w}

If S_W invertible:

S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}

Solution: Eigenvectors of S_W^(-1)S_B with largest eigenvalues

Dimensionality Limit

Theorem: LDA can extract at most C-1 features (C = number of classes)

Proof sketch: S_B is a sum of C rank-1 matrices, and the deviations μ_c − μ satisfy one linear constraint (Σ_c n_c(μ_c − μ) = 0), so rank(S_B) ≤ C−1
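
A usage sketch with scikit-learn's LinearDiscriminantAnalysis on the Iris data (C = 3 classes, so at most 2 components), consistent with the limit above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)   # n_components must be <= C-1
Z = lda.fit_transform(X, y)                        # supervised: labels required
print(Z.shape)                                     # (150, 2)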

Autoencoders for Non-linear Dimensionality Reduction

Neural Network-Based Dimensionality Reduction

Architecture

Encoder: x → h = f_encoder(x; θ_enc)
Decoder: h → x̂ = f_decoder(h; θ_dec)

\min_{\theta_{\text{enc}}, \theta_{\text{dec}}} \sum_i \|\mathbf{x}_i - f_{\text{dec}}(f_{\text{enc}}(\mathbf{x}_i))\|^2
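
A minimal PyTorch sketch of this architecture, assuming 784-dimensional inputs (e.g. flattened MNIST) and a 32-dimensional bottleneck; the layer sizes, optimizer settings, and random stand-in batch are illustrative.

import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # squared reconstruction error, as above

x = torch.rand(64, 784)                     # stand-in batch; replace with real data
for step in range(100):
    x_hat = model(x)                        # x -> h -> x_hat
    loss = loss_fn(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(x)                      # 32-dimensional representations
print(codes.shape)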

Relation to PCA

Theorem: A linear autoencoder (no activation functions) trained with MSE loss learns the same subspace as PCA

  • Encoder W_enc, decoder W_dec
  • At optimum: span(W_enc) = span(first k eigenvectors of C)
  • Non-linear activations → can learn non-linear manifolds

Variational Autoencoder (VAE)

Probabilistic version: encode to distribution, not point

\mathcal{L} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))

Learns smooth latent space, enables generation of new samples!

Manifold Learning Theory

Non-linear Methods: ISOMAP, LLE, UMAP

Manifold Hypothesis

Assumption: High-dimensional data lies on or near a low-dimensional manifold embedded in high-D space

Example: Images of faces vary smoothly with pose, lighting → lie on low-D manifold in pixel space

ISOMAP: Geodesic Distances

  1. Build k-NN graph
  2. Compute shortest path distances (geodesic) using Dijkstra/Floyd-Warshall
  3. Apply classical MDS on geodesic distance matrix

Preserves intrinsic geometry of manifold, not Euclidean distance

LLE: Local Linear Embedding

Step 1: Find weights W_ij that reconstruct x_i from neighbors:

\min_W \sum_i \left\| \mathbf{x}_i - \sum_{j \in N(i)} W_{ij}\mathbf{x}_j \right\|^2

Step 2: Find low-D embedding y preserving same weights:

\min_Y \sum_i \left\| \mathbf{y}_i - \sum_{j \in N(i)} W_{ij}\mathbf{y}_j \right\|^2

Preserves local linear structure!
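
A usage sketch for both ISOMAP and LLE via scikit-learn's manifold module; the Swiss-roll data and the neighborhood size k = 10 are illustrative choices.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)                   # k-NN graph + geodesic MDS
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)   # local weight preservation

Z_iso = iso.fit_transform(X)
Z_lle = lle.fit_transform(X)
print(Z_iso.shape, Z_lle.shape)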

Example: MNIST Digit Visualization

Task: Visualize 70,000 handwritten digits (28×28 pixels = 784 dimensions) in 2D

PCA

Fast (seconds), shows global structure, some digit separation

t-SNE

Slower (minutes), beautiful clusters, clear digit separation

Autoencoder

Training required, smooth manifold, can generate new digits
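
A sketch of the PCA and t-SNE legs of this comparison, using the small sklearn digits dataset (8×8 = 64 dimensions) as a quick stand-in; swapping in fetch_openml("mnist_784") gives the full 70,000-sample, 784-dimensional version described above.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)     # seconds; linear, global structure
Z_tsne = TSNE(n_components=2, init="pca",
              random_state=0).fit_transform(X)   # slower; clearer clusters

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in zip(axes, [Z_pca, Z_tsne], ["PCA", "t-SNE"]):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(title)
plt.show()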

UMAP: Uniform Manifold Approximation

Modern Alternative to t-SNE

Mathematical Foundation

Based on Riemannian geometry and algebraic topology

Key assumption: the data is uniformly distributed on a locally connected Riemannian manifold

High-Dimensional Graph Construction

Fuzzy simplicial set membership:

w(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)

• ρ_i = distance to nearest neighbor
• σ_i chosen so the effective neighborhood size matches n_neighbors (analogous to perplexity in t-SNE)
• Symmetrize: v(x_i, x_j) = w(x_i, x_j) + w(x_j, x_i) - w(x_i, x_j)w(x_j, x_i)

Low-Dimensional Optimization

Cross-entropy loss:

\sum_{ij} v_{ij}\log\frac{v_{ij}}{w'_{ij}} + (1-v_{ij})\log\frac{1-v_{ij}}{1-w'_{ij}}

where w'_ij = (1 + ||y_i - y_j||²)^(-1) in low-D space

UMAP vs t-SNE

t-SNE

  • Focuses on local structure
  • Slower (O(n²) exact; O(n log n) with Barnes-Hut)
  • Non-deterministic
  • No transform for new data

UMAP

  • Preserves both local & global structure
  • Faster (roughly linear scaling in practice)
  • More reproducible
  • Can transform new data

Key Parameters

  • n_neighbors: Size of local neighborhood (default: 15). Larger → more global structure.
  • min_dist: Minimum distance between points in low-D (default: 0.1). Smaller → tighter clusters.
  • metric: Distance function (Euclidean, cosine, etc.)
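
A usage sketch assuming the third-party umap-learn package (pip install umap-learn); the parameter values mirror the defaults listed above, and the digits data is an illustrative choice.

import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="euclidean", random_state=42)
Z = reducer.fit_transform(X)        # 2-D embedding

Z_new = reducer.transform(X[:10])   # unlike t-SNE, new points can be projected
print(Z.shape, Z_new.shape)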

Choosing the Right Method

Decision Guide for Dimensionality Reduction

Use PCA when:

  • Need fast, scalable solution
  • Linear relationships expected
  • Want to transform new data
  • Need interpretable components
  • Example: Preprocessing for ML pipeline

Use t-SNE/UMAP when:

  • Visualization is primary goal (2D/3D)
  • Non-linear structure expected
  • Want to reveal clusters
  • Computational cost acceptable
  • Example: Exploratory data analysis, publication figures

Use LDA when:

  • Have labeled data
  • Want to maximize class separation
  • Preprocessing for classification
  • Example: Face recognition, speech recognition

Use Autoencoder when:

  • Have large dataset
  • Complex non-linear manifold
  • Need generative capability
  • GPU resources available
  • Example: Image compression, anomaly detection

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main goal of dimensionality reduction?
  2. PCA finds:
  3. Principal components are:
  4. If top 3 PCs explain 95% of variance, this means:
  5. t-SNE (t-Distributed Stochastic Neighbor Embedding) is primarily used for:
  6. Compared to PCA, t-SNE:
  7. LDA (Linear Discriminant Analysis) for dimensionality reduction:
  8. Autoencoders for dimensionality reduction:
  9. The curse of dimensionality refers to:
  10. Which method is NOT suitable for reducing 1000 dimensions to 2 for visualization?