
Dimensionality Reduction

Master PCA, t-SNE, LDA, and autoencoders for feature extraction and visualization

Main Dimensionality Reduction Methods

PCA (Principal Component Analysis)

Linear, unsupervised. Finds directions of maximum variance.

\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \bar{\mathbf{x}})

✓ Fast, generalizes to new data | ✗ Linear only

t-SNE

Non-linear, for visualization. Preserves local structure.

✓ Beautiful visualizations, reveals clusters | ✗ Slow, doesn't generalize

LDA (Linear Discriminant Analysis)

Supervised. Maximizes class separation.

✓ Uses labels, good for classification | ✗ Max k=C-1 components

Autoencoders

Deep learning. Encoder-decoder with bottleneck.

✓ Non-linear, flexible | ✗ Requires training, complexity

Curse of Dimensionality

Why High Dimensions Are Problematic

Volume of Unit Hypersphere

V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} r^d

As d → ∞, nearly all volume concentrates near the surface! Center becomes "empty".

Distance Concentration

For random points in d-dimensional unit cube:

\mathbb{E}[\|\mathbf{x} - \mathbf{y}\|^2] = \frac{d}{6}, \qquad \text{Var}(\|\mathbf{x} - \mathbf{y}\|^2) = O(d)

Relative variance: Var/E² = O(1/d) → 0. All distances become nearly equal!
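
A quick NumPy simulation makes the concentration effect concrete (a minimal sketch; the 1000 point pairs and the random seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    x = rng.random((1000, d))                    # uniform points in the unit cube
    y = rng.random((1000, d))
    sq_dist = np.sum((x - y) ** 2, axis=1)       # squared Euclidean distances
    mean, var = sq_dist.mean(), sq_dist.var()
    # Mean tracks d/6; relative spread Var/E^2 shrinks roughly like 1/d.
    print(f"d={d:4d}  E={mean:7.2f} (theory {d/6:.2f})  Var/E^2={var / mean**2:.4f}")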

Sample Complexity

To maintain same sample density:

n \propto \rho^d

For ρ=10 samples per dimension: d=2 needs 100 samples, d=10 needs 10 billion!

PCA: Complete Mathematical Derivation

Variance Maximization via Eigendecomposition

Problem Setup

Given: data matrix X ∈ R^(n×d) (rows are samples). Find a unit vector w ∈ R^d such that the projected data Xw has maximum variance:

\max_{\mathbf{w}} \text{Var}(\mathbf{X}\mathbf{w}) \quad \text{subject to } \|\mathbf{w}\| = 1

Derivation Step-by-Step

Step 1: Center data

X_centered = X - mean(X)

Step 2: Express variance

\text{Var}(\mathbf{X}\mathbf{w}) = \frac{1}{n}\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{w}^T\mathbf{C}\mathbf{w}

where C = (1/n)X^T X is the covariance matrix of the centered data

Step 3: Lagrangian with constraint ||w||=1

\mathcal{L} = \mathbf{w}^T\mathbf{C}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)

Step 4: Take derivative and set to zero

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{C}\mathbf{w} - 2\lambda\mathbf{w} = 0
\mathbf{C}\mathbf{w} = \lambda\mathbf{w}

So w is an eigenvector of C and λ is the corresponding eigenvalue!

Step 5: Variance equals eigenvalue

\text{Var}(\mathbf{X}\mathbf{w}) = \mathbf{w}^T\mathbf{C}\mathbf{w} = \mathbf{w}^T(\lambda\mathbf{w}) = \lambda

Conclusion: Choose eigenvectors with largest eigenvalues to maximize variance!
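
A minimal NumPy sketch of this derivation (toy Gaussian data; n = 200, d = 5, and k = 2 are illustrative choices): center the data, eigendecompose C, and check that each projected variance equals its eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy data: n=200 samples, d=5 features

X_centered = X - X.mean(axis=0)            # Step 1: center
C = X_centered.T @ X_centered / len(X)     # covariance matrix C = (1/n) X^T X

eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                         # top-k principal directions
Z = X_centered @ W                         # projected data, shape (n, k)

print(np.allclose(Z.var(axis=0), eigvals[:k]))   # Step 5: Var(Xw) = eigenvalue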

Reconstruction Error Equivalence

Theorem: Maximizing variance ⇔ Minimizing reconstruction error

\min_{\mathbf{W}} \sum_i \|\mathbf{x}_i - \mathbf{W}\mathbf{W}^T\mathbf{x}_i\|^2

PCA finds best k-dimensional linear subspace in least-squares sense.

SVD Connection

Singular Value Decomposition: X = UΣV^T

  • Right singular vectors (columns of V) are the principal components
  • Singular values σ_i give the eigenvalues: λ_i = σ_i²/n
  • No need to compute C explicitly; more numerically stable

In practice, use SVD for PCA computation
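
A minimal sketch of the SVD route on made-up data: the right singular vectors give the principal directions and σ_i²/n recovers the eigenvalues, without ever forming C.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                            # columns = top-2 principal components
Z = Xc @ W                              # projected data
eigvals_svd = S**2 / len(X)             # eigenvalues of C from singular values

C = Xc.T @ Xc / len(X)                  # cross-check against eigendecomposition
eigvals_c = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(eigvals_svd[:2], eigvals_c[:2]))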

t-SNE: Cost Function & Gradient

Probabilistic Neighborhood Preservation

High-Dimensional Similarities

Gaussian similarity (conditional probability):

p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}

Symmetric version: p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

Low-Dimensional Similarities (Student-t)

q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}

Why Student-t (heavy-tailed)? It allows moderate distances in high-D to map to larger distances in low-D, alleviating the crowding problem.

Cost Function (KL Divergence)

C = \text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Minimize difference between high-D and low-D probability distributions

Gradient for Optimization

\frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}

Optimized using gradient descent with momentum. Complexity: O(n²) per iteration.
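
A usage sketch with scikit-learn's TSNE; the synthetic blob data and the hyperparameters are illustrative, and note that sklearn defaults to the Barnes-Hut approximation (O(n log n)) rather than the exact O(n²) gradient above.

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=500, n_features=50, centers=5, random_state=0)

tsne = TSNE(n_components=2, perplexity=30,      # perplexity controls the sigma_i above
            init="pca", random_state=0)
Y = tsne.fit_transform(X)                       # (500, 2) embedding
print(Y.shape)                                  # no transform() for new points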

Kernel PCA: Non-linear Extension

PCA in Feature Space via Kernel Trick

Idea

Apply PCA in high-dimensional feature space φ(x) without explicit mapping:

\mathbf{x} \xrightarrow{\phi} \phi(\mathbf{x}) \xrightarrow{\text{PCA}} \text{low-dim}

Kernel Matrix Formulation

In feature space, the covariance is C = (1/n)Φ^T Φ, where Φ ∈ R^(n×D) has rows φ(x_1)^T, ..., φ(x_n)^T

Principal components: Cv = λv becomes:

\frac{1}{n}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{v} = \lambda\mathbf{v}

Multiplying on the left by Φ: (1/n)ΦΦ^T(Φv) = λ(Φv)

Kernel Matrix:

\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T, \quad K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)

Solve eigenvalue problem: Kα = nλα, then v = Φ^T α

Projection of New Point

z = \mathbf{v}^T\phi(\mathbf{x}_{\text{new}}) = \boldsymbol{\alpha}^T\boldsymbol{\Phi}\phi(\mathbf{x}_{\text{new}}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x}_{\text{new}})

Only kernel evaluations needed! Never compute φ explicitly.
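
A usage sketch with scikit-learn's KernelPCA, which solves the Kα = nλα problem internally; the concentric-circles data and the RBF gamma value are illustrative choices.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)          # non-linear structure becomes linearly separable

Z_new = kpca.transform(X[:5])      # new points via sum_i alpha_i k(x_i, x_new)
print(Z.shape, Z_new.shape)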

LDA: Fisher's Linear Discriminant

Maximizing Class Separability

Scatter Matrices

Within-class scatter:

S_W = \sum_{c=1}^{C} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^T

Measures spread within each class

Between-class scatter:

S_B = \sum_{c=1}^{C} n_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^T

Measures separation between class means

Fisher's Criterion

Find projection w that maximizes:

J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}

Ratio of between-class to within-class variance

Solution via Generalized Eigenvalue Problem

Taking derivative and setting to zero:

S_B \mathbf{w} = \lambda S_W \mathbf{w}

If S_W invertible:

S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}

Solution: Eigenvectors of S_W^(-1)S_B with largest eigenvalues

Dimensionality Limit

Theorem: LDA can extract at most C-1 features (C = number of classes)

Proof sketch: S_B is a sum of C rank-1 matrices, and the deviations μ_c − μ satisfy one linear constraint (Σ_c n_c(μ_c − μ) = 0), so rank(S_B) ≤ C−1
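
A usage sketch with scikit-learn's LinearDiscriminantAnalysis on the Iris data (C = 3 classes, so at most 2 components), consistent with the limit above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)   # n_components must be <= C-1
Z = lda.fit_transform(X, y)                        # supervised: labels required
print(Z.shape)                                     # (150, 2)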

Autoencoders for Non-linear Dimensionality Reduction

Neural Network-Based Dimensionality Reduction

Architecture

Encoder: x → h = f_encoder(x; θ_enc)
Decoder: h → x̂ = f_decoder(h; θ_dec)

\min_{\theta_{\text{enc}}, \theta_{\text{dec}}} \sum_i \|\mathbf{x}_i - f_{\text{dec}}(f_{\text{enc}}(\mathbf{x}_i))\|^2
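
A minimal PyTorch sketch of this architecture, assuming 784-dimensional inputs (e.g. flattened MNIST) and a 32-dimensional bottleneck; the layer sizes, optimizer settings, and random stand-in batch are illustrative.

import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # squared reconstruction error, as above

x = torch.rand(64, 784)                     # stand-in batch; replace with real data
for step in range(100):
    x_hat = model(x)                        # x -> h -> x_hat
    loss = loss_fn(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(x)                      # 32-dimensional representations
print(codes.shape)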

Relation to PCA

Theorem: A linear autoencoder (no activation functions) trained with MSE loss learns the same subspace as PCA

  • Encoder W_enc, decoder W_dec
  • At optimum: span(W_enc) = span(first k eigenvectors of C)
  • Non-linear activations → can learn non-linear manifolds

Variational Autoencoder (VAE)

Probabilistic version: encode to distribution, not point

\mathcal{L} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))

Learns smooth latent space, enables generation of new samples!

Manifold Learning Theory

Non-linear Methods: ISOMAP, LLE, UMAP

Manifold Hypothesis

Assumption: High-dimensional data lies on or near a low-dimensional manifold embedded in high-D space

Example: Images of faces vary smoothly with pose, lighting → lie on low-D manifold in pixel space

ISOMAP: Geodesic Distances

  1. Build k-NN graph
  2. Compute shortest path distances (geodesic) using Dijkstra/Floyd-Warshall
  3. Apply classical MDS on geodesic distance matrix

Preserves intrinsic geometry of manifold, not Euclidean distance

LLE: Local Linear Embedding

Step 1: Find weights W_ij that reconstruct x_i from neighbors:

\min_W \sum_i \left\| \mathbf{x}_i - \sum_{j \in N(i)} W_{ij}\mathbf{x}_j \right\|^2

Step 2: Find low-D embedding y preserving same weights:

\min_Y \sum_i \left\| \mathbf{y}_i - \sum_{j \in N(i)} W_{ij}\mathbf{y}_j \right\|^2

Preserves local linear structure!
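
A usage sketch for both ISOMAP and LLE via scikit-learn's manifold module; the Swiss-roll data and the neighborhood size k = 10 are illustrative choices.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)                   # k-NN graph + geodesic MDS
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)   # local weight preservation

Z_iso = iso.fit_transform(X)
Z_lle = lle.fit_transform(X)
print(Z_iso.shape, Z_lle.shape)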

Example: MNIST Digit Visualization

Task: Visualize 70,000 handwritten digits (28×28 pixels = 784 dimensions) in 2D

PCA

Fast (seconds), shows global structure, some digit separation

t-SNE

Slower (minutes), beautiful clusters, clear digit separation

Autoencoder

Training required, smooth manifold, can generate new digits
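
A sketch of the PCA and t-SNE legs of this comparison, using the small sklearn digits dataset (8×8 = 64 dimensions) as a quick stand-in; swapping in fetch_openml("mnist_784") gives the full 70,000-sample, 784-dimensional version described above.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)     # seconds; linear, global structure
Z_tsne = TSNE(n_components=2, init="pca",
              random_state=0).fit_transform(X)   # slower; clearer clusters

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in zip(axes, [Z_pca, Z_tsne], ["PCA", "t-SNE"]):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(title)
plt.show()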

UMAP: Uniform Manifold Approximation

Modern Alternative to t-SNE

Mathematical Foundation

Based on Riemannian geometry and algebraic topology

Key assumption: the data is uniformly distributed on a locally connected Riemannian manifold

High-Dimensional Graph Construction

Fuzzy simplicial set membership:

w(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)

• ρ_i = distance to nearest neighbor
• σ_i chosen so the effective neighborhood size matches n_neighbors (analogous to perplexity in t-SNE)
• Symmetrize: v(x_i, x_j) = w(x_i, x_j) + w(x_j, x_i) - w(x_i, x_j)w(x_j, x_i)

Low-Dimensional Optimization

Cross-entropy loss:

\sum_{ij} v_{ij}\log\frac{v_{ij}}{w'_{ij}} + (1-v_{ij})\log\frac{1-v_{ij}}{1-w'_{ij}}

where w'_ij = (1 + ||y_i - y_j||²)^(-1) in low-D space

UMAP vs t-SNE

t-SNE

  • Focuses on local structure
  • Slower (O(n²) exact; O(n log n) with Barnes-Hut)
  • Non-deterministic
  • No transform for new data

UMAP

  • Preserves both local & global structure
  • Faster (roughly linear scaling in practice)
  • More reproducible
  • Can transform new data

Key Parameters

  • n_neighbors: Size of local neighborhood (default: 15). Larger → more global structure.
  • min_dist: Minimum distance between points in low-D (default: 0.1). Smaller → tighter clusters.
  • metric: Distance function (Euclidean, cosine, etc.)
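
A usage sketch assuming the third-party umap-learn package (pip install umap-learn); the parameter values mirror the defaults listed above, and the digits data is an illustrative choice.

import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="euclidean", random_state=42)
Z = reducer.fit_transform(X)        # 2-D embedding

Z_new = reducer.transform(X[:10])   # unlike t-SNE, new points can be projected
print(Z.shape, Z_new.shape)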

Choosing the Right Method

Decision Guide for Dimensionality Reduction

Use PCA when:

  • Need fast, scalable solution
  • Linear relationships expected
  • Want to transform new data
  • Need interpretable components
  • Example: Preprocessing for ML pipeline

Use t-SNE/UMAP when:

  • Visualization is primary goal (2D/3D)
  • Non-linear structure expected
  • Want to reveal clusters
  • Computational cost acceptable
  • Example: Exploratory data analysis, publication figures

Use LDA when:

  • Have labeled data
  • Want to maximize class separation
  • Preprocessing for classification
  • Example: Face recognition, speech recognition

Use Autoencoder when:

  • Have large dataset
  • Complex non-linear manifold
  • Need generative capability
  • GPU resources available
  • Example: Image compression, anomaly detection

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main goal of dimensionality reduction?
  2. PCA finds:
  3. Principal components are:
  4. If top 3 PCs explain 95% of variance, this means:
  5. t-SNE (t-Distributed Stochastic Neighbor Embedding) is primarily used for:
  6. Compared to PCA, t-SNE:
  7. LDA (Linear Discriminant Analysis) for dimensionality reduction:
  8. Autoencoders for dimensionality reduction:
  9. The curse of dimensionality refers to:
  10. Which method is NOT suitable for reducing 1000 dimensions to 2 for visualization?