Master PCA, t-SNE, LDA, and autoencoders for feature extraction and visualization
PCA: Linear, unsupervised. Finds directions of maximum variance.
✓ Fast, generalizes to new data | ✗ Linear only
t-SNE: Non-linear, for visualization. Preserves local structure.
✓ Beautiful visualizations, reveals clusters | ✗ Slow, doesn't generalize
LDA: Supervised. Maximizes class separation.
✓ Uses labels, good for classification | ✗ Max k=C-1 components
Autoencoder: Deep learning. Encoder-decoder with bottleneck.
✓ Non-linear, flexible | ✗ Requires training, complexity
As d → ∞, nearly all of the volume of a hypersphere (or hypercube) concentrates near its surface! The center becomes "empty".
For random points in the d-dimensional unit cube, the expected squared pairwise distance grows as E[||x − y||²] = d/6, while the relative variance Var/E² = O(1/d) → 0. All pairwise distances become nearly equal!
To maintain the same sample density, the number of samples must grow exponentially: n = ρ^d for ρ samples per dimension.
For ρ = 10 samples per dimension: d = 2 needs 100 samples, d = 10 needs 10 billion!
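A quick numerical check (a sketch, not from the source) using numpy: sample uniform points in [0, 1]^d and watch the relative spread of pairwise distances shrink as d grows.

```python
# Empirically check distance concentration in the d-dimensional unit cube.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.random((n, d))                   # n points, uniform in [0, 1]^d
    # distances from the first point to all others (cheap stand-in for all pairs)
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    rel_spread = dists.std() / dists.mean()  # ~ O(1/sqrt(d)), i.e. Var/E^2 = O(1/d)
    print(f"d={d:5d}  mean={dists.mean():.2f}  std/mean={rel_spread:.3f}")
```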
Given: Data matrix X ∈ R^(n×d), find linear projection w ∈ R^d (unit vector) that maximizes variance of projected data
Step 1: Center data
X_centered = X - mean(X)
Step 2: Express the variance of the projection
Var(Xw) = (1/n)||X_centered w||² = w^T C w, where C = (1/n)X^T X is the covariance matrix (X centered)
Step 3: Lagrangian with constraint ||w|| = 1
L(w, λ) = w^T C w − λ(w^T w − 1)
Step 4: Take derivative and set to zero
∂L/∂w = 2Cw − 2λw = 0  ⇒  Cw = λw
w is an eigenvector of C, λ is its eigenvalue!
Step 5: Variance equals eigenvalue
Var(Xw) = w^T C w = λ w^T w = λ
Conclusion: Choose eigenvectors with largest eigenvalues to maximize variance!
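A minimal sketch of this derivation in code, assuming numpy only: center the data, form C = (1/n)X^T X, and keep the top-k eigenvectors. The toy data below is illustrative.

```python
# PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca_eig(X, k):
    Xc = X - X.mean(axis=0)                 # Step 1: center the data
    C = (Xc.T @ Xc) / len(X)                # covariance C = (1/n) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]               # top-k eigenvectors = principal directions
    return Xc @ W, eigvals[order[:k]]       # projected data, variance per component

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Z, var = pca_eig(X, k=2)
print(Z.shape, var)   # (200, 2) and the two largest eigenvalues
```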
Theorem: Maximizing variance ⇔ Minimizing reconstruction error
PCA finds best k-dimensional linear subspace in least-squares sense.
Singular Value Decomposition: X = UΣV^T — the columns of V are the principal directions, with λ_i = σ_i²/n
In practice, compute PCA via the SVD of the centered data matrix: it is numerically more stable and avoids forming C explicitly
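A sketch of the SVD route, again assuming numpy only; it produces the same scores as the eigendecomposition approach.

```python
# PCA via SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # X_centered = U Σ V^T
components = Vt                                    # rows of V^T = principal directions
explained_var = S**2 / len(X)                      # λ_i = σ_i² / n
Z = Xc @ Vt[:2].T                                  # project onto first two components
print(np.allclose(Z, U[:, :2] * S[:2]))            # True: identical scores
```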
Gaussian similarity (conditional probability) in high-D:
p_j|i = exp(−||x_i − x_j||²/2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||²/2σ_i²), with σ_i set by the chosen perplexity
Symmetric version: p_ij = (p_j|i + p_i|j) / 2n
Low-D similarity uses a Student-t kernel: q_ij = (1 + ||y_i − y_j||²)^(-1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(-1)
Why Student-t (heavy-tailed)? It lets moderate distances in high-D map to larger distances in low-D, alleviating the crowding problem.
Minimize the difference between the high-D and low-D probability distributions via the KL divergence:
KL(P||Q) = Σ_ij p_ij log(p_ij / q_ij)
Optimized using gradient descent with momentum. Complexity: O(n²) per iteration (Barnes-Hut approximation reduces this to O(n log n)).
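A minimal usage sketch with scikit-learn's TSNE; the dataset and parameter values (perplexity=30, init="pca") are illustrative choices, not prescribed by the source.

```python
# t-SNE on the small sklearn digits dataset.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)                    # note: no transform() for new data
print(Y.shape)                               # (1797, 2)
```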
Apply PCA in high-dimensional feature space φ(x) without explicit mapping:
In feature space, covariance: C = (1/n)Φ^T Φ, where Φ ∈ R^(n×D) is the matrix whose i-th row is φ(x_i)^T
Principal components: Cv = λv becomes:
Multiply by Φ: (1/n)ΦΦ^T(Φv) = λ(Φv)
Kernel Matrix: K = ΦΦ^T with K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j)
Solve eigenvalue problem: Kα = nλα, then v = Φ^T α
Only kernel evaluations needed! Never compute φ explicitly.
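An illustrative sketch of RBF kernel PCA following the Kα = nλα derivation above; the kernel-centering step and the gamma value are standard choices assumed here, not specified in the source (scikit-learn's KernelPCA is the practical option).

```python
# Kernel PCA with an RBF kernel: eigendecompose the centered kernel matrix.
import numpy as np

def rbf_kernel_pca(X, k, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # K_ij = k(x_i, x_j)
    n = len(X)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one      # center K in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order]                      # solutions of K α = nλ α
    # projection of training point i onto unit-norm v = Φ^T α is (Kc α)_i / sqrt(nλ)
    return Kc @ alphas / np.sqrt(eigvals[order])

X = np.random.default_rng(0).normal(size=(100, 3))
Z = rbf_kernel_pca(X, k=2, gamma=0.5)
print(Z.shape)   # (100, 2) -- never computed φ explicitly
```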
Within-class scatter: S_W = Σ_c Σ_{x ∈ class c} (x − μ_c)(x − μ_c)^T
Measures spread within each class
Between-class scatter: S_B = Σ_c n_c (μ_c − μ)(μ_c − μ)^T
Measures separation between class means
Find projection w that maximizes the Fisher criterion:
J(w) = (w^T S_B w) / (w^T S_W w) — the ratio of between-class to within-class variance
Taking the derivative and setting it to zero gives the generalized eigenvalue problem: S_B w = λ S_W w
If S_W is invertible: S_W^(-1) S_B w = λw
Solution: Eigenvectors of S_W^(-1)S_B with largest eigenvalues
Theorem: LDA can extract at most C-1 features (C = number of classes)
Proof sketch: S_B is a sum of C rank-1 matrices (μ_c − μ)(μ_c − μ)^T, and the weighted deviations n_c(μ_c − μ) sum to zero (one linear dependency), so rank(S_B) ≤ C−1
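A sketch of Fisher LDA via the S_W^(-1)S_B eigenproblem; the Iris toy data and the use of a pseudo-inverse are illustrative assumptions (sklearn's LinearDiscriminantAnalysis is the practical choice).

```python
# LDA: build scatter matrices, then take the top eigenvectors of S_W^{-1} S_B.
import numpy as np
from sklearn.datasets import load_iris

def lda_fit(X, y, k):
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)       # within-class scatter
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)         # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:k]   # at most C-1 nonzero eigenvalues
    return eigvecs[:, order].real                # projection matrix W

X, y = load_iris(return_X_y=True)
W = lda_fit(X, y, k=2)
print((X @ W).shape)   # (150, 2)
```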
Encoder: x → h = f_encoder(x; θ_enc)
Decoder: h → x̂ = f_decoder(h; θ_dec)
Theorem: Linear autoencoder (no activation functions) with MSE loss learns same subspace as PCA
Probabilistic version (variational autoencoder): encode each input to a distribution over the latent space, not a single point
Learns smooth latent space, enables generation of new samples!
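A minimal autoencoder sketch in PyTorch; the layer sizes, 2-D bottleneck, and toy training loop are assumptions for illustration, not the source's model.

```python
# Encoder -> 2-D bottleneck -> decoder, trained with MSE reconstruction loss.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_hidden=128, d_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_latent))       # bottleneck h
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_in))           # reconstruction x_hat

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                          # stand-in batch of flattened images
for _ in range(5):                               # a few toy optimization steps
    x_hat, h = model(x)
    loss = nn.functional.mse_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(h.shape)   # torch.Size([64, 2]) -- the learned low-D representation
```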
Assumption: High-dimensional data lies on or near a low-dimensional manifold embedded in high-D space
Example: Images of faces vary smoothly with pose, lighting → lie on low-D manifold in pixel space
Preserves intrinsic geometry of manifold, not Euclidean distance
Step 1 (Locally Linear Embedding): find weights W_ij that reconstruct x_i from its neighbors by minimizing Σ_i ||x_i − Σ_j W_ij x_j||², subject to Σ_j W_ij = 1
Step 2: Find the low-D embedding y preserving the same weights by minimizing Σ_i ||y_i − Σ_j W_ij y_j||²
Preserves local linear structure!
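A usage sketch with scikit-learn's LocallyLinearEmbedding on the swiss-roll toy manifold; n_neighbors=12 is an illustrative setting.

```python
# Unroll a 3-D swiss roll into 2-D while preserving local linear structure.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, random_state=0)   # points on a 2-D manifold in 3-D
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)   # (1000, 2)
```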
Task: Visualize 70,000 handwritten digits (28×28 pixels = 784 dimensions) in 2D
PCA: fast (seconds), shows global structure, some digit separation
t-SNE: slower (minutes), beautiful clusters, clear digit separation
Autoencoder: training required, smooth manifold, can generate new digits
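A sketch of this comparison using the small sklearn digits set as a stand-in for the full 70,000-image MNIST; timings and cluster quality differ at full scale.

```python
# Side-by-side 2-D embeddings of the digits: PCA vs. t-SNE.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                  # 1797 x 64
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, init="pca", random_state=0).fit_transform(X),
}
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, Z) in zip(axes, embeddings.items()):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=4, cmap="tab10")
    ax.set_title(name)
plt.show()
```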
UMAP (Uniform Manifold Approximation and Projection) is based on Riemannian geometry and algebraic topology
Key assumption: the data is uniformly distributed on a locally connected Riemannian manifold
Fuzzy simplicial set membership: w(x_i, x_j) = exp(−(d(x_i, x_j) − ρ_i) / σ_i), where
• ρ_i = distance to nearest neighbor
• σ_i chosen so the local neighborhood has a fixed effective size (UMAP's analogue of perplexity)
• Symmetrize: v(x_i, x_j) = w(x_i, x_j) + w(x_j, x_i) - w(x_i, x_j)w(x_j, x_i)
Cross-entropy loss: C = Σ_ij [ v_ij log(v_ij / w'_ij) + (1 − v_ij) log((1 − v_ij) / (1 − w'_ij)) ]
where w'_ij = (1 + ||y_i - y_j||²)^(-1) in low-D space
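A usage sketch assuming the umap-learn package is installed (pip install umap-learn); the parameter values shown are the library defaults, written out explicitly.

```python
# UMAP embedding of the sklearn digits dataset.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
Y = reducer.fit_transform(X)     # unlike t-SNE, reducer.transform() also handles new data
print(Y.shape)                   # (1797, 2)
```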
t-SNE vs. UMAP: UMAP is typically faster, scales to larger datasets, preserves more global structure, and can embed new points; t-SNE often yields tighter, more clearly separated clusters.
Use PCA when: you need a fast, linear method that generalizes to new data and works as a preprocessing step
Use t-SNE/UMAP when: the goal is 2D/3D visualization and revealing cluster structure, not downstream modeling
Use LDA when: class labels are available and the goal is class separation (at most C−1 components)
Use Autoencoder when: the data is large and non-linear, and you can afford training a neural network