
Sparse Representation and Dictionary Learning

Learn how to transform dense data into sparse representations through dictionary learning. Master the optimization framework that learns dictionaries and sparse coefficients simultaneously for efficient high-dimensional data processing.

Module 5 of 8
Intermediate to Advanced
120-150 min

What is Sparse Representation?

Sparse representation expresses data as a linear combination of a small number of basis vectors (atoms) from a dictionary. Most coefficients are zero, making the representation efficient to store and compute with.

Key Concept

Given a sample $x \in \mathbb{R}^n$ and a dictionary $B \in \mathbb{R}^{n \times k}$, find a sparse coefficient vector $\alpha \in \mathbb{R}^k$ such that:

$$x \approx B\alpha$$

where $\alpha$ has mostly zero entries (sparse).
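A minimal numpy sketch of this idea, with made-up dimensions and a hand-picked sparse coefficient vector: the sample is reconstructed from just three of the dictionary's atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 20                               # n features, k atoms (overcomplete: k > n)
B = rng.standard_normal((n, k))            # dictionary: one atom per column

alpha = np.zeros(k)                        # sparse coefficients: only 3 non-zeros
alpha[[2, 7, 15]] = [1.5, -0.8, 0.3]

x = B @ alpha                              # x is a combination of just 3 atoms
print("non-zeros in alpha:", np.count_nonzero(alpha), "of", k)
```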

Advantages

  • Efficient storage
  • Fast computation
  • Better linear separability
  • Interpretable (few atoms)
  • Noise reduction

Challenges

  • Dictionary size selection
  • Iterative optimization
  • Computational cost
  • Local optima
  • Requires tuning

Dictionary Learning Problem

Given dense samples $\{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^n$, learn:

1. Dictionary Matrix B

Dictionary $B \in \mathbb{R}^{n \times k}$ with $k$ atoms (columns). Each atom $b_j$ is a basis vector.

$k$ is user-specified (the dictionary size). Typically $k > n$ (overcomplete dictionary).

2. Sparse Coefficients α_i

For each sample $x_i$, find a sparse coefficient vector $\alpha_i \in \mathbb{R}^k$ such that $x_i \approx B\alpha_i$.

Most elements of $\alpha_i$ should be zero (sparse).

Optimization Objective

Dictionary learning minimizes reconstruction error plus sparsity penalty:

$$\min_{B, \{\alpha_i\}} \sum_{i=1}^m \|x_i - B\alpha_i\|_2^2 + \lambda \sum_{i=1}^m \|\alpha_i\|_1$$

First term: reconstruction error. Second term: L1 sparsity penalty (same as LASSO).
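As a small illustration, the objective can be evaluated in a few lines of numpy, assuming samples are stacked as the columns of a matrix X and their coefficients as the columns of A:

```python
import numpy as np

def dictionary_learning_objective(X, B, A, lam):
    """Reconstruction error plus L1 sparsity penalty.

    X   : (n, m) data matrix, one sample x_i per column
    B   : (n, k) dictionary
    A   : (k, m) coefficient matrix, one alpha_i per column
    lam : sparsity weight lambda
    """
    reconstruction = np.sum((X - B @ A) ** 2)   # sum_i ||x_i - B alpha_i||_2^2
    sparsity = np.sum(np.abs(A))                # sum_i ||alpha_i||_1
    return reconstruction + lam * sparsity
```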

Alternating Optimization Algorithm

Dictionary learning has no closed-form solution. We use alternating optimization: fix one variable, optimize the other, and iterate.

Step 1

Fix Dictionary B, Optimize α_i

For each sample $x_i$, solve:

$$\min_{\alpha_i} \|x_i - B\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1$$

This is a LASSO problem! Use Proximal Gradient Descent (PGD) to solve.
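A minimal sketch of this step using ISTA (the basic form of proximal gradient descent for LASSO); the function names and the fixed iteration count are illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_pgd(x, B, lam, n_iters=200):
    """Solve min_alpha ||x - B alpha||_2^2 + lam * ||alpha||_1 by proximal
    gradient descent (ISTA) with step size 1/L, where L = 2 ||B^T B||_2 is
    the Lipschitz constant of the gradient of the smooth term."""
    alpha = np.zeros(B.shape[1])
    L = 2.0 * np.linalg.norm(B.T @ B, 2)
    for _ in range(n_iters):
        grad = 2.0 * B.T @ (B @ alpha - x)      # gradient of ||x - B alpha||_2^2
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```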

Step 2

Fix α_i, Optimize Dictionary B

With $\{\alpha_i\}$ fixed, optimize:

$$\min_B \|X - BA\|_F^2$$

where $X = [x_1, \ldots, x_m]$, $A = [\alpha_1, \ldots, \alpha_m]$, and $\|\cdot\|_F$ is the Frobenius norm.

Update each dictionary atom $b_j$ using SVD.

Step 3

Iterate Until Convergence

Repeat steps 1-2 until reconstruction error converges. The algorithm alternates between sparse coding (step 1) and dictionary update (step 2).

Dictionary Update via SVD

To update dictionary atom $b_j$, we decompose the optimization:

Decomposition Strategy

For each atom $b_j$ (the $j$-th column of $B$), fix all other atoms and optimize:

$$\min_{b_j} \|E_j - b_j \alpha^j\|_F^2$$

where $E_j = X - \sum_{l \neq j} b_l \alpha^l$ is the residual matrix (the error when all atoms except $b_j$ are used), and $\alpha^j$ is the $j$-th row of $A$ (the coefficients associated with atom $b_j$).

Solution: Perform SVD on $E_j$. The optimal $b_j$ is the left singular vector corresponding to the largest singular value, so the updated atom best explains the residual. (This per-atom update is the core of the K-SVD algorithm, which also refits the coefficient row $\alpha^j$ from the same SVD, restricted to its non-zero support so sparsity is preserved.)

Algorithm for Dictionary Update

  1. For each atom $j = 1, 2, \ldots, k$:
    • Compute the residual: $E_j = X - \sum_{l \neq j} b_l \alpha^l$
    • Perform SVD: $E_j = U \Sigma V^T$
    • Update: $b_j = u_1$ (the first left singular vector)
  2. Normalize atoms: $b_j \leftarrow b_j / \|b_j\|_2$ (optional, for stability). A code sketch of this update sweep is given below.
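A sketch of one update sweep under the same conventions as before ($X$ of shape $n \times m$, $B$ of shape $n \times k$, $A$ of shape $k \times m$); the helper name is illustrative:

```python
import numpy as np

def update_dictionary_svd(X, B, A):
    """One sweep of the per-atom dictionary update.

    For each atom j, form the residual E_j = X - sum_{l != j} b_l alpha^l and
    replace b_j with the leading left singular vector of E_j. Since u_1 has
    unit norm, the optional normalization step is automatic."""
    k = B.shape[1]
    for j in range(k):
        # E_j = X - B A + b_j alpha^j : add back atom j's own contribution.
        E_j = X - B @ A + np.outer(B[:, j], A[j, :])
        U, s, Vt = np.linalg.svd(E_j, full_matrices=False)
        B[:, j] = U[:, 0]                       # first left singular vector u_1
        # Full K-SVD would also refit the coefficient row here:
        #   A[j, :] = s[0] * Vt[0, :], restricted to its non-zero support.
    return B
```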

Practical Example: Image Patches

Dictionary learning is widely used in image processing. Consider learning a dictionary for 8×8 image patches (64-dimensional vectors).

Problem Setup

Extract 10,000 image patches (8×8 pixels = 64 dimensions) from natural images. Learn a dictionary with $k = 256$ atoms.
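This setup can be tried with off-the-shelf tools. The sketch below uses scikit-learn's extract_patches_2d and MiniBatchDictionaryLearning; note that its solver uses mini-batch updates rather than the exact PGD-plus-SVD alternation described above, and it stores the dictionary with atoms as rows (the transpose of $B$), so treat it as an illustration of the problem setup rather than of this module's algorithm.

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

# Grayscale test image (any natural image works; "china.jpg" ships with sklearn).
img = load_sample_image("china.jpg").mean(axis=2) / 255.0

# 10,000 random 8x8 patches, flattened into 64-dimensional vectors.
patches = extract_patches_2d(img, (8, 8), max_patches=10_000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)          # remove each patch's DC component

# Learn an overcomplete dictionary with k = 256 atoms (alpha is the L1 weight).
dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, random_state=0)
codes = dico.fit(X).transform(X)            # sparse codes, shape (10000, 256)

print("average non-zeros per patch:", np.count_nonzero(codes) / len(codes))
```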

Dictionary Learning Result

After alternating optimization:

  • Dictionary atoms resemble Gabor-like filters (edges, textures)
  • Each patch is represented by ~5-10 non-zero coefficients (sparse)
  • Reconstruction error: $< 5\%$
  • Storage: a 64×256 dictionary plus sparse coefficients vs. a dense 64×10,000 data matrix

Applications

Learned dictionary enables image denoising, compression, and inpainting. Sparse representation separates signal from noise and enables efficient storage.

Key Takeaways

Sparse representation expresses data as a linear combination of few dictionary atoms, enabling efficient storage and computation.

Dictionary learning optimizes both dictionary and sparse coefficients simultaneously through alternating optimization.

Sparse coding (step 1) is a LASSO problem solved with PGD; dictionary update (step 2) uses SVD to find optimal atoms.

Dictionary learning is widely used in image processing, text analysis, and signal processing for denoising and compression.

The L1 sparsity penalty ensures most coefficients are zero, creating interpretable and efficient representations.

Overcomplete dictionaries ($k > n$) provide flexibility for sparse representation of diverse data patterns.