
Sparse Representation and Dictionary Learning

Learn how to transform dense data into sparse representations through dictionary learning. Master the optimization framework that learns dictionaries and sparse coefficients simultaneously for efficient high-dimensional data processing.

Module 5 of 8
Intermediate to Advanced
120-150 min

What is Sparse Representation?

Sparse representation expresses data as a linear combination of a small number of basis vectors (atoms) from a dictionary. Most coefficients are zero, making the representation efficient to store and compute with.

Key Concept

Given a sample $x \in \mathbb{R}^n$ and a dictionary $B \in \mathbb{R}^{n \times k}$, find a sparse coefficient vector $\alpha \in \mathbb{R}^k$ such that:

$$x \approx B\alpha$$

where $\alpha$ has mostly zero entries (sparse).
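A minimal numpy sketch of this idea, with made-up dimensions and a hand-picked sparse coefficient vector: the sample is reconstructed from just three of the dictionary's atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 20                               # n features, k atoms (overcomplete: k > n)
B = rng.standard_normal((n, k))            # dictionary: one atom per column

alpha = np.zeros(k)                        # sparse coefficients: only 3 non-zeros
alpha[[2, 7, 15]] = [1.5, -0.8, 0.3]

x = B @ alpha                              # x is a combination of just 3 atoms
print("non-zeros in alpha:", np.count_nonzero(alpha), "of", k)
```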

Advantages

  • Efficient storage
  • Fast computation
  • Better linear separability
  • Interpretable (few atoms)
  • Noise reduction

Challenges

  • Dictionary size selection
  • Iterative optimization
  • Computational cost
  • Local optima
  • Requires tuning

Dictionary Learning Problem

Given dense samples $\{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^n$, learn:

1. Dictionary Matrix B

Dictionary $B \in \mathbb{R}^{n \times k}$ with $k$ atoms (columns). Each atom $b_j$ is a basis vector.

$k$ is user-specified (the dictionary size). Typically $k > n$ (overcomplete dictionary).

2. Sparse Coefficients α_i

For each sample $x_i$, find a sparse coefficient vector $\alpha_i \in \mathbb{R}^k$ such that $x_i \approx B\alpha_i$.

Most elements of $\alpha_i$ should be zero (sparse).

Optimization Objective

Dictionary learning minimizes reconstruction error plus sparsity penalty:

$$\min_{B, \{\alpha_i\}} \sum_{i=1}^m \|x_i - B\alpha_i\|_2^2 + \lambda \sum_{i=1}^m \|\alpha_i\|_1$$

First term: reconstruction error. Second term: L1 sparsity penalty (same as LASSO).
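As a small illustration, the objective can be evaluated in a few lines of numpy, assuming samples are stacked as the columns of a matrix X and their coefficients as the columns of A:

```python
import numpy as np

def dictionary_learning_objective(X, B, A, lam):
    """Reconstruction error plus L1 sparsity penalty.

    X   : (n, m) data matrix, one sample x_i per column
    B   : (n, k) dictionary
    A   : (k, m) coefficient matrix, one alpha_i per column
    lam : sparsity weight lambda
    """
    reconstruction = np.sum((X - B @ A) ** 2)   # sum_i ||x_i - B alpha_i||_2^2
    sparsity = np.sum(np.abs(A))                # sum_i ||alpha_i||_1
    return reconstruction + lam * sparsity
```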

Alternating Optimization Algorithm

Dictionary learning has no closed-form solution. We use alternating optimization: fix one variable, optimize the other, and iterate.

Step 1

Fix Dictionary B, Optimize α_i

For each sample $x_i$, solve:

$$\min_{\alpha_i} \|x_i - B\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1$$

This is a LASSO problem! Use Proximal Gradient Descent (PGD) to solve.
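A minimal sketch of this step using ISTA (the basic form of proximal gradient descent for LASSO); the function names and the fixed iteration count are illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_pgd(x, B, lam, n_iters=200):
    """Solve min_alpha ||x - B alpha||_2^2 + lam * ||alpha||_1 by proximal
    gradient descent (ISTA) with step size 1/L, where L = 2 ||B^T B||_2 is
    the Lipschitz constant of the gradient of the smooth term."""
    alpha = np.zeros(B.shape[1])
    L = 2.0 * np.linalg.norm(B.T @ B, 2)
    for _ in range(n_iters):
        grad = 2.0 * B.T @ (B @ alpha - x)      # gradient of ||x - B alpha||_2^2
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```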

Step 2

Fix α_i, Optimize Dictionary B

With $\{\alpha_i\}$ fixed, optimize:

$$\min_B \|X - BA\|_F^2$$

where $X = [x_1, \ldots, x_m]$, $A = [\alpha_1, \ldots, \alpha_m]$, and $\|\cdot\|_F$ is the Frobenius norm.

Update each dictionary atom $b_j$ using SVD.

Step 3

Iterate Until Convergence

Repeat steps 1-2 until reconstruction error converges. The algorithm alternates between sparse coding (step 1) and dictionary update (step 2).

Dictionary Update via SVD

To update dictionary atom $b_j$, we decompose the optimization:

Decomposition Strategy

For each atom $b_j$ (the $j$-th column of $B$), fix all other atoms and optimize:

$$\min_{b_j} \|E_j - b_j \alpha^j\|_F^2$$

where $E_j = X - \sum_{l \neq j} b_l \alpha^l$ is the residual matrix (the error when all atoms except $b_j$ are used), and $\alpha^j$ is the $j$-th row of $A$ (the coefficients associated with atom $b_j$).

Solution: Perform SVD on $E_j$. The optimal $b_j$ is the left singular vector corresponding to the largest singular value, so the updated atom best explains the residual. (This per-atom update is the core of the K-SVD algorithm, which also refits the coefficient row $\alpha^j$ from the same SVD, restricted to its non-zero support so sparsity is preserved.)

Algorithm for Dictionary Update

  1. For each atom $j = 1, 2, \ldots, k$:
    • Compute the residual: $E_j = X - \sum_{l \neq j} b_l \alpha^l$
    • Perform SVD: $E_j = U \Sigma V^T$
    • Update: $b_j = u_1$ (the first left singular vector)
  2. Normalize atoms: $b_j \leftarrow b_j / \|b_j\|_2$ (optional, for stability). A code sketch of this update sweep is given below.
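A sketch of one update sweep under the same conventions as before ($X$ of shape $n \times m$, $B$ of shape $n \times k$, $A$ of shape $k \times m$); the helper name is illustrative:

```python
import numpy as np

def update_dictionary_svd(X, B, A):
    """One sweep of the per-atom dictionary update.

    For each atom j, form the residual E_j = X - sum_{l != j} b_l alpha^l and
    replace b_j with the leading left singular vector of E_j. Since u_1 has
    unit norm, the optional normalization step is automatic."""
    k = B.shape[1]
    for j in range(k):
        # E_j = X - B A + b_j alpha^j : add back atom j's own contribution.
        E_j = X - B @ A + np.outer(B[:, j], A[j, :])
        U, s, Vt = np.linalg.svd(E_j, full_matrices=False)
        B[:, j] = U[:, 0]                       # first left singular vector u_1
        # Full K-SVD would also refit the coefficient row here:
        #   A[j, :] = s[0] * Vt[0, :], restricted to its non-zero support.
    return B
```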

Practical Example: Image Patches

Dictionary learning is widely used in image processing. Consider learning a dictionary for 8×8 image patches (64-dimensional vectors).

Problem Setup

Extract 10,000 image patches (8×8 pixels = 64 dimensions) from natural images. Learn a dictionary with $k = 256$ atoms.
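This setup can be tried with off-the-shelf tools. The sketch below uses scikit-learn's extract_patches_2d and MiniBatchDictionaryLearning; note that its solver uses mini-batch updates rather than the exact PGD-plus-SVD alternation described above, and it stores the dictionary with atoms as rows (the transpose of $B$), so treat it as an illustration of the problem setup rather than of this module's algorithm.

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

# Grayscale test image (any natural image works; "china.jpg" ships with sklearn).
img = load_sample_image("china.jpg").mean(axis=2) / 255.0

# 10,000 random 8x8 patches, flattened into 64-dimensional vectors.
patches = extract_patches_2d(img, (8, 8), max_patches=10_000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)          # remove each patch's DC component

# Learn an overcomplete dictionary with k = 256 atoms (alpha is the L1 weight).
dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, random_state=0)
codes = dico.fit(X).transform(X)            # sparse codes, shape (10000, 256)

print("average non-zeros per patch:", np.count_nonzero(codes) / len(codes))
```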

Dictionary Learning Result

After alternating optimization:

  • Dictionary atoms resemble Gabor-like filters (edges, textures)
  • Each patch is represented by ~5-10 non-zero coefficients (sparse)
  • Reconstruction error: $< 5\%$
  • Storage: a 64×256 dictionary plus sparse coefficients vs. a dense 64×10,000 data matrix

Applications

Learned dictionary enables image denoising, compression, and inpainting. Sparse representation separates signal from noise and enables efficient storage.

Key Takeaways

Sparse representation expresses data as a linear combination of few dictionary atoms, enabling efficient storage and computation.

Dictionary learning optimizes both dictionary and sparse coefficients simultaneously through alternating optimization.

Sparse coding (step 1) is a LASSO problem solved with PGD; dictionary update (step 2) uses SVD to find optimal atoms.

Dictionary learning is widely used in image processing, text analysis, and signal processing for denoising and compression.

The L1 sparsity penalty ensures most coefficients are zero, creating interpretable and efficient representations.

Overcomplete dictionaries ($k > n$) provide flexibility for sparse representation of diverse data patterns.