LA-7.4

Orthogonal Projections and Least Squares

Orthogonal projection finds the closest point in a subspace to a given vector. This fundamental operation underlies least squares approximation—the backbone of regression, curve fitting, and countless applications in science and engineering.

Learning Objectives
  • Define orthogonal projection onto a subspace
  • Compute projection using orthonormal bases
  • Derive the projection matrix formula
  • Understand the best approximation theorem
  • Solve least squares problems using normal equations
  • Apply least squares to linear regression
  • Understand geometric interpretation of projections
  • Compute projections in function spaces
Prerequisites
  • Orthonormal bases (LA-7.2)
  • Gram-Schmidt process (LA-7.3)
  • Matrix transpose and multiplication (LA-4.2)
  • Linear independence and span

1. Orthogonal Projection onto a Subspace

Definition 7.14: Orthogonal Projection

Let U be a subspace of an inner product space V. The orthogonal projection of v onto U is the unique vector \text{proj}_U(v) \in U such that:

v - \text{proj}_U(v) \in U^\perp
Theorem 7.24: Best Approximation Theorem

The orthogonal projection \text{proj}_U(v) is the unique closest point in U to v:

\|v - \text{proj}_U(v)\| < \|v - w\| \quad \text{for all } w \in U,\ w \neq \text{proj}_U(v)
Proof:

Let p = \text{proj}_U(v) and w \in U. Then p - w \in U and v - p \in U^\perp.

By the Pythagorean theorem (since (v-p) \perp (p-w)):

\|v - w\|^2 = \|(v-p) + (p-w)\|^2 = \|v-p\|^2 + \|p-w\|^2

So \|v-w\|^2 > \|v-p\|^2 unless p = w.

Theorem 7.25: Projection Formula (Orthonormal Basis)

If \{e_1, \ldots, e_k\} is an orthonormal basis for U:

\text{proj}_U(v) = \sum_{i=1}^{k} \langle v, e_i \rangle e_i
Proof:

Let p = \sum \langle v, e_i \rangle e_i. Then p \in U (a linear combination of basis vectors).

Check v - p \perp U: for each e_j,

\langle v - p, e_j \rangle = \langle v, e_j \rangle - \langle v, e_j \rangle = 0
Example 7.32: Projection onto a Plane

Project v = (1, 2, 3)^T onto the xy-plane U = \text{span}\{e_1, e_2\} in \mathbb{R}^3.

\text{proj}_U(v) = \langle v, e_1 \rangle e_1 + \langle v, e_2 \rangle e_2 = 1 \cdot e_1 + 2 \cdot e_2 = (1, 2, 0)^T

The residual v - \text{proj}_U(v) = (0, 0, 3)^T is perpendicular to the plane.
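
As a quick computational check, here is a minimal NumPy sketch of Theorem 7.25 applied to Example 7.32; the helper name project_onto_orthonormal is illustrative, and the sum is equivalent to E @ E.T @ v.

```python
import numpy as np

def project_onto_orthonormal(v, E):
    """Project v onto the span of the orthonormal columns of E,
    using proj_U(v) = sum_i <v, e_i> e_i (equivalent to E @ E.T @ v)."""
    return sum(np.dot(v, e) * e for e in E.T)

# Example 7.32: v = (1, 2, 3), U = the xy-plane spanned by e1, e2
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])            # orthonormal basis vectors as columns
v = np.array([1.0, 2.0, 3.0])

p = project_onto_orthonormal(v, E)
print(p)          # [1. 2. 0.]
print(v - p)      # [0. 0. 3.]  -- residual, orthogonal to the plane
```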

Remark 7.15: Geometric Interpretation

Orthogonal projection is like "dropping a perpendicular" from v to the subspace U:

  • The projection \text{proj}_U(v) is the "shadow" of v on U
  • The residual v - \text{proj}_U(v) is the perpendicular "height"
  • Together they give an orthogonal decomposition: v = \text{proj}_U(v) + \text{proj}_{U^\perp}(v)
Example 7.32a: Projection onto a Line

Project v = (3, 4)^T onto the line U = \text{span}\{(1, 1)^T\}.

First normalize: e = \frac{1}{\sqrt{2}}(1, 1)^T

\text{proj}_U(v) = \langle v, e \rangle e = \frac{3+4}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1, 1)^T = \frac{7}{2}(1, 1)^T = (3.5, 3.5)^T

Distance from v to the line: \|v - \text{proj}_U(v)\| = \|(-0.5, 0.5)^T\| = \frac{1}{\sqrt{2}}

Corollary 7.10: Orthogonal Decomposition

Every vector v \in V can be written uniquely as:

v = \text{proj}_U(v) + \text{proj}_{U^\perp}(v)

where \text{proj}_{U^\perp}(v) = v - \text{proj}_U(v).

Example 7.32b: Projection in ℂⁿ

In \mathbb{C}^2, project v = (1, i)^T onto U = \text{span}\{(1, 0)^T\}:

\text{proj}_U(v) = \langle v, e_1 \rangle e_1 = 1 \cdot (1, 0)^T = (1, 0)^T

Residual: (0, i)^T \perp (1, 0)^T ✓ (check: \langle (0,i), (1,0) \rangle = 0)

Theorem 7.25a: Projection is Linear

The projection map P_U: V \to U is a linear transformation:

P_U(\alpha v + \beta w) = \alpha P_U(v) + \beta P_U(w)
Proof:

Using the formula P_U(v) = \sum \langle v, e_i \rangle e_i and linearity of the inner product in the first argument:

P_U(\alpha v + \beta w) = \sum \langle \alpha v + \beta w, e_i \rangle e_i = \alpha \sum \langle v, e_i \rangle e_i + \beta \sum \langle w, e_i \rangle e_i = \alpha P_U(v) + \beta P_U(w)
Remark 7.15a: Projection Operator Properties

The projection operator P_U satisfies:

  • P_U^2 = P_U (idempotent: projecting twice = projecting once)
  • \text{ker}(P_U) = U^\perp (null space is the orthogonal complement)
  • \text{im}(P_U) = U (image is the subspace)
  • P_U + P_{U^\perp} = I (identity decomposition)
Example 7.32c: Verifying Projection Properties

For the xy-plane projection in \mathbb{R}^3:

P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}

Check P^2 = P: ✓ (matrix multiplication confirms)

Check P^T = P: ✓ (symmetric)

I - P projects onto the z-axis: I - P = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}

Definition 7.14a: Distance to Subspace

The distance from v to the subspace U is:

d(v, U) = \|v - \text{proj}_U(v)\| = \|\text{proj}_{U^\perp}(v)\|
Example 7.32d: Distance Calculation

Find the distance from v = (1, 2, 2)^T to the plane x + y + z = 0.

The plane has normal n = (1, 1, 1)^T. Project v onto n:

\text{proj}_n(v) = \frac{\langle v, n \rangle}{\|n\|^2} n = \frac{5}{3}(1, 1, 1)^T

Distance: \|\text{proj}_n(v)\| = \frac{5}{3}\sqrt{3} = \frac{5\sqrt{3}}{3}

2. Projection Matrices

Theorem 7.26: Projection Matrix Formula

If A has full column rank and its columns span U, the projection matrix onto U is:

P = A(A^T A)^{-1} A^T

Then \text{proj}_U(v) = Pv.

Proof:

If u \in U, then u = Ax for some x. We need v - Ax \perp U, i.e., v - Ax orthogonal to every column of A:

A^T(v - Ax) = 0 \implies A^T A x = A^T v \implies x = (A^T A)^{-1} A^T v

So Ax = A(A^T A)^{-1} A^T v = Pv.

Theorem 7.27: Properties of Projection Matrices

An orthogonal projection matrix P satisfies:

  1. Idempotent: P^2 = P
  2. Symmetric: P^T = P
  3. Eigenvalues: only 0 and 1
  4. I - P projects onto U^\perp
Example 7.33: Projection Matrix onto Line

For the line through a = (1, 2)^T:

P = \frac{aa^T}{a^T a} = \frac{1}{5}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}

Check: P^2 = P, P^T = P ✓

Example 7.33a: Projection Matrix onto Plane

For the plane spanned by a_1 = (1, 0, 1)^T, a_2 = (0, 1, 1)^T:

A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \quad A^T A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
(A^T A)^{-1} = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}

Then P = A(A^T A)^{-1} A^T is a 3×3 projection matrix.
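
A short NumPy sketch (illustrative, not part of the original text) that builds P = A(AᵀA)⁻¹Aᵀ for this A and checks the properties stated in Theorem 7.27:

```python
import numpy as np

# Columns of A span the plane from Example 7.33a
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# P = A (A^T A)^{-1} A^T  (valid because A has full column rank)
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))                 # idempotent: True
print(np.allclose(P.T, P))                   # symmetric:  True
print(np.round(np.linalg.eigvalsh(P), 8))    # eigenvalues [0. 1. 1.]
```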

Corollary 7.11: Orthonormal Case

If the columns of Q are orthonormal, then Q^T Q = I and:

P = Q Q^T

This is much simpler: no inverse is needed.

Example 7.33b: Orthonormal Projection Matrix

With orthonormal basis Q = [e_1 \mid e_2] for the xy-plane:

P = QQ^T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
Theorem 7.27a: Eigenvalues of Projection Matrix

An orthogonal projection matrix P has only two eigenvalues:

  • \lambda = 1 with eigenspace U (the subspace projected onto)
  • \lambda = 0 with eigenspace U^\perp
Proof:

If v \in U, then Pv = v, so \lambda = 1.

If v \in U^\perp, then Pv = 0, so \lambda = 0.

Since V = U \oplus U^\perp, these are all the eigenvalues.

Remark 7.16: Trace and Rank

For projection onto a k-dimensional subspace of an n-dimensional space:

  • \text{tr}(P) = k (sum of eigenvalues: k ones and n - k zeros)
  • \text{rank}(P) = k (dimension of the image)
  • \det(P) = 0 (unless k = n, in which case P = I)
Definition 7.14b: Oblique Projection

An oblique projection satisfies P^2 = P but P^T \neq P. It projects along a direction that is not perpendicular to the subspace.

Example 7.33c: Complementary Projections

If P is the orthogonal projection onto U, then I - P is the orthogonal projection onto U^\perp:

(I - P)^2 = I - 2P + P^2 = I - 2P + P = I - P
(I - P)^T = I - P^T = I - P

Both conditions are verified: I - P is an orthogonal projection.

Theorem 7.27b: Projection Inequality

For any orthogonal projection P onto U and any vector v:

\|Pv\| \leq \|v\|

Equality holds iff v \in U (v is already in the subspace).

Proof:

By the Pythagorean theorem: \|v\|^2 = \|Pv\|^2 + \|(I-P)v\|^2.

Since \|(I-P)v\|^2 \geq 0, we have \|Pv\|^2 \leq \|v\|^2.

3. Least Squares Approximation

Definition 7.15: Least Squares Problem

Given A \in M_{m \times n} with m > n and b \in \mathbb{R}^m, find \hat{x} minimizing:

\|Ax - b\|^2 = \sum_{i=1}^{m} (a_i \cdot x - b_i)^2

where a_i denotes the i-th row of A.
Theorem 7.28: Normal Equation

The least squares solution satisfies the normal equation:

A^T A \hat{x} = A^T b

If A has full column rank, the unique solution is:

\hat{x} = (A^T A)^{-1} A^T b
Proof:

We want A\hat{x} = \text{proj}_{\text{col}(A)}(b). The residual b - A\hat{x} must be orthogonal to \text{col}(A):

A^T(b - A\hat{x}) = 0 \implies A^T A \hat{x} = A^T b
Example 7.34: Fitting a Line

Fit y = ax + b to the points (0,1), (1,2), (2,4).

A = \begin{pmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}
A^T A = \begin{pmatrix} 5 & 3 \\ 3 & 3 \end{pmatrix}, \quad A^T b = \begin{pmatrix} 10 \\ 7 \end{pmatrix}

Solving A^T A \hat{x} = A^T b: \hat{x} = (3/2, 5/6)^T, so y = 1.5x + 5/6.
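
The same fit can be checked numerically; the sketch below (assuming NumPy) solves the normal equations directly and compares the result with np.linalg.lstsq:

```python
import numpy as np

# Data for the points (0, 1), (1, 2), (2, 4)
A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])

# Solve the normal equations A^T A x = A^T b ...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ... or use the built-in least squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)                          # [1.5        0.83333333]
print(np.allclose(x_normal, x_lstsq))    # True  -> y = 1.5x + 5/6
```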

Remark 7.17: Geometric Interpretation of Least Squares

The least squares problem asks: find the closest point in \text{col}(A) to b.

A\hat{x} = \text{proj}_{\text{col}(A)}(b)

The residual r = b - A\hat{x} is perpendicular to the column space.

Example 7.34a: Quadratic Fit

Fit y = c_0 + c_1 x + c_2 x^2 to the points (0,1), (1,0), (2,3), (3,10):

A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 0 \\ 3 \\ 10 \end{pmatrix}

Solve A^T A c = A^T b to find the best-fit parabola.

Theorem 7.28a: Residual Properties

Let \hat{x} be the least squares solution and r = b - A\hat{x}:

  1. r \perp \text{col}(A), i.e., A^T r = 0
  2. \|r\|^2 = \|b\|^2 - \|A\hat{x}\|^2 (Pythagorean)
  3. r = (I - P)b where P = A(A^T A)^{-1}A^T
Example 7.34b: Computing Residual

From the line-fitting example, with \hat{x} = (3/2, 5/6)^T:

A\hat{x} = \begin{pmatrix} 5/6 \\ 7/3 \\ 23/6 \end{pmatrix}, \quad r = b - A\hat{x} = \begin{pmatrix} 1/6 \\ -1/3 \\ 1/6 \end{pmatrix}

Check: A^T r = \begin{pmatrix} 0 \cdot \tfrac{1}{6} + 1 \cdot (-\tfrac{1}{3}) + 2 \cdot \tfrac{1}{6} \\ \tfrac{1}{6} - \tfrac{1}{3} + \tfrac{1}{6} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} ✓

Residual norm: \|r\| = \sqrt{\tfrac{1}{36} + \tfrac{1}{9} + \tfrac{1}{36}} = \frac{1}{\sqrt{6}}

Corollary 7.12: Coefficient of Determination

The R² value measures how well the model fits:

R^2 = 1 - \frac{\|r\|^2}{\|b - \bar{b}\|^2}

where \bar{b} is the mean of b (as a constant vector). R^2 = 1 means a perfect fit.

Remark 7.17a: QR vs Normal Equations

Two methods for solving least squares:

  • Normal equations: solve A^T A \hat{x} = A^T b. Fast, but squares the condition number.
  • QR decomposition: compute A = QR, then solve R\hat{x} = Q^T b. More numerically stable.
Example 7.34c: Least Squares via QR

For A = QR with Q having orthonormal columns and R invertible upper triangular, the least squares solution satisfies:

A^T A \hat{x} = A^T b \implies R^T Q^T Q R \hat{x} = R^T Q^T b \implies R \hat{x} = Q^T b

Solve the upper triangular system R\hat{x} = Q^T b by back substitution.
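
A minimal sketch of the QR route, assuming NumPy and SciPy are available; it reuses the line-fitting data from Example 7.34:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])

# Thin QR: Q (3x2) has orthonormal columns, R (2x2) is upper triangular
Q, R = np.linalg.qr(A)

# Back substitution on R x = Q^T b; the normal equations are never formed
x_hat = solve_triangular(R, Q.T @ b)
print(x_hat)       # [1.5  0.8333...]  -- same fit, computed more stably
```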

Theorem 7.28b: Rank-Deficient Case

If A does not have full column rank:

  • Infinitely many x minimize \|Ax - b\|
  • The minimum-norm solution is \hat{x} = A^+ b (pseudoinverse)
  • It can be computed via the SVD: A = U \Sigma V^T gives A^+ = V \Sigma^+ U^T
Definition 7.15a: Pseudoinverse

The Moore-Penrose pseudoinverse A^+ generalizes the matrix inverse:

A^+ = V \Sigma^+ U^T

where \Sigma^+ is obtained by inverting the nonzero singular values and transposing.
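
A small illustrative sketch of the rank-deficient case, assuming NumPy; the particular matrix below is made up for demonstration:

```python
import numpy as np

# Rank-deficient design matrix: second column is twice the first
A = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [2.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

# Minimum-norm least squares solution via the pseudoinverse
x_pinv = np.linalg.pinv(A) @ b

# lstsq also returns the minimum-norm solution when A is rank deficient
x_lstsq, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(rank)                           # 1
print(np.allclose(x_pinv, x_lstsq))   # True
```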

4. Applications

Linear Regression

y = X\beta + \varepsilon. Least squares gives \hat{\beta} = (X^T X)^{-1} X^T y.

Polynomial Fitting

Fit degree-n polynomial using Vandermonde matrix. Same least squares setup.

Signal Denoising

Project noisy signal onto "smooth" subspace (low frequencies, polynomials, etc.).

GPS and Navigation

Overdetermined system from multiple satellites. Least squares finds best position estimate.

Example 7.35: Linear Regression

Given data \{(x_i, y_i)\}_{i=1}^n, fit y = \beta_0 + \beta_1 x:

X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}

Solve X^T X \hat{\beta} = X^T y to get the regression coefficients.

Example 7.35a: Weighted Least Squares

If observations have different reliabilities, use weighted least squares with weight matrix W = \text{diag}(w_1, \ldots, w_m):

\min_x \sum_i w_i (a_i \cdot x - b_i)^2 = \min_x \|W^{1/2}(Ax - b)\|^2

Solution: \hat{x} = (A^T W A)^{-1} A^T W b
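
A hedged NumPy sketch of weighted least squares; the weights below are invented for illustration:

```python
import numpy as np

# Invented weights: larger weight = more trusted observation
A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 4.0, 0.25])

# Scale rows by sqrt(w_i), then solve an ordinary least squares problem
sw = np.sqrt(w)
x_scaled, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)

# Closed form (A^T W A)^{-1} A^T W b for comparison
W = np.diag(w)
x_closed = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print(np.allclose(x_scaled, x_closed))   # True
```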

Remark 7.18: Regularization

When A^T A is nearly singular, add regularization (ridge regression):

\hat{x} = (A^T A + \lambda I)^{-1} A^T b

This trades bias for variance reduction and prevents overfitting.
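
A minimal ridge-regression sketch, assuming NumPy; the nearly collinear data is fabricated to show the effect of λ:

```python
import numpy as np

def ridge(A, b, lam):
    """Ridge-regularized least squares: (A^T A + lam*I)^{-1} A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Nearly collinear columns make A^T A almost singular
A = np.array([[1.0, 1.001],
              [1.0, 0.999],
              [1.0, 1.000]])
b = np.array([2.0, 2.1, 2.0])

print(ridge(A, b, lam=0.0))   # unstable coefficients (roughly +52 and -50)
print(ridge(A, b, lam=0.1))   # regularized: both close to 1
```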

Example 7.35b: Image Deblurring

A blurred image b is related to the original x by b = Ax, where A is a blurring operator.

Deblurring is an ill-posed inverse problem. Regularized least squares:

\min_x \|Ax - b\|^2 + \lambda \|x\|^2
Fourier Series as Projection

The Fourier series of f is its projection onto the span of trigonometric functions:

f_n(x) = \sum_{k=-n}^{n} c_k e^{ikx} = \text{proj}_{V_n}(f)

where V_n = \text{span}\{e^{ikx} : |k| \leq n\} and c_k = \langle f, e^{ikx} \rangle.

Example 7.35c: Best Polynomial Approximation

Find the best degree-2 polynomial approximation to f(x) = e^x on [-1, 1]:

Project e^x onto \text{span}\{1, x, x^2\} using the orthogonal Legendre polynomials P_0, P_1, P_2:

p(x) = \frac{\langle e^x, P_0 \rangle}{\langle P_0, P_0 \rangle} P_0 + \frac{\langle e^x, P_1 \rangle}{\langle P_1, P_1 \rangle} P_1 + \frac{\langle e^x, P_2 \rangle}{\langle P_2, P_2 \rangle} P_2

5. Projection in Function Spaces

Definition 7.16: L² Inner Product

On L^2[a, b], the inner product is:

\langle f, g \rangle = \int_a^b f(x) \overline{g(x)}\, dx
Example 7.36: Projection onto Polynomials

Project f(x) = |x| onto \text{span}\{1, x\} on [-1, 1]:

\langle |x|, 1 \rangle = \int_{-1}^1 |x|\, dx = 1

\langle |x|, x \rangle = \int_{-1}^1 x|x|\, dx = 0 (odd integrand)

\langle 1, 1 \rangle = 2 and \langle x, x \rangle = \tfrac{2}{3}, so \text{proj}(|x|) = \frac{\langle |x|, 1 \rangle}{\langle 1, 1 \rangle} \cdot 1 + \frac{\langle |x|, x \rangle}{\langle x, x \rangle} \cdot x = \frac{1}{2}

Theorem 7.29: Best Approximation in L²

Among all functions in the subspace U, the projection p = \text{proj}_U(f) minimizes:

\|f - p\|_2 = \left( \int_a^b |f(x) - p(x)|^2\, dx \right)^{1/2}
Example 7.36a: Fourier Projection

Project f(x) = x onto V_1 = \text{span}\{1, \cos x, \sin x\} on [0, 2\pi]:

Compute the Fourier coefficients:

  • a_0 = \frac{1}{2\pi}\int_0^{2\pi} x\, dx = \pi
  • a_1 = \frac{1}{\pi}\int_0^{2\pi} x \cos x\, dx = 0
  • b_1 = \frac{1}{\pi}\int_0^{2\pi} x \sin x\, dx = -2

So \text{proj}_{V_1}(x) = \pi - 2\sin x.
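
These coefficients can be checked numerically; here is a small sketch assuming SciPy's quad integrator is available:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: x   # the function being projected

a0, _ = quad(lambda x: f(x) / (2 * np.pi), 0, 2 * np.pi)
a1, _ = quad(lambda x: f(x) * np.cos(x) / np.pi, 0, 2 * np.pi)
b1, _ = quad(lambda x: f(x) * np.sin(x) / np.pi, 0, 2 * np.pi)

print(a0, a1, b1)   # approximately pi, 0, and -2
# Matches proj_{V_1}(x) = pi - 2 sin x from the worked example.
```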

Remark 7.19: Parseval and Projection

The error in projecting onto the first n Fourier modes is:

\|f - f_n\|^2 = \|f\|^2 - \sum_{|k| \leq n} |c_k|^2

This decreases as we add more modes (Bessel's inequality).

Example 7.36b: Legendre Approximation

The Legendre polynomials are orthogonal on [-1, 1]:

P_0 = 1, \quad P_1 = x, \quad P_2 = \frac{3x^2 - 1}{2}, \quad \ldots

Projection onto \text{span}\{P_0, \ldots, P_n\} gives the best polynomial approximation in the L^2 norm.
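
A rough sketch using NumPy's Legendre module; note that legfit performs a discrete least squares fit on sample points, which only approximates the continuous L² projection described above:

```python
import numpy as np
from numpy.polynomial import legendre

# Discrete stand-in for the L^2 projection of e^x onto degree <= 2 polynomials:
# a dense least squares fit in the Legendre basis on [-1, 1].
x = np.linspace(-1.0, 1.0, 2001)
coeffs = legendre.legfit(x, np.exp(x), deg=2)

print(coeffs)                                     # Legendre coefficients c0, c1, c2
print(legendre.legval(0.5, coeffs), np.exp(0.5))  # approximation vs. exact value
```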

6. Common Mistakes

Confusing projection with the projection formula

proj_U(v) is a vector. ⟨v, e⟩ is a scalar. Don't forget to multiply by e!

Using the projection formula with a non-orthonormal basis

Σ⟨v, eᵢ⟩eᵢ only works for an orthonormal basis. Use P = A(AᵀA)⁻¹Aᵀ for a general basis.

Forgetting to check full column rank

(AᵀA)⁻¹ exists only if A has full column rank. Otherwise, infinitely many solutions.

Confusing P with Pᵀ

For orthogonal projections, P = Pᵀ. But for general linear maps, this may not hold.

Wrong matrix in the projection formula

P = A(AᵀA)⁻¹Aᵀ; the inverse is taken of AᵀA (an n×n matrix), not of AAᵀ. The order matters!

Remark 7.20: Debugging Tips

When verifying projections:

  • Check P² = P (idempotent)
  • Check Pᵀ = P (symmetric)
  • Check Pv ∈ U (result is in subspace)
  • Check (v - Pv) ⊥ U (residual is orthogonal)

7. Key Formulas Summary

Projection Formulas

  • Orthonormal basis: \text{proj}_U(v) = \sum \langle v, e_i \rangle e_i
  • General basis: P = A(A^T A)^{-1} A^T
  • Onto a line: \frac{aa^T}{a^T a}v
  • Orthonormal Q: P = QQ^T

Least Squares

  • Normal equation: A^T A \hat{x} = A^T b
  • Solution: \hat{x} = (A^T A)^{-1} A^T b
  • Via QR: R\hat{x} = Q^T b
  • Residual: b - A\hat{x} \perp \text{col}(A)

Projection Matrix Properties

Property | Formula | Meaning
Idempotent | P^2 = P | Project twice = project once
Symmetric | P^T = P | Orthogonal projection
Eigenvalues | 0 and 1 only | Eigenspaces U and U^\perp
Complement | I - P | Projects onto U^\perp

8. Advanced Topics

Theorem 7.30: Projection and Operator Norm

For orthogonal projection PP:

\|P\|_{op} = 1 \quad \text{(unless } P = 0\text{)}

The operator norm of a nonzero orthogonal projection is exactly 1.

Definition 7.17: Oblique Projection

An oblique projection onto U along W satisfies:

  • P^2 = P (still idempotent)
  • \text{im}(P) = U, \text{ker}(P) = W
  • But W need not be U^\perp!
Example 7.37: Oblique vs Orthogonal

Project onto the x-axis along the line y = x (not perpendicular!):

P = \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}

Check: P^2 = P ✓, but P^T \neq P (not an orthogonal projection).

Theorem 7.31: Spectral Projections

For a self-adjoint A with eigenvalue \lambda and eigenspace E_\lambda:

P_\lambda = \text{proj}_{E_\lambda}

These projections are orthogonal and sum to the identity: \sum_\lambda P_\lambda = I.

Remark 7.21: Connection to Spectral Theorem

The Spectral Theorem says:

A = \sum_\lambda \lambda P_\lambda

Every self-adjoint operator is a weighted sum of orthogonal projections onto eigenspaces.

Example 7.37a: Projection in Quantum Mechanics

In quantum mechanics, observables are self-adjoint operators. Measurement projects the state onto an eigenspace:

|\psi\rangle \mapsto \frac{P_\lambda |\psi\rangle}{\|P_\lambda |\psi\rangle\|}

The probability of outcome \lambda is \|P_\lambda |\psi\rangle\|^2 (for a normalized state |\psi\rangle).

9. What's Next

Building on Projections

Spectral Theorem (LA-7.5)

Self-adjoint operators decompose into spectral projections onto eigenspaces.

SVD (LA-8.1)

Singular value decomposition generalizes spectral theory to all matrices.

Quadratic Forms (LA-8.2)

Projections help diagonalize quadratic forms and classify critical points.

Numerical Methods

Iterative methods like GMRES and conjugate gradients use projections.

Example 7.38: Practice Problem 1

Project v = (1, 2, 3, 4)^T onto U = \text{span}\{(1, 1, 0, 0)^T, (0, 0, 1, 1)^T\}:

Solution outline:

  1. Orthonormalize the basis (here the two vectors are already orthogonal, so only normalization is needed)
  2. Apply the projection formula: \text{proj}_U(v) = \sum \langle v, e_i \rangle e_i
Example 7.38a: Practice Problem 2

Find the least squares fit of y = ax^2 + b to (−1, 2), (0, 1), (1, 2), (2, 5):

Solution outline:

  1. Set up the design matrix A = [x_i^2 \mid 1]
  2. Form the normal equations A^T A \hat{x} = A^T y
  3. Solve for \hat{x} = (a, b)^T (verified in the sketch below)
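
A quick numerical check of this practice problem, sketched with NumPy:

```python
import numpy as np

# Fit y = a x^2 + b to (-1, 2), (0, 1), (1, 2), (2, 5)
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.0, 1.0, 2.0, 5.0])

A = np.column_stack([x**2, np.ones_like(x)])       # design matrix [x_i^2 | 1]
a_hat, b_hat = np.linalg.solve(A.T @ A, A.T @ y)   # normal equations

print(a_hat, b_hat)                        # 1.0 1.0  -> y = x^2 + 1
print(np.allclose(A @ [a_hat, b_hat], y))  # True: this parabola fits exactly
```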
Remark 7.22: Study Tips
  • Visualize: Draw pictures for 2D/3D projections
  • Verify: Always check P² = P and Pᵀ = P
  • Compare methods: Normal equations vs QR decomposition
  • Understand geometry: Projection = closest point

10. Quick Reference

Key Definitions

  • Projection: closest point in subspace
  • Residual: v − proj(v), perpendicular to U
  • Distance: ||v − proj(v)||
  • Least squares: minimize ||Ax − b||

Computation Steps

  1. Get orthonormal basis for U (Gram-Schmidt)
  2. Compute projection: Σ⟨v, eᵢ⟩eᵢ
  3. Or: P = QQᵀ, then Pv
  4. Verify: check P² = P, Pᵀ = P

Least Squares Steps

  1. Form design matrix A
  2. Compute AᵀA and Aᵀb
  3. Solve AᵀA x̂ = Aᵀb
  4. Or: A = QR, solve Rx̂ = Qᵀb

Key Results

  • Best Approximation Theorem
  • Projection is a linear operator
  • v = proj_U(v) + proj_U⊥(v)
  • ||Pv|| ≤ ||v||
Remark 7.23: Historical Note

The method of least squares was developed independently by Carl Friedrich Gauss (1795) and Adrien-Marie Legendre (1805). Gauss used it to predict the orbit of the asteroid Ceres, one of the great triumphs of mathematical astronomy.

Algorithm: Computing Projection

  1. Given: Vector v, subspace U (as basis vectors)
  2. Step 1: Orthonormalize basis using Gram-Schmidt → {e₁, ..., eₖ}
  3. Step 2: Compute coefficients cᵢ = ⟨v, eᵢ⟩
  4. Step 3: proj_U(v) = Σᵢ cᵢ eᵢ
  5. Step 4: Verify: check proj ∈ U and (v − proj) ⊥ U
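
A minimal Python implementation of the algorithm above, assuming NumPy; the function name project is illustrative. It reproduces Practice Problem 1 (Example 7.38):

```python
import numpy as np

def project(v, basis):
    """Orthonormalize `basis` with Gram-Schmidt, then return sum_i <v, e_i> e_i."""
    E = []
    for u in basis:
        w = np.array(u, dtype=float)
        for e in E:                      # subtract projections onto earlier e's
            w = w - np.dot(w, e) * e
        norm = np.linalg.norm(w)
        if norm > 1e-12:                 # skip (nearly) dependent vectors
            E.append(w / norm)
    return sum(np.dot(v, e) * e for e in E)

# Practice Problem 1 (Example 7.38)
v = np.array([1.0, 2.0, 3.0, 4.0])
basis = [np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 1.0])]

p = project(v, basis)
print(p)                                             # [1.5 1.5 3.5 3.5]
print(all(abs((v - p) @ u) < 1e-12 for u in basis))  # residual ⊥ U: True
```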

Notation Summary

proj_U(v) | Orthogonal projection of v onto U
P | Projection matrix (onto column space)
A⁺ | Moore-Penrose pseudoinverse
x̂ | Least squares solution
r | Residual (b − Ax̂)

11. More Worked Examples

Example 7.39: Complete 3D Projection Example

Project v = (2, 3, 5)^T onto the plane U = \text{span}\{(1, 0, 1)^T, (0, 1, 1)^T\}:

Step 1: Orthonormalize using Gram-Schmidt:

e_1 = \frac{1}{\sqrt{2}}(1, 0, 1)^T
u_2 = (0, 1, 1)^T - \frac{1}{2}(1, 0, 1)^T = (-\tfrac{1}{2}, 1, \tfrac{1}{2})^T
e_2 = \frac{1}{\sqrt{6}}(-1, 2, 1)^T

Step 2: Compute the coefficients:

c_1 = \langle v, e_1 \rangle = \frac{1}{\sqrt{2}}(2 + 5) = \frac{7}{\sqrt{2}}
c_2 = \langle v, e_2 \rangle = \frac{1}{\sqrt{6}}(-2 + 6 + 5) = \frac{9}{\sqrt{6}}

Step 3: Projection:

\text{proj}_U(v) = \frac{7}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}(1, 0, 1)^T + \frac{9}{\sqrt{6}} \cdot \frac{1}{\sqrt{6}}(-1, 2, 1)^T
= \frac{7}{2}(1, 0, 1)^T + \frac{3}{2}(-1, 2, 1)^T = (2, 3, 5)^T

Interesting: v is already in the plane, so \text{proj}_U(v) = v.

Example 7.40: Least Squares with Missing Data

Temperature readings with a missing value: fit T = at + b to t = 0, 1, 2, 3 with T = 15, ?, 21, 24.

Use only the available data points:

A = \begin{pmatrix} 0 & 1 \\ 2 & 1 \\ 3 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 15 \\ 21 \\ 24 \end{pmatrix}

Solve the normal equations for the best-fit line.

Example 7.41: Projection Matrix Computation

Find the projection matrix onto U = \text{span}\{(1, 1, 1)^T\}:

P = \frac{aa^T}{a^T a} = \frac{1}{3}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}

Verify: P^2 = P, P^T = P, eigenvalues are 1 (multiplicity 1) and 0 (multiplicity 2).

Example 7.42: Multivariate Regression

Fit z = \beta_0 + \beta_1 x + \beta_2 y to data points (x_i, y_i, z_i):

X = \begin{pmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix}, \quad z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}

Solve X^T X \hat{\beta} = X^T z for \hat{\beta} = (\beta_0, \beta_1, \beta_2)^T.

Example 7.43: Orthogonal Decomposition

Decompose v = (3, 4, 5)^T into parts parallel and perpendicular to u = (1, 2, 2)^T:

Parallel part:

v_\parallel = \text{proj}_u(v) = \frac{\langle v, u \rangle}{\|u\|^2} u = \frac{21}{9}(1, 2, 2)^T = \frac{7}{3}(1, 2, 2)^T

Perpendicular part:

v_\perp = v - v_\parallel = (3, 4, 5)^T - \frac{7}{3}(1, 2, 2)^T = (\tfrac{2}{3}, -\tfrac{2}{3}, \tfrac{1}{3})^T

Verify: \langle v_\perp, u \rangle = 0 ✓

12. Connections to Other Topics

Within Linear Algebra

Gram-Schmidt (LA-7.3)
  • Provides the orthonormal basis for the projection formula
  • Each step subtracts the projection onto previous vectors
  • QR decomposition encodes the projection coefficients
Spectral Theorem (LA-7.5)
  • Self-adjoint operators decompose into projections
  • Eigenspaces are pairwise orthogonal
  • A = Σλᵢ Pᵢ (spectral decomposition)
SVD (LA-8.1)
  • Generalizes eigendecomposition to all matrices
  • Pseudoinverse from the SVD
  • Low-rank approximation via projection
Four Fundamental Subspaces
  • col(A), null(A), row(A), null(Aᵀ)
  • Projections relate these subspaces
  • ℝⁿ = row(A) ⊕ null(A)
Theorem 7.32: Fundamental Theorem Connection

For A \in M_{m \times n}:

  • \text{proj}_{\text{col}(A)} sends b to A\hat{x} (the least squares fit)
  • \text{proj}_{\text{null}(A^T)} sends b to the residual b - A\hat{x}
  • These projections are complementary: P_{\text{col}(A)} + P_{\text{null}(A^T)} = I_m
Remark 7.24: Beyond Linear Algebra

Projections appear throughout mathematics:

  • Functional Analysis: Hilbert space projections, Riesz representation
  • Optimization: Projected gradient descent, alternating projections
  • Statistics: Regression, ANOVA decompositions, PCA
  • Quantum Mechanics: Measurement as projection, density operators

13. Chapter Summary

Key Takeaways

Orthogonal Projection
  • Closest point in the subspace
  • Formula: Σ⟨v, eᵢ⟩eᵢ (orthonormal basis)
  • Matrix form: P = A(AᵀA)⁻¹Aᵀ
  • Properties: P² = P, Pᵀ = P
Least Squares
  • Minimize ||Ax − b||
  • Normal equation: AᵀAx̂ = Aᵀb
  • Ax̂ = projection of b onto col(A)
  • Residual ⊥ col(A)
Applications
  • Linear/polynomial regression
  • Signal processing
  • Data fitting
  • Navigation systems
Computation
  • Normal equations (simple)
  • QR decomposition (stable)
  • SVD (most general)
  • Always verify results
Remark 7.25: Mastery Checklist

You've mastered orthogonal projections when you can:

  • ✓ Compute projections using orthonormal bases
  • ✓ Construct and verify projection matrices
  • ✓ Set up and solve least squares problems
  • ✓ Interpret the geometric meaning of projections
  • ✓ Apply least squares to regression problems
  • ✓ Choose between normal equations and QR method

Pro Tips for Exams

  • For projection onto a line, use the simplified formula: proj = (v·u/u·u)u
  • Always verify P² = P and Pᵀ = P for projection matrices
  • When fitting data, set up the design matrix carefully with correct dimensions
  • Remember: residual ⊥ column space (this is the key geometric insight)
  • QR is more stable, but normal equations are faster for hand computation

14. Additional Practice Examples

Example 7.44: Projection onto Kernel

Find the projection of v = (1, 2, 3)^T onto \text{null}(A) where A = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}:

Step 1: Find \text{null}(A): x + y + z = 0, so \text{null}(A) = \text{span}\{(-1, 1, 0)^T, (-1, 0, 1)^T\}

Step 2: Orthonormalize this basis, then project.

Alternative: use I - P_{\text{row}(A)}, where P_{\text{row}(A)} = \frac{1}{3}\mathbf{1}\mathbf{1}^T.
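
A short NumPy check of the alternative route I − P_row(A):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
one = np.ones(3)

P_row = np.outer(one, one) / 3        # projection onto row(A) = span{(1,1,1)}
p_null = (np.eye(3) - P_row) @ v      # projection of v onto null(A)

print(p_null)                          # [-1.  0.  1.]
print(np.isclose(one @ p_null, 0.0))   # satisfies x + y + z = 0: True
```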

Example 7.45: Angle Between Subspaces

Find the angle between the subspaces U = \text{span}\{(1, 0, 0)^T\} and W = \text{span}\{(1, 1, 0)^T\}:

The angle \theta satisfies:

\cos\theta = \|P_U P_W\|_{op} = \frac{|\langle e_U, e_W \rangle|}{\|e_U\|\|e_W\|} = \frac{1}{\sqrt{2}}

So \theta = 45°.

Example 7.46: Exponential Fit

Fit y = ce^{ax} to data by linearizing: \ln y = \ln c + ax

Set up least squares with Y = \ln y, then solve for (\ln c, a):

\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \ln c \\ a \end{pmatrix} \approx \begin{pmatrix} \ln y_1 \\ \vdots \\ \ln y_n \end{pmatrix}
Example 7.47: Sinusoidal Fit

Fit y = A\sin(x) + B\cos(x) to data points:

X = \begin{pmatrix} \sin x_1 & \cos x_1 \\ \vdots & \vdots \\ \sin x_n & \cos x_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}

Solve X^T X \hat{\beta} = X^T y for \hat{\beta} = (A, B)^T.

Remark 7.26: Choosing the Right Model

When fitting data:

  • Linear: y = ax + b for linear trends
  • Polynomial: y = \sum a_k x^k for smooth curves
  • Exponential: y = ce^{ax} for growth/decay
  • Trigonometric: y = \sum(a_k \cos kx + b_k \sin kx) for periodic data

All reduce to linear least squares with appropriate design matrix!

15. Computational Aspects

Method Comparison
Method | Complexity | Stability | Best For
Normal equations | O(mn^2 + n^3) | Poor | Well-conditioned problems
QR decomposition | O(mn^2) | Good | General use
SVD | O(mn^2) | Excellent | Rank-deficient problems
Iterative (LSQR) | O(mn) per iteration | Good | Very large sparse problems
Remark 7.27: Numerical Stability

The condition number \kappa(A) = \sigma_{\max}/\sigma_{\min} affects accuracy:

  • Normal equations: error ∝ κ(A)²
  • QR method: error ∝ κ(A)
  • For ill-conditioned problems (large κ), use QR or SVD!
Example 7.48: Ill-Conditioned Example

Polynomial fitting with high-degree polynomials leads to Vandermonde matrices with very large condition numbers:

A = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{pmatrix}

For better conditioning, use orthogonal polynomials (Legendre, Chebyshev) instead of monomials.
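
An illustrative NumPy comparison of condition numbers for monomial and Legendre design matrices; the degree and sample points below are arbitrary choices:

```python
import numpy as np
from numpy.polynomial import legendre

# Degree-12 design matrices on 50 equispaced points in [-1, 1]
x = np.linspace(-1.0, 1.0, 50)
deg = 12

V_mono = np.vander(x, deg + 1, increasing=True)   # columns 1, x, ..., x^12
V_leg = legendre.legvander(x, deg)                # columns P_0(x), ..., P_12(x)

print(f"monomial basis: cond = {np.linalg.cond(V_mono):.2e}")
print(f"Legendre basis: cond = {np.linalg.cond(V_leg):.2e}")
# The Legendre condition number is smaller by several orders of magnitude.
```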

Remark 7.28: Regularization Techniques

When the problem is ill-posed, add regularization:

  • Ridge regression: minimize \|Ax - b\|^2 + \lambda\|x\|^2
  • LASSO: minimize \|Ax - b\|^2 + \lambda\|x\|_1 (promotes sparsity)
  • Truncated SVD: keep only the largest singular values
Orthogonal Projections Practice
  1. The orthogonal projection of v onto a subspace U is: (Easy)
  2. If \{e_1, \ldots, e_k\} is an orthonormal basis for U, then \text{proj}_U(v) = ? (Medium)
  3. The projection matrix P onto the column space of A is: (Medium)
  4. The normal equation for least squares Ax \approx b is: (Medium)
  5. For an orthogonal projection P, which is true? (Medium)
  6. If v \in U, then \text{proj}_U(v) = ? (Easy)
  7. The residual v - \text{proj}_U(v) is: (Medium)
  8. Least squares minimizes: (Easy)
  9. The projection onto the line through a unit vector u is: (Easy)
  10. For overdetermined Ax = b, least squares gives: (Medium)
  11. I - P projects onto: (Medium)
  12. In linear regression y = ax + b, we minimize: (Medium)

Frequently Asked Questions

What is orthogonal projection geometrically?

It's 'dropping a perpendicular' from v to the subspace U. The projection is where the perpendicular meets U, and the residual v - proj is perpendicular to U.

Why is the projection the closest point?

By the Pythagorean theorem: ||v - w||² = ||v - proj||² + ||proj - w||² for any w in U. Since the second term is non-negative, ||v - w|| ≥ ||v - proj||.

How does projection relate to least squares?

Ax = b has no solution when b ∉ col(A). The 'best' x makes Ax = proj_{col(A)}(b), the projection of b onto column space. This minimizes ||Ax - b||.

Why is the normal equation called 'normal'?

'Normal' means perpendicular. The equation AᵀAx = Aᵀb ensures the residual b - Ax is orthogonal (normal) to col(A), which is the defining property of projection.

When does AᵀA have an inverse?

AᵀA is invertible iff A has full column rank (columns are linearly independent). This is necessary for a unique least squares solution.

What if A doesn't have full rank?

There are infinitely many least squares solutions. The minimum-norm solution uses the pseudoinverse: x = A⁺b. Or add regularization (ridge regression).

How is projection used in signal processing?

Filtering is projection! Keeping only certain frequency components projects the signal onto the subspace spanned by those frequencies.

What's the connection to regression?

Linear regression y = Xβ + ε is least squares: minimize ||y - Xβ||². The normal equation Xᵀy = XᵀXβ gives β = (XᵀX)⁻¹Xᵀy.

Can I project onto infinite-dimensional subspaces?

Yes, in Hilbert spaces. For closed subspaces, orthogonal projections exist. Fourier truncation (keeping first n terms) is projection onto finite-dimensional subspace.

How do projections relate to eigenvalues?

Projection matrices have eigenvalues 0 and 1 only. Eigenvalue 1 corresponds to vectors in U, eigenvalue 0 to vectors in U⊥.