MathIsimple
Foundation Topic
4-6 Hours

Multivariate Statistics Fundamentals

Master the foundational concepts of multivariate analysis: random vectors, covariance matrices, and sample statistics

Learning Objectives
Understand multivariate data representation and data matrices
Review essential matrix algebra operations
Define random vectors, mean vectors, and covariance matrices
Compute sample statistics for multivariate data
Understand properties of linear combinations
Calculate and interpret Mahalanobis distance

Introduction to Multivariate Data

What is Multivariate Analysis?

Multivariate analysis is the statistical analysis of data involving multiple variables measured on the same observations. It considers relationships among three or more variables simultaneously.

Data Reduction

Simplify complex data (PCA, Factor Analysis)

Classification

Group observations (Discriminant, Cluster Analysis)

Data Matrix Representation

Multivariate data is organized in a data matrix where rows represent observations and columns represent variables:

\mathbf{X}_{n \times p} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Dimensions

n = observations (rows), p = variables (columns)

Row Vector

\mathbf{x}_i^T = (x_{i1}, x_{i2}, \ldots, x_{ip})
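A minimal NumPy sketch of this layout, using a small made-up dataset (the numbers are illustrative, not from this page):

```python
import numpy as np

# Hypothetical data matrix X with n = 4 observations (rows) and p = 3 variables (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

n, p = X.shape      # n = 4, p = 3
x1 = X[0, :]        # first row vector x_1^T = (x_11, x_12, x_13)
print(n, p, x1)
```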

Matrix Algebra Review

Basic Matrix Operations

Transpose

(\mathbf{A}^T)_{ij} = a_{ji}

Trace

\text{tr}(\mathbf{A}) = \sum_{i=1}^n a_{ii}
Eigenvalues and Eigenvectors
\mathbf{A}\mathbf{e} = \lambda \mathbf{e}

If \mathbf{e} is non-zero and satisfies this equation, then \lambda is an eigenvalue and \mathbf{e} is the corresponding eigenvector.

Key Properties
  • Sum of eigenvalues = trace: \sum_i \lambda_i = \text{tr}(\mathbf{A})
  • Product of eigenvalues = determinant: \prod_i \lambda_i = |\mathbf{A}|
Positive Definite Matrices

A symmetric matrix \mathbf{A} is positive definite if:

\mathbf{x}^T\mathbf{A}\mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}

Equivalent condition: all eigenvalues \lambda_i > 0
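A short NumPy check of these facts on a small illustrative symmetric matrix, assuming eigenvalues are computed with np.linalg.eigh:

```python
import numpy as np

# Illustrative symmetric matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # eigh is intended for symmetric matrices

# Key properties: sum of eigenvalues = trace, product of eigenvalues = determinant.
print(np.isclose(eigvals.sum(), np.trace(A)))        # True
print(np.isclose(eigvals.prod(), np.linalg.det(A)))  # True

# Positive definite <=> all eigenvalues > 0.
print(np.all(eigvals > 0))                           # True for this A
```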

Random Vectors & Covariance

Mean Vector
\boldsymbol{\mu} = E[\mathbf{X}] = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}

Linear Transformation

E[\mathbf{A}\mathbf{X} + \mathbf{b}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}
Covariance Matrix
\boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}

Properties

  • Symmetric: \boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T
  • Positive semi-definite

Transformation

\text{Cov}(\mathbf{A}\mathbf{X}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T
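A quick sketch, with an illustrative \boldsymbol{\Sigma} and \mathbf{A}, that computes \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T and compares it with the sample covariance of simulated transformed data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population covariance and transformation matrix.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0,  1.0],
              [1.0, -1.0]])

cov_AX = A @ Sigma @ A.T          # theoretical Cov(AX) = A Sigma A^T
print(cov_AX)

# Empirical check: simulate X with covariance Sigma, transform, and compare.
X = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=100_000)
print(np.cov(X @ A.T, rowvar=False))   # close to cov_AX
```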

Sample Statistics

Sample Mean Vector & Covariance Matrix

Sample Mean Vector

\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i

Sample Covariance Matrix

\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T

Note: Division by n-1 (Bessel's correction) gives an unbiased estimator: E[\mathbf{S}] = \boldsymbol{\Sigma}
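A small NumPy sketch, using a hypothetical data matrix, that computes \bar{\mathbf{x}} and \mathbf{S} directly from the formulas above and checks them against np.cov:

```python
import numpy as np

# Hypothetical data matrix: n = 5 observations, p = 3 variables.
X = np.array([
    [2.0, 4.1, 0.5],
    [1.8, 3.9, 0.7],
    [2.4, 4.6, 0.4],
    [2.1, 4.0, 0.6],
    [1.9, 4.2, 0.5],
])
n = X.shape[0]

x_bar = X.mean(axis=0)           # sample mean vector
Xc = X - x_bar                   # centered data
S = Xc.T @ Xc / (n - 1)          # sample covariance with Bessel's correction

print(np.allclose(S, np.cov(X, rowvar=False)))  # matches NumPy's built-in (ddof = 1)
```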

Mahalanobis Distance

The Mahalanobis distance measures distance accounting for correlations between variables:

d^2(\mathbf{x}, \boldsymbol{\mu}) = (\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})

When \boldsymbol{\Sigma} = \mathbf{I}

Reduces to squared Euclidean distance

Applications

Outlier detection, classification, clustering
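A minimal sketch of the squared Mahalanobis distance with an illustrative \boldsymbol{\mu} and \boldsymbol{\Sigma}; the helper name mahalanobis_sq is just a label for this example:

```python
import numpy as np

# Illustrative mean vector and covariance matrix.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance d^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)   # solve avoids forming the explicit inverse

x = np.array([2.0, 1.0])
print(mahalanobis_sq(x, mu, Sigma))

# With Sigma = I the result reduces to the squared Euclidean distance.
print(mahalanobis_sq(x, mu, np.eye(2)), np.sum((x - mu) ** 2))
```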

Spectral Decomposition

Spectral Theorem for Symmetric Matrices

Any symmetric matrix \mathbf{A} can be decomposed as:

\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}^T = \sum_{i=1}^{p} \lambda_i \mathbf{e}_i\mathbf{e}_i^T

\mathbf{P} is orthogonal

\mathbf{P}^T\mathbf{P} = \mathbf{P}\mathbf{P}^T = \mathbf{I}

\boldsymbol{\Lambda} is diagonal

\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_p)

Square Root of Positive Definite Matrix

\mathbf{A}^{1/2} = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}^T \quad \text{where} \quad \boldsymbol{\Lambda}^{1/2} = \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_p})
Powers and Inverse via Spectral Decomposition

Matrix Powers

\mathbf{A}^k = \mathbf{P}\boldsymbol{\Lambda}^k\mathbf{P}^T

Matrix Inverse

\mathbf{A}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}^T

Determinant via Eigenvalues

|\mathbf{A}| = \prod_{i=1}^{p} \lambda_i

The determinant equals the product of all eigenvalues.
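A short NumPy illustration of these spectral-decomposition identities on an illustrative symmetric positive definite matrix:

```python
import numpy as np

# Illustrative symmetric positive definite matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

lam, P = np.linalg.eigh(A)          # A = P diag(lam) P^T

# Reconstruction, square root, and inverse via the spectrum.
A_rebuilt = P @ np.diag(lam) @ P.T
A_sqrt    = P @ np.diag(np.sqrt(lam)) @ P.T
A_inv     = P @ np.diag(1.0 / lam) @ P.T

print(np.allclose(A_rebuilt, A))                   # True
print(np.allclose(A_sqrt @ A_sqrt, A))             # A^{1/2} A^{1/2} = A
print(np.allclose(A_inv, np.linalg.inv(A)))        # True
print(np.isclose(lam.prod(), np.linalg.det(A)))    # determinant = product of eigenvalues
```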

Partitioned Matrices

Block Matrix Representation

A covariance matrix can be partitioned to analyze subsets of variables:

\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}

\boldsymbol{\Sigma}_{11}: p_1 \times p_1

Covariance within first group

\boldsymbol{\Sigma}_{12}: p_1 \times p_2

Cross-covariance between groups

Symmetry Property

\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^T

Inverse of Partitioned Matrix

The inverse of a partitioned positive definite matrix involves the Schur complement:

Schur Complement

\boldsymbol{\Sigma}_{11 \cdot 2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}

Interpretation

For a multivariate normal distribution, the Schur complement \boldsymbol{\Sigma}_{11 \cdot 2} is the conditional covariance matrix of \mathbf{X}^{(1)} given \mathbf{X}^{(2)}.
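A small sketch computing the Schur complement for an illustrative 3×3 covariance matrix partitioned with p_1 = 1 and p_2 = 2:

```python
import numpy as np

# Illustrative covariance matrix, partitioned as p1 = 1, p2 = 2.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
p1 = 1
S11 = Sigma[:p1, :p1]
S12 = Sigma[:p1, p1:]
S21 = Sigma[p1:, :p1]
S22 = Sigma[p1:, p1:]

# Schur complement: Sigma_{11.2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
schur = S11 - S12 @ np.linalg.solve(S22, S21)
print(schur)   # conditional covariance of X^(1) given X^(2) under normality
```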

Generalized Variance

Measures of Total Variability

Total Variance

\text{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^{p} \sigma_{ii} = \sum_{i=1}^{p} \lambda_i

Sum of all variances

Generalized Variance

|\boldsymbol{\Sigma}| = \prod_{i=1}^{p} \lambda_i

Product of eigenvalues

Geometric Interpretation

The generalized variance |\boldsymbol{\Sigma}| is proportional to the squared volume of the concentration ellipsoid. It measures how "spread out" the data is in all directions simultaneously.

When |\boldsymbol{\Sigma}| = 0

The variables are linearly dependent (at least one eigenvalue is zero). The data lies in a lower-dimensional subspace.

Concentration Ellipsoid

For multivariate data, the concentration ellipsoid generalizes the concept of confidence intervals:

\{\mathbf{x}: (\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \leq c^2\}

Principal Axes

Directions given by eigenvectors \mathbf{e}_i

Semi-axis Lengths

Proportional to c\sqrt{\lambda_i}
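A brief NumPy sketch, on an illustrative 2×2 covariance matrix, relating total variance, generalized variance, and the ellipsoid's semi-axis lengths:

```python
import numpy as np

# Illustrative covariance matrix.
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

lam, E = np.linalg.eigh(Sigma)

total_variance = np.trace(Sigma)              # = sum of eigenvalues
generalized_variance = np.linalg.det(Sigma)   # = product of eigenvalues
print(total_variance, lam.sum())
print(generalized_variance, lam.prod())

# Concentration ellipsoid for a chosen c: axes along the eigenvectors (columns of E),
# semi-axis lengths proportional to c * sqrt(lambda_i).
c = 2.0
print(c * np.sqrt(lam))
```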

Linear Combinations

Properties of Linear Combinations

For Y = \mathbf{a}^T\mathbf{X} = a_1X_1 + a_2X_2 + \cdots + a_pX_p:

Mean

E[Y] = \mathbf{a}^T\boldsymbol{\mu}

Variance

\text{Var}(Y) = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a}

Multiple Linear Combinations

For \mathbf{Y} = \mathbf{A}\mathbf{X}:

\text{Cov}(\mathbf{Y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T
Covariance Between Linear Combinations

For two linear combinations Y_1 = \mathbf{a}^T\mathbf{X} and Y_2 = \mathbf{b}^T\mathbf{X}:

\text{Cov}(Y_1, Y_2) = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{b}

Special Case: Uncorrelated Combinations

Y_1 and Y_2 are uncorrelated if \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{b} = 0
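A compact sketch of these formulas with an illustrative mean vector, covariance matrix, and coefficient vectors \mathbf{a}, \mathbf{b}:

```python
import numpy as np

# Illustrative mean vector and covariance matrix for X = (X1, X2, X3)^T.
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

a = np.array([1.0, -1.0, 0.0])
b = np.array([0.0,  1.0, 1.0])

mean_Y1   = a @ mu            # E[a^T X]
var_Y1    = a @ Sigma @ a     # Var(a^T X)
cov_Y1_Y2 = a @ Sigma @ b     # Cov(a^T X, b^T X); zero would mean uncorrelated
print(mean_Y1, var_Y1, cov_Y1_Y2)

# Several combinations at once: Y = A X, Cov(Y) = A Sigma A^T.
A = np.vstack([a, b])
print(A @ Sigma @ A.T)
```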

Maximum Variance Direction

Among all unit-length linear combinations Y = \mathbf{a}^T\mathbf{X} with \|\mathbf{a}\| = 1:

\max_{\|\mathbf{a}\|=1} \text{Var}(\mathbf{a}^T\mathbf{X}) = \lambda_1

achieved when \mathbf{a} = \mathbf{e}_1 (the eigenvector of the largest eigenvalue \lambda_1)

Foundation for PCA

This result is the theoretical basis for Principal Component Analysis. The first PC is the direction of maximum variance.
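A quick numerical check, on an illustrative covariance matrix, that the leading eigenvector attains the maximum variance \lambda_1 among unit vectors:

```python
import numpy as np

# Illustrative covariance matrix.
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

lam, E = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
e1 = E[:, -1]                    # eigenvector of the largest eigenvalue
lambda1 = lam[-1]

# Variance of a^T X for a = e1 equals lambda_1.
print(np.isclose(e1 @ Sigma @ e1, lambda1))

# Any other unit vector gives variance at most lambda_1.
rng = np.random.default_rng(1)
a = rng.normal(size=2)
a /= np.linalg.norm(a)
print(a @ Sigma @ a <= lambda1 + 1e-12)
```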

Random Vectors and Sampling

Random Sampling

Sample of n observations from p-variate distribution:

\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n \sim \text{i.i.d. } F

Sample mean vector: \bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. Given two matrices \mathbf{A}_{2 \times 3} and \mathbf{B}_{3 \times 4}, what is the dimension of the product \mathbf{AB}?
2. For a random vector \mathbf{X} = (X_1, X_2, X_3)^T, the covariance matrix \boldsymbol{\Sigma} has dimension:
3. The sample mean vector \bar{\mathbf{x}} for a dataset with n observations is computed as:
4. A matrix \mathbf{A} is positive definite if and only if:
5. The covariance matrix \boldsymbol{\Sigma} of a random vector satisfies which property?
6. For a linear combination Y = \mathbf{a}^T\mathbf{X} where \mathbf{X} has covariance \boldsymbol{\Sigma}, the variance of Y is:
7. The squared Mahalanobis distance between point \mathbf{x} and mean \boldsymbol{\mu} is:
8. The correlation matrix \mathbf{R} is related to the covariance matrix \boldsymbol{\Sigma} by:
9. For a centered data matrix \mathbf{X}_{n \times p}, the sample covariance matrix is:
10. If the eigenvalues of a 3 \times 3 covariance matrix are \lambda_1 = 5, \lambda_2 = 3, \lambda_3 = 2, the total variance is:

Frequently Asked Questions

What is the difference between covariance and correlation matrices?

The covariance matrix contains raw covariances between variables, which depend on the scales of measurement. The correlation matrix standardizes these to values between -1 and 1, making it easier to compare relationships across variables with different scales. The correlation matrix has 1s on its diagonal.

Why is the Mahalanobis distance important?

Unlike Euclidean distance, Mahalanobis distance accounts for correlations between variables and scales by the variance structure. It measures how many standard deviations away a point is from the center, making it useful for outlier detection and classification in multivariate data.

Why must covariance matrices be positive semi-definite?

The covariance matrix must be positive semi-definite because variances cannot be negative. For any linear combination of variables, the variance must be non-negative: \text{Var}(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a} \geq 0. This property ensures the covariance matrix represents a valid probability distribution.

What do eigenvalues of a covariance matrix represent?

The eigenvalues of a covariance matrix represent the variance along the principal axes of the data distribution. Larger eigenvalues indicate directions with more variability. The sum of eigenvalues equals the total variance, and this concept is fundamental to Principal Component Analysis (PCA).

When should I use n vs n-1 in sample statistics?

Use n-1 (Bessel's correction) when computing the sample covariance matrix to get an unbiased estimator of the population covariance. Use n when you want the maximum likelihood estimate or when working with the entire population. Most statistical software uses n-1 by default.
