Learn Fisher's linear discriminant for classification through optimal projection, from theory to practical applications in wine quality assessment and customer segmentation
Linear Discriminant Analysis (LDA), also known as Fisher's Linear Discriminant, is a classification method that finds a linear combination of features that best separates two or more classes. Unlike logistic regression, which directly models P(y|x), LDA is a generative model: it models the distribution of the features within each class.
LDA seeks to find a projection direction w such that:
Maximize: J(w) = (wᵀ S_b w) / (wᵀ S_w w)
where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix.
1. Fisher's Discriminant: Find projection maximizing class separation
2. Probabilistic View: Model each class as Gaussian distribution with shared covariance, then use Bayes' theorem for classification
Understanding scatter matrices and the optimization objective
Measures how spread out samples are within each class. We want to minimize this (tight clusters):
S_w = Σ_{k=0,1} Σ_{x ∈ class k} (x - μ_k)(x - μ_k)ᵀ
where μ_k is the mean of class k. This measures variance within each class.
Measures how far apart the class means are. We want to maximize this (well-separated classes):
S_b = (μ_0 - μ_1)(μ_0 - μ_1)ᵀ
For binary classification, this simplifies to the outer product of the difference between class means.
The optimal direction w* that maximizes the Fisher criterion J(w) = (wᵀ S_b w) / (wᵀ S_w w) is:
w* = S_w⁻¹(μ_0 - μ_1)
This elegant solution shows that the optimal direction is proportional to the inverse of the within-class scatter times the difference between class means.
Intuition: If classes have small within-class variance (S_w small), we can separate them easily. If class means are far apart (μ_0 - μ_1 large), separation is easier. The formula combines both factors optimally.
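To make the closed-form solution concrete, here is a minimal NumPy sketch that builds S_w and S_b for a two-class problem and solves for w*. The two Gaussian blobs are synthetic stand-in data, not from the wine or customer examples below.

```python
# Minimal NumPy sketch of Fisher's closed-form solution for two classes.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # class 0 samples (toy data)
X1 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))  # class 1 samples (toy data)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of outer products of centered samples, per class
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Between-class scatter: outer product of the difference between class means
diff = (mu0 - mu1).reshape(-1, 1)
S_b = diff @ diff.T

# Optimal projection direction: w* = S_w^{-1} (mu0 - mu1)
w = np.linalg.solve(S_w, mu0 - mu1)

# Fisher criterion J(w) evaluated at the optimum
J = (w @ S_b @ w) / (w @ S_w @ w)
print("w* =", w, " J(w*) =", J)
```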
Once we have w*, we project a new sample x and compare to a threshold:
Project sample:
z = wᵀx
Classification threshold:
threshold = wᵀ((μ_0 + μ_1) / 2)
Classify as class 0 if:
z > threshold, else class 1 (with w* = S_w⁻¹(μ_0 - μ_1), class 0 has the larger projected mean, so it lies above the midpoint)
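A small helper in the same spirit: given the direction w and the two class means from the step above, it applies the projection and midpoint-threshold rule.

```python
import numpy as np

def lda_predict(x, w, mu0, mu1):
    """Return 0 if the projection of x lands on class 0's side of the midpoint."""
    threshold = w @ ((mu0 + mu1) / 2.0)
    z = w @ x
    # With w = S_w^{-1}(mu0 - mu1), class 0 has the larger projected mean
    return 0 if z > threshold else 1
```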
Using LDA to classify wine quality from chemical properties
A wine producer has measurements from 500 bottles, classified into three quality tiers. Sample:
| ID | Alcohol % | Acidity | pH | Sugar g/L | Quality |
|---|---|---|---|---|---|
| 1 | 12.8 | 2.8 | 3.2 | 2.1 | Good |
| 2 | 13.5 | 3.2 | 3.1 | 1.8 | Excellent |
| 3 | 11.2 | 2.5 | 3.5 | 5.2 | Fair |
| 4 | 13.8 | 3.0 | 3.0 | 1.5 | Excellent |
| 5 | 12.0 | 2.6 | 3.4 | 3.8 | Good |
| 6 | 10.8 | 2.3 | 3.6 | 6.5 | Fair |
| 7 | 13.2 | 3.1 | 3.1 | 2.0 | Excellent |
| 8 | 11.8 | 2.7 | 3.3 | 4.2 | Good |
LDA finds two discriminant directions (for 3 classes, we get at most K-1 = 2 directions):
First Discriminant (explains 69% of the between-class variance):
LD1 = 0.48 × alcohol + 0.35 × acidity - 0.62 × pH - 0.45 × sugar
Second Discriminant (explains 31% of the between-class variance):
LD2 = 0.22 × alcohol - 0.58 × acidity + 0.25 × pH + 0.73 × sugar
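For reference, this is roughly how the multi-class fit could be reproduced with scikit-learn on the eight-row sample above. Because the LD1/LD2 coefficients quoted here come from the full 500-bottle dataset, the numbers printed by this sketch will differ; it only illustrates the workflow.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# columns: alcohol %, acidity, pH, sugar (g/L) -- the sample rows from the table above
X = np.array([
    [12.8, 2.8, 3.2, 2.1],
    [13.5, 3.2, 3.1, 1.8],
    [11.2, 2.5, 3.5, 5.2],
    [13.8, 3.0, 3.0, 1.5],
    [12.0, 2.6, 3.4, 3.8],
    [10.8, 2.3, 3.6, 6.5],
    [13.2, 3.1, 3.1, 2.0],
    [11.8, 2.7, 3.3, 4.2],
])
y = np.array(["Good", "Excellent", "Fair", "Excellent",
              "Good", "Fair", "Excellent", "Good"])

lda = LinearDiscriminantAnalysis(n_components=2)  # at most K-1 = 2 discriminants
Z = lda.fit_transform(X, y)                       # samples projected onto LD1, LD2

print(lda.explained_variance_ratio_)   # share of between-class variance per discriminant
print(lda.scalings_[:, :2])            # feature weights defining LD1 and LD2
print(lda.predict([[12.5, 2.9, 3.2, 2.5]]))  # classify a hypothetical new bottle
```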
Using LDA for e-commerce customer classification
An online retailer wants to segment customers into High-Value, Medium-Value, and Low-Value groups to personalize marketing campaigns. Sample data:
| ID | Monthly Spend ($) | Orders/Month | Avg Order ($) | Tenure (mo) | Segment |
|---|---|---|---|---|---|
| 1 | 450 | 12 | 37.50 | 24 | High-Value |
| 2 | 85 | 2 | 42.50 | 6 | Low-Value |
| 3 | 280 | 8 | 35.00 | 18 | Medium-Value |
| 4 | 520 | 15 | 34.70 | 36 | High-Value |
| 5 | 120 | 3 | 40.00 | 8 | Low-Value |
| 6 | 310 | 9 | 34.40 | 20 | Medium-Value |
LDA reveals that customer value is driven primarily by purchase frequency and the monthly spend it generates, rather than by average order value. This suggests marketing should focus on increasing purchase frequency for medium-value customers rather than just order value.
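A hedged sketch of the same workflow on the six-row customer sample: standardizing the features first makes the discriminant weights comparable, so their magnitudes indicate which variables drive the segmentation. With so few rows, the numbers are purely illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# columns: monthly spend ($), orders per month, avg order ($), tenure (months)
X = np.array([
    [450, 12, 37.5, 24],
    [ 85,  2, 42.5,  6],
    [280,  8, 35.0, 18],
    [520, 15, 34.7, 36],
    [120,  3, 40.0,  8],
    [310,  9, 34.4, 20],
])
y = np.array(["High", "Low", "Medium", "High", "Low", "Medium"])

model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=1))
model.fit(X, y)

# Weights on the standardized features for the first discriminant: a larger
# magnitude means a bigger contribution to separating the value segments.
print(model.named_steps["lineardiscriminantanalysis"].scalings_[:, 0])
```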
Understanding LDA's requirements and limitations
LDA assumes features follow a multivariate normal (Gaussian) distribution within each class.
Check: Use Q-Q plots, Shapiro-Wilk test, or visualize feature distributions by class. Moderate violations are often acceptable.
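One way to run that check, assuming X holds the feature matrix and y the class labels, is a per-class, per-feature Shapiro-Wilk test with SciPy:

```python
import numpy as np
from scipy import stats

def shapiro_by_class(X, y, alpha=0.05):
    """Flag feature/class pairs whose distribution deviates from normality."""
    for label in np.unique(y):
        X_c = X[y == label]
        for j in range(X_c.shape[1]):
            stat, p = stats.shapiro(X_c[:, j])
            verdict = "violates" if p < alpha else "looks"
            print(f"class={label} feature={j}: p={p:.3f} ({verdict} normal)")
```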
All classes share the same covariance matrix. This is a strong assumption that's often violated.
Check: Compare covariance matrices between classes. If very different, consider Quadratic Discriminant Analysis (QDA) instead.
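A quick numeric version of that comparison, again assuming X and y exist: compute each class covariance and its Frobenius distance to the pooled within-class covariance. Large gaps are a hint to try QDA.

```python
import numpy as np

def covariance_gap(X, y):
    """Per-class Frobenius distance between the class covariance and the pooled one."""
    labels = np.unique(y)
    covs = {c: np.cov(X[y == c], rowvar=False) for c in labels}
    # Pooled within-class covariance: weighted average of the per-class covariances
    weights = {c: (np.sum(y == c) - 1) for c in labels}
    pooled = sum(weights[c] * covs[c] for c in labels) / sum(weights.values())
    return {c: np.linalg.norm(covs[c] - pooled, ord="fro") for c in labels}
```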
Samples are independent and identically distributed (i.i.d.).
Check: Watch for temporal dependencies, duplicate samples, or hierarchical data structures that violate independence.
LDA is sensitive to outliers because they heavily influence mean and covariance estimates.
Check: Use box plots, z-scores, or isolation forest to detect and handle outliers before applying LDA.
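For example, an IsolationForest screen could look like the sketch below; the 5% contamination rate is an arbitrary starting point, not a recommendation.

```python
from sklearn.ensemble import IsolationForest

def drop_outliers(X, y, contamination=0.05, seed=0):
    """Keep only the samples IsolationForest marks as inliers (+1) before fitting LDA."""
    mask = IsolationForest(contamination=contamination, random_state=seed).fit_predict(X) == 1
    return X[mask], y[mask]
```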
| Scenario | Use LDA When... | Use Logistic Regression When... |
|---|---|---|
| Data Distribution | Features are approximately Gaussian | No assumptions about feature distribution |
| Number of Classes | 3+ classes (naturally multi-class) | Binary classification (or use OvR/OvO) |
| Sample Size | Small to medium (if assumptions met) | Any size, especially large datasets |
| Class Separation | Classes are well-separated | Any level of separation |
| Dimensionality Reduction | Need low-dimensional projection | Classification only |
| Computational Speed | Need fast training (closed-form) | Can afford iterative optimization |
Start with logistic regression as your default for binary classification—it's more robust and flexible. Try LDA when you have multi-class problems, need dimensionality reduction, or have strong evidence that Gaussian assumptions hold. Always compare both methods empirically using cross-validation on your specific dataset.
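A minimal cross-validation comparison might look like this; the bundled Iris data stands in for your own X and y, and 5-fold accuracy is just a reasonable default metric.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset for the comparison

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```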
Advanced variants and related techniques
Quadratic Discriminant Analysis (QDA) relaxes the equal-covariance assumption: each class gets its own covariance matrix, which produces quadratic (curved) decision boundaries instead of linear ones.
When to use:
Classes have different covariance structures, non-linear boundaries, or you have enough data to estimate separate covariances reliably.
Trade-off:
More flexible but requires O(d²K) parameters vs O(d²) for LDA. Needs more training data and is more prone to overfitting.
Regularized Discriminant Analysis (RDA) interpolates between LDA and QDA using a regularization parameter α:
Σ_k(α) = α × Σ_k + (1-α) × Σ_pooled
When α=0, equivalent to LDA. When α=1, equivalent to QDA. Choose α via cross-validation to balance bias-variance tradeoff.
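A direct NumPy rendering of that interpolation, for illustration. Note this is not the same as scikit-learn's `shrinkage` option for LDA, which shrinks toward a scaled identity matrix rather than blending per-class and pooled covariances.

```python
import numpy as np

def regularized_covariances(X, y, alpha):
    """Return Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled for each class."""
    labels = np.unique(y)
    covs = {c: np.cov(X[y == c], rowvar=False) for c in labels}
    # Pooled within-class covariance (the LDA estimate, recovered when alpha = 0)
    weights = {c: (np.sum(y == c) - 1) for c in labels}
    pooled = sum(weights[c] * covs[c] for c in labels) / sum(weights.values())
    return {c: alpha * covs[c] + (1 - alpha) * pooled for c in labels}
```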
LDA can project high-dimensional data to at most (K-1) dimensions while preserving class separability, where K is the number of classes. This is powerful for visualization and feature extraction.
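In scikit-learn this is just a fit_transform with n_components set to at most K-1; the bundled Iris data (K = 3 classes) serves as a stand-in here.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(Z.shape)  # (150, 2): each sample now lives in the 2-D discriminant space
```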
PCA is unsupervised and finds directions of maximum variance regardless of class labels. LDA is supervised and finds directions that maximize class separation. For classification tasks, LDA typically provides more discriminative low-dimensional representations. PCA can discard important class information if most variance is within-class rather than between-class.
No, standard LDA requires n > d; otherwise S_w becomes singular and cannot be inverted. Solutions: (1) Use regularized LDA (adds λI to S_w), (2) Reduce dimensionality with PCA first, then apply LDA, (3) Use alternative methods like penalized LDA, or (4) Switch to algorithms that handle high-dimensional data better (e.g., SVM, random forests).
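A sketch of option (2): reduce with PCA first, then apply LDA inside a pipeline. The 50-component cap is an arbitrary placeholder; pick it by cross-validation. (scikit-learn's LinearDiscriminantAnalysis also exposes a `shrinkage` option with the 'lsqr' or 'eigen' solvers as one form of regularized LDA.)

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# PCA shrinks the feature space so S_w is no longer singular, then LDA classifies
pca_lda = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())
# pca_lda.fit(X_train, y_train)  # X_train may originally have more features than samples
```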
LDA assumes continuous features. For categorical variables: (1) Use one-hot encoding for nominal categories, (2) Use ordinal encoding if categories have natural ordering, (3) Consider alternatives like Naive Bayes which handles categorical data naturally, or (4) Use a hybrid approach: continuous features with LDA + categorical features with other methods, then combine predictions.
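A sketch of option (1), with placeholder column positions for the nominal features: one-hot encode them and pass the continuous columns through untouched.

```python
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = [3, 4]  # hypothetical positions of the nominal features
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",  # keep the continuous columns unchanged
)
model = make_pipeline(preprocess, LinearDiscriminantAnalysis())
```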
LDA incorporates class priors P(y) based on training set proportions, which can bias toward majority classes. Solutions: (1) Adjust priors to reflect desired balance or business costs, (2) Resample data before training (oversample minority or undersample majority), (3) Use class weights if your implementation supports them, or (4) Adjust decision thresholds post-training. The class imbalance section covers these techniques in detail.
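In scikit-learn, option (1) is a constructor argument: pass the priors you want instead of letting the class frequencies of the training set decide.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# e.g. force equal priors for a 3-class problem instead of the empirical frequencies
lda_balanced = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
# lda_balanced.fit(X_train, y_train)  # X_train, y_train assumed to exist
```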