Learn Fisher's linear discriminant for classification through optimal projection, from theory to practical applications in wine quality assessment and customer segmentation
Linear Discriminant Analysis (LDA), also known as Fisher's Linear Discriminant, is a classification method that finds a linear combination of features that best separates two or more classes. Unlike logistic regression, which directly models P(y|x), LDA is a generative model: it models the distribution of the features within each class.
LDA seeks to find a projection direction w such that:
Maximize: J(w) = (wᵀ S_b w) / (wᵀ S_w w)
where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix.
1. Fisher's Discriminant: Find projection maximizing class separation
2. Probabilistic View: Model each class as Gaussian distribution with shared covariance, then use Bayes' theorem for classification
Understanding scatter matrices and the optimization objective
Measures how spread out samples are within each class. We want to minimize this (tight clusters):
S_w = Σ_{k=0,1} Σ_{x ∈ class k} (x - μ_k)(x - μ_k)ᵀ
where μ_k is the mean of class k. This measures variance within each class.
Measures how far apart the class means are. We want to maximize this (well-separated classes):
S_b = (μ_0 - μ_1)(μ_0 - μ_1)ᵀ
For binary classification, this simplifies to the outer product of the difference between class means.
The optimal direction w* that maximizes the Fisher criterion J(w) = (wᵀ S_b w) / (wᵀ S_w w) is:
w* = S_w⁻¹(μ_0 - μ_1)
This elegant solution shows that the optimal direction is proportional to the inverse of the within-class scatter times the difference between class means.
Intuition: If classes have small within-class variance (S_w small), we can separate them easily. If class means are far apart (μ_0 - μ_1 large), separation is easier. The formula combines both factors optimally.
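To make the closed-form solution concrete, here is a minimal NumPy sketch that builds S_w and S_b for a two-class problem and solves for w*. The two Gaussian blobs are synthetic stand-in data, not from the wine or customer examples below.

```python
# Minimal NumPy sketch of Fisher's closed-form solution for two classes.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # class 0 samples (toy data)
X1 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))  # class 1 samples (toy data)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of outer products of centered samples, per class
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Between-class scatter: outer product of the difference between class means
diff = (mu0 - mu1).reshape(-1, 1)
S_b = diff @ diff.T

# Optimal projection direction: w* = S_w^{-1} (mu0 - mu1)
w = np.linalg.solve(S_w, mu0 - mu1)

# Fisher criterion J(w) evaluated at the optimum
J = (w @ S_b @ w) / (w @ S_w @ w)
print("w* =", w, " J(w*) =", J)
```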
Once we have w*, we project a new sample x and compare to a threshold:
Project sample:
z = wᵀx
Classification threshold:
threshold = wᵀ((μ_0 + μ_1) / 2)
Classify as class 0 if:
z > threshold, else class 1 (with w* = S_w⁻¹(μ_0 - μ_1), class 0 has the larger projected mean, so it lies above the midpoint)
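A small helper in the same spirit: given the direction w and the two class means from the step above, it applies the projection and midpoint-threshold rule.

```python
import numpy as np

def lda_predict(x, w, mu0, mu1):
    """Return 0 if the projection of x lands on class 0's side of the midpoint."""
    threshold = w @ ((mu0 + mu1) / 2.0)
    z = w @ x
    # With w = S_w^{-1}(mu0 - mu1), class 0 has the larger projected mean
    return 0 if z > threshold else 1
```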
Using LDA to classify wine quality from chemical properties
A wine producer has measurements from 500 bottles, classified into three quality tiers. Sample:
| ID | Alcohol % | Acidity | pH | Sugar g/L | Quality |
|---|---|---|---|---|---|
| 1 | 12.8 | 2.8 | 3.2 | 2.1 | Good |
| 2 | 13.5 | 3.2 | 3.1 | 1.8 | Excellent |
| 3 | 11.2 | 2.5 | 3.5 | 5.2 | Fair |
| 4 | 13.8 | 3.0 | 3.0 | 1.5 | Excellent |
| 5 | 12.0 | 2.6 | 3.4 | 3.8 | Good |
| 6 | 10.8 | 2.3 | 3.6 | 6.5 | Fair |
| 7 | 13.2 | 3.1 | 3.1 | 2.0 | Excellent |
| 8 | 11.8 | 2.7 | 3.3 | 4.2 | Good |
LDA finds two discriminant directions (for 3 classes, we get at most K-1 = 2 directions):
First Discriminant (explains 69% of the between-class variance):
LD1 = 0.48 × alcohol + 0.35 × acidity - 0.62 × pH - 0.45 × sugar
Second Discriminant (explains 31% of the between-class variance):
LD2 = 0.22 × alcohol - 0.58 × acidity + 0.25 × pH + 0.73 × sugar
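For reference, this is roughly how the multi-class fit could be reproduced with scikit-learn on the eight-row sample above. Because the LD1/LD2 coefficients quoted here come from the full 500-bottle dataset, the numbers printed by this sketch will differ; it only illustrates the workflow.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# columns: alcohol %, acidity, pH, sugar (g/L) -- the sample rows from the table above
X = np.array([
    [12.8, 2.8, 3.2, 2.1],
    [13.5, 3.2, 3.1, 1.8],
    [11.2, 2.5, 3.5, 5.2],
    [13.8, 3.0, 3.0, 1.5],
    [12.0, 2.6, 3.4, 3.8],
    [10.8, 2.3, 3.6, 6.5],
    [13.2, 3.1, 3.1, 2.0],
    [11.8, 2.7, 3.3, 4.2],
])
y = np.array(["Good", "Excellent", "Fair", "Excellent",
              "Good", "Fair", "Excellent", "Good"])

lda = LinearDiscriminantAnalysis(n_components=2)  # at most K-1 = 2 discriminants
Z = lda.fit_transform(X, y)                       # samples projected onto LD1, LD2

print(lda.explained_variance_ratio_)   # share of between-class variance per discriminant
print(lda.scalings_[:, :2])            # feature weights defining LD1 and LD2
print(lda.predict([[12.5, 2.9, 3.2, 2.5]]))  # classify a hypothetical new bottle
```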
Using LDA for e-commerce customer classification
An online retailer wants to segment customers into High-Value, Medium-Value, and Low-Value groups to personalize marketing campaigns. Sample data:
| ID | Monthly Spend ($) | Orders/Month | Avg Order ($) | Tenure (mo) | Segment |
|---|---|---|---|---|---|
| 1 | 450 | 12 | 37.50 | 24 | High-Value |
| 2 | 85 | 2 | 42.50 | 6 | Low-Value |
| 3 | 280 | 8 | 35.00 | 18 | Medium-Value |
| 4 | 520 | 15 | 34.70 | 36 | High-Value |
| 5 | 120 | 3 | 40.00 | 8 | Low-Value |
| 6 | 310 | 9 | 34.40 | 20 | Medium-Value |
LDA reveals that customer value is driven primarily by purchase frequency and the monthly spend it generates, rather than by average order value. This suggests marketing should focus on increasing purchase frequency for medium-value customers rather than just order value.
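A hedged sketch of the same workflow on the six-row customer sample: standardizing the features first makes the discriminant weights comparable, so their magnitudes indicate which variables drive the segmentation. With so few rows, the numbers are purely illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# columns: monthly spend ($), orders per month, avg order ($), tenure (months)
X = np.array([
    [450, 12, 37.5, 24],
    [ 85,  2, 42.5,  6],
    [280,  8, 35.0, 18],
    [520, 15, 34.7, 36],
    [120,  3, 40.0,  8],
    [310,  9, 34.4, 20],
])
y = np.array(["High", "Low", "Medium", "High", "Low", "Medium"])

model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=1))
model.fit(X, y)

# Weights on the standardized features for the first discriminant: a larger
# magnitude means a bigger contribution to separating the value segments.
print(model.named_steps["lineardiscriminantanalysis"].scalings_[:, 0])
```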
Understanding LDA's requirements and limitations
LDA assumes features follow a multivariate normal (Gaussian) distribution within each class.
Check: Use Q-Q plots, Shapiro-Wilk test, or visualize feature distributions by class. Moderate violations are often acceptable.
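One way to run that check, assuming X holds the feature matrix and y the class labels, is a per-class, per-feature Shapiro-Wilk test with SciPy:

```python
import numpy as np
from scipy import stats

def shapiro_by_class(X, y, alpha=0.05):
    """Flag feature/class pairs whose distribution deviates from normality."""
    for label in np.unique(y):
        X_c = X[y == label]
        for j in range(X_c.shape[1]):
            stat, p = stats.shapiro(X_c[:, j])
            verdict = "violates" if p < alpha else "looks"
            print(f"class={label} feature={j}: p={p:.3f} ({verdict} normal)")
```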
All classes share the same covariance matrix. This is a strong assumption that's often violated.
Check: Compare covariance matrices between classes. If very different, consider Quadratic Discriminant Analysis (QDA) instead.
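A quick numeric version of that comparison, again assuming X and y exist: compute each class covariance and its Frobenius distance to the pooled within-class covariance. Large gaps are a hint to try QDA.

```python
import numpy as np

def covariance_gap(X, y):
    """Per-class Frobenius distance between the class covariance and the pooled one."""
    labels = np.unique(y)
    covs = {c: np.cov(X[y == c], rowvar=False) for c in labels}
    # Pooled within-class covariance: weighted average of the per-class covariances
    weights = {c: (np.sum(y == c) - 1) for c in labels}
    pooled = sum(weights[c] * covs[c] for c in labels) / sum(weights.values())
    return {c: np.linalg.norm(covs[c] - pooled, ord="fro") for c in labels}
```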
Samples are independent and identically distributed (i.i.d.).
Check: Watch for temporal dependencies, duplicate samples, or hierarchical data structures that violate independence.
LDA is sensitive to outliers because they heavily influence mean and covariance estimates.
Check: Use box plots, z-scores, or isolation forest to detect and handle outliers before applying LDA.
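For example, an IsolationForest screen could look like the sketch below; the 5% contamination rate is an arbitrary starting point, not a recommendation.

```python
from sklearn.ensemble import IsolationForest

def drop_outliers(X, y, contamination=0.05, seed=0):
    """Keep only the samples IsolationForest marks as inliers (+1) before fitting LDA."""
    mask = IsolationForest(contamination=contamination, random_state=seed).fit_predict(X) == 1
    return X[mask], y[mask]
```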
| Scenario | Use LDA When... | Use Logistic Regression When... |
|---|---|---|
| Data Distribution | Features are approximately Gaussian | No assumptions about feature distribution |
| Number of Classes | 3+ classes (naturally multi-class) | Binary classification (or use OvR/OvO) |
| Sample Size | Small to medium (if assumptions met) | Any size, especially large datasets |
| Class Separation | Classes are well-separated | Any level of separation |
| Dimensionality Reduction | Need low-dimensional projection | Classification only |
| Computational Speed | Need fast training (closed-form) | Can afford iterative optimization |
Start with logistic regression as your default for binary classification—it's more robust and flexible. Try LDA when you have multi-class problems, need dimensionality reduction, or have strong evidence that Gaussian assumptions hold. Always compare both methods empirically using cross-validation on your specific dataset.
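A minimal cross-validation comparison might look like this; the bundled Iris data stands in for your own X and y, and 5-fold accuracy is just a reasonable default metric.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset for the comparison

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```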
Advanced variants and related techniques
Quadratic Discriminant Analysis (QDA) relaxes the equal-covariance assumption: each class gets its own covariance matrix, which produces quadratic (curved) decision boundaries instead of linear ones.
When to use:
Classes have different covariance structures, non-linear boundaries, or you have enough data to estimate separate covariances reliably.
Trade-off:
More flexible but requires O(d²K) parameters vs O(d²) for LDA. Needs more training data and is more prone to overfitting.
Regularized Discriminant Analysis (RDA) interpolates between LDA and QDA using a regularization parameter α:
Σ_k(α) = α × Σ_k + (1-α) × Σ_pooled
When α=0, equivalent to LDA. When α=1, equivalent to QDA. Choose α via cross-validation to balance bias-variance tradeoff.
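A direct NumPy rendering of that interpolation, for illustration. Note this is not the same as scikit-learn's `shrinkage` option for LDA, which shrinks toward a scaled identity matrix rather than blending per-class and pooled covariances.

```python
import numpy as np

def regularized_covariances(X, y, alpha):
    """Return Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled for each class."""
    labels = np.unique(y)
    covs = {c: np.cov(X[y == c], rowvar=False) for c in labels}
    # Pooled within-class covariance (the LDA estimate, recovered when alpha = 0)
    weights = {c: (np.sum(y == c) - 1) for c in labels}
    pooled = sum(weights[c] * covs[c] for c in labels) / sum(weights.values())
    return {c: alpha * covs[c] + (1 - alpha) * pooled for c in labels}
```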
LDA can project high-dimensional data to at most (K-1) dimensions while preserving class separability, where K is the number of classes. This is powerful for visualization and feature extraction.
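In scikit-learn this is just a fit_transform with n_components set to at most K-1; the bundled Iris data (K = 3 classes) serves as a stand-in here.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(Z.shape)  # (150, 2): each sample now lives in the 2-D discriminant space
```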
PCA is unsupervised and finds directions of maximum variance regardless of class labels. LDA is supervised and finds directions that maximize class separation. For classification tasks, LDA typically provides more discriminative low-dimensional representations. PCA can discard important class information if most variance is within-class rather than between-class.
No, standard LDA requires n > d; otherwise S_w becomes singular and cannot be inverted. Solutions: (1) Use regularized LDA (adds λI to S_w), (2) Reduce dimensionality with PCA first, then apply LDA, (3) Use alternative methods like penalized LDA, or (4) Switch to algorithms that handle high-dimensional data better (e.g., SVM, random forests).
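A sketch of option (2): reduce with PCA first, then apply LDA inside a pipeline. The 50-component cap is an arbitrary placeholder; pick it by cross-validation. (scikit-learn's LinearDiscriminantAnalysis also exposes a `shrinkage` option with the 'lsqr' or 'eigen' solvers as one form of regularized LDA.)

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# PCA shrinks the feature space so S_w is no longer singular, then LDA classifies
pca_lda = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())
# pca_lda.fit(X_train, y_train)  # X_train may originally have more features than samples
```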
LDA assumes continuous features. For categorical variables: (1) Use one-hot encoding for nominal categories, (2) Use ordinal encoding if categories have natural ordering, (3) Consider alternatives like Naive Bayes which handles categorical data naturally, or (4) Use a hybrid approach: continuous features with LDA + categorical features with other methods, then combine predictions.
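A sketch of option (1), with placeholder column positions for the nominal features: one-hot encode them and pass the continuous columns through untouched.

```python
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = [3, 4]  # hypothetical positions of the nominal features
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",  # keep the continuous columns unchanged
)
model = make_pipeline(preprocess, LinearDiscriminantAnalysis())
```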
LDA incorporates class priors P(y) based on training set proportions, which can bias toward majority classes. Solutions: (1) Adjust priors to reflect desired balance or business costs, (2) Resample data before training (oversample minority or undersample majority), (3) Use class weights if your implementation supports them, or (4) Adjust decision thresholds post-training. The class imbalance section covers these techniques in detail.
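In scikit-learn, option (1) is a constructor argument: pass the priors you want instead of letting the class frequencies of the training set decide.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# e.g. force equal priors for a 3-class problem instead of the empirical frequencies
lda_balanced = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
# lda_balanced.fit(X_train, y_train)  # X_train, y_train assumed to exist
```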