Classify observations into groups using Fisher's LDA, QDA, and Bayesian classification methods
Find the coefficient vector a that maximizes the ratio of between-group to within-group variance:
λ = (a′Ba) / (a′Wa)
Two-Group Solution
a = S_pooled⁻¹(x̄₁ − x̄₂)
Classification Rule
Assign to group with nearest centroid in discriminant space
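A minimal NumPy sketch of the two-group rule described above; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def fisher_two_group(X1, X2):
    """Fisher direction a = S_pooled^{-1}(xbar1 - xbar2) and a nearest-projected-centroid rule."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pooled, m1 - m2)          # discriminant direction
    c1, c2 = a @ m1, a @ m2                          # projected group centroids

    def classify(x):
        z = a @ x                                    # projection onto the discriminant axis
        return 1 if abs(z - c1) < abs(z - c2) else 2

    return a, classify
```

With equal priors this nearest-projected-centroid rule is the same as comparing the score to the midpoint cutoff ½·a′(x̄₁ + x̄₂).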
LDA
Assumes equal covariances → Linear boundaries. Fewer parameters, more stable with small samples.
QDA
Allows different covariances → Quadratic boundaries. More flexible but requires more data.
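A short scikit-learn sketch contrasting the two; the iris data here is only a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)                  # 3 groups, p = 4 variables

lda = LinearDiscriminantAnalysis().fit(X, y)       # pooled covariance -> linear boundaries
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-group covariances -> quadratic boundaries

print("LDA resubstitution accuracy:", lda.score(X, y))
print("QDA resubstitution accuracy:", qda.score(X, y))
```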
Classify to the group k that maximizes the posterior probability:
P(k | x) ∝ π_k f_k(x)
f_k(x): class-conditional density for group k
π_k: prior probability of group k
For LDA with normal distributions, classification simplifies to linear scores:
d_k(x) = x′Σ⁻¹μ_k − ½ μ_k′Σ⁻¹μ_k + ln π_k
Classification
Assign to group with largest discriminant score
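A sketch of the linear scores computed directly from estimated means, pooled covariance, and priors (NumPy; the argument names are illustrative):

```python
import numpy as np

def linear_scores(x, means, S_pooled, priors):
    """d_k(x) = x' S^{-1} mu_k - 1/2 mu_k' S^{-1} mu_k + ln(pi_k) for each group k."""
    S_inv = np.linalg.inv(S_pooled)
    return np.array([x @ S_inv @ mu - 0.5 * mu @ S_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Assign to the group with the largest score:
# k_hat = np.argmax(linear_scores(x_new, means, S_pooled, priors))
```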
For g groups, find the discriminant functions by solving the eigenvalue problem:
W⁻¹B a = λ a
Number of Functions
At most min(g−1, p) discriminant functions
Eigenvalues
λ_i measures the discriminating power of function i
Proportion of trace: λ_i / Σ_j λ_j gives the proportion of between-group variance explained by function i
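A NumPy sketch that builds B and W from the data and solves the eigenvalue problem (names illustrative):

```python
import numpy as np

def discriminant_functions(X, y):
    """Eigen-decomposition of W^{-1} B; returns eigenvalues, directions, and proportion of trace."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - grand_mean, mk - grand_mean)   # between-group scatter
        W += (Xk - mk).T @ (Xk - mk)                                # within-group scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    prop_trace = eigvals / eigvals.sum()     # share of between-group variance per function
    return eigvals, eigvecs, prop_trace
```

Only the first min(g − 1, p) eigenvalues are meaningfully non-zero; the remainder are numerically zero.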
Project observations onto first two discriminant functions for visualization:
Scores
z_ij = a_j′x_i for observation i on function j
Centroids
Plot group means in discriminant space
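A sketch of the usual LD1-vs-LD2 plot with group centroids (scikit-learn and matplotlib; iris as a stand-in dataset):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # scores on LD1, LD2

plt.scatter(Z[:, 0], Z[:, 1], c=y, alpha=0.6)
for k in set(y):
    plt.scatter(*Z[y == k].mean(axis=0), marker='x', s=120, c='black')   # group centroid
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()
```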
With unequal covariances, the discriminant score becomes quadratic in x:
d_k(x) = −½ ln|Σ_k| − ½ (x − μ_k)′Σ_k⁻¹(x − μ_k) + ln π_k
Decision Boundary
Quadratic (conic sections: ellipses, hyperbolas, parabolas)
Parameters
More parameters to estimate: a separate covariance matrix Σ_k for each group
Confusion Matrix
Table of predicted vs actual classes. Shows TP, FP, TN, FN for each group.
Cross-Validation
Leave-one-out or k-fold CV gives nearly unbiased error estimates.
ROC Curve: Plots TPR vs FPR. AUC (Area Under Curve) summarizes overall classification performance (0.5 = random, 1.0 = perfect).
Accuracy
(TP + TN) / (TP + TN + FP + FN)
Sensitivity (Recall)
TP / (TP + FN)
Specificity
TN / (TN + FP)
Precision
TP / (TP + FP)
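A sketch of the evaluation workflow for a two-group problem (scikit-learn; the breast-cancer data is just a convenient binary example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
y_hat = lda.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("precision  :", tp / (tp + fp))
print("AUC        :", roc_auc_score(y_te, lda.predict_proba(X_te)[:, 1]))
```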
Multivariate Normality
Each group follows a multivariate normal distribution
Homoscedasticity (LDA)
Equal covariance matrices across groups
Independence
Observations are independent
No Multicollinearity
Predictors should not be perfectly correlated
Robustness: LDA is fairly robust to mild violations of normality, especially with large samples. Use QDA when homoscedasticity is violated.
Given two groups with means x̄₁, x̄₂ and pooled covariance S_pooled:
Group Means
x̄₁ and x̄₂, the vectors of variable means within each group
Discriminant Direction
a = S_pooled⁻¹(x̄₁ − x̄₂)
Classification rule: Project new observation onto discriminant axis. Assign to group with nearest projected centroid.
Forward Selection
Add variables that most improve discrimination (based on Wilks' Lambda)
Backward Elimination
Remove variables that least contribute to discrimination
Caution: Stepwise methods can overfit. Consider cross-validation for variable selection.
For g groups, compute the discriminant scores:
d_k(x) = x̄_k′S_pooled⁻¹x − ½ x̄_k′S_pooled⁻¹x̄_k + ln π_k
Classification Rule
Assign to the group with the largest score d_k(x)
Number of Functions
At most min(g-1, p) discriminant functions
Find linear combinations z = a′x that maximize the ratio of between-group to within-group variance:
λ = (a′Ba) / (a′Wa)
Solution: the eigenvectors of W⁻¹B give the discriminant directions.
Summary of classification results:
Accuracy
Proportion of observations correctly classified
Error Rate
Proportion misclassified (1 − accuracy)
Sensitivity
Proportion of actual positives correctly classified
Specificity
Proportion of actual negatives correctly classified
Leave-One-Out (LOO)
Train on n-1, test on 1; repeat for each observation
K-Fold CV
Split into K parts; train on K-1, test on 1; rotate
Why CV: Resubstitution error (training accuracy) is optimistically biased. CV provides a more realistic estimate.
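A sketch comparing resubstitution, 10-fold, and leave-one-out estimates (scikit-learn; iris again as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()

loo_acc = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()   # leave-one-out
kfold_acc = cross_val_score(lda, X, y, cv=10).mean()            # 10-fold CV
resub_acc = lda.fit(X, y).score(X, y)                           # optimistically biased

print("resubstitution:", resub_acc, " 10-fold:", kfold_acc, " LOO:", loo_acc)
```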
Regularized discriminant analysis interpolates between LDA and QDA via a mixing parameter α:
Σ_k(α) = α Σ_k + (1 − α) Σ_pooled
α = 0
Equivalent to LDA (pooled covariance)
α = 1
Equivalent to QDA (separate covariances)
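scikit-learn does not expose this exact pooled-vs-separate interpolation as a single parameter, so here is a minimal NumPy sketch of the regularized covariances (names illustrative):

```python
import numpy as np

def regularized_covariances(X, y, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled for each group k."""
    classes = np.unique(y)
    covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}
    n_k = {k: int((y == k).sum()) for k in classes}
    pooled = sum((n_k[k] - 1) * covs[k] for k in classes) / (len(y) - len(classes))
    return {k: alpha * covs[k] + (1 - alpha) * pooled for k in classes}

# alpha = 0 reproduces LDA's pooled covariance; alpha = 1 reproduces QDA's separate covariances.
```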
For high-dimensional data (p > n), shrink the covariance estimate toward a scaled identity matrix:
Σ(γ) = (1 − γ) Σ + γ (tr(Σ)/p) I
Benefit: Ensures covariance matrix is positive definite and reduces estimation variance in high dimensions.
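In scikit-learn this kind of shrinkage is available through the 'lsqr' or 'eigen' solvers; a brief sketch on synthetic p > n data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic high-dimensional example: 60 observations, 100 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1] += 0.5                      # shift group 2 so there is some signal

# shrinkage='auto' uses the Ledoit-Wolf estimate (shrinks toward a scaled identity)
lda_shrunk = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
print(lda_shrunk.score(X, y))
```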
Medical Diagnosis
Disease classification from symptoms/biomarkers
Credit Scoring
Default vs non-default classification
Face Recognition
Fisherfaces method for identity verification
Species Identification
Taxonomy based on morphological measurements
LDA vs Logistic Regression
LDA assumes normality; logistic is more flexible but may need more data
LDA vs Naive Bayes
Naive Bayes assumes independence; LDA models correlations
LDA vs SVM
SVM finds optimal hyperplane; LDA uses class distributions
LDA vs kNN
kNN is non-parametric; LDA provides interpretable coefficients
LDA can project data onto a lower-dimensional space while preserving class separability:
Maximum Dimensions
At most min(g-1, p) discriminant dimensions for g groups and p variables
Comparison with PCA
PCA maximizes variance; LDA maximizes class separation
Use case: Preprocessing for visualization or when subsequent classifier benefits from reduced dimensionality.
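A brief sketch contrasting the two projections on the same data (iris as a stand-in; for g = 3 groups LDA yields at most 2 dimensions):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)                              # unsupervised: max variance
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)    # supervised: max separation

print(Z_pca.shape, Z_lda.shape)
```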
Sample Size
Each group should have n_k > p observations, and the total sample size should be large relative to p
Outliers
LDA is sensitive to outliers; check within-group Mahalanobis distances (see the sketch below)
Missing Data
Requires complete cases; consider imputation
Multicollinearity
High collinearity can cause numerical instability
Non-Normality
Transform variables or use robust methods
Unequal Covariances
Use QDA or regularized methods
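A sketch of the within-group Mahalanobis outlier check mentioned above, flagging observations beyond a chi-square cutoff (the 0.001 tail probability is an illustrative choice):

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(X, y, alpha=0.001):
    """Flag observations whose within-group squared Mahalanobis distance exceeds the chi2 cutoff."""
    flags = np.zeros(len(y), dtype=bool)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(Xk, rowvar=False))
        d2 = np.einsum('ij,jk,ik->i', Xk - mu, S_inv, Xk - mu)   # squared Mahalanobis distances
        flags[y == k] = d2 > cutoff
    return flags
```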
R
MASS::lda(), MASS::qda()
Python
sklearn.discriminant_analysis
SPSS
Analyze → Classify → Discriminant
SAS
PROC DISCRIM
For two groups, maximize the separation between the projected means relative to the projected within-group variance:
J(a) = (a′x̄₁ − a′x̄₂)² / (a′Wa)
This can be rewritten as:
J(a) = (a′Ba) / (a′Wa)
where B = (x̄₁ − x̄₂)(x̄₁ − x̄₂)′ is the between-group scatter matrix and W is the within-group scatter matrix.
Solution: Setting the derivative with respect to a to zero yields a ∝ W⁻¹(x̄₁ − x̄₂)
For g groups, define the between-group scatter matrix:
B = Σ_k n_k (x̄_k − x̄)(x̄_k − x̄)′
Within-group scatter matrix:
W = Σ_k Σ_{i∈k} (x_ik − x̄_k)(x_ik − x̄_k)′
Eigenvalue Problem
W⁻¹B a = λ a
Number of Functions
min(g−1, p) non-zero eigenvalues
Data: Two groups with p=2 variables
Group 1: n₁=3
Group 2: n₂=4
Step 1: Pooled covariance matrix
Step 2: Discriminant coefficients
Step 3: Classification cutoff = 0.14
Rule: Classify to Group 1 if the discriminant score exceeds the cutoff of 0.14, otherwise Group 2
New observation:
Discriminant score: 0.21
Decision: 0.21 > 0.14 → Assign to Group 1
Equal Priors
Use when no prior knowledge
Proportional
Based on sample sizes
Custom
Set based on domain knowledge
E.g., disease prevalence
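In scikit-learn, priors can be set explicitly; the 0.95/0.05 split below is only an illustration of a low-prevalence setting:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Default: priors estimated from class proportions in the training sample.
lda_prop = LinearDiscriminantAnalysis()

# Custom priors, e.g. a disease prevalence of 5%:
lda_custom = LinearDiscriminantAnalysis(priors=[0.95, 0.05])
```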
Incorporate asymmetric costs into classification:
Cost Matrix
c(k|j) = cost of classifying j as k
Example
False negative in cancer detection costs more than false positive
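A sketch of minimum-expected-cost assignment built on posterior probabilities (the cost values are illustrative):

```python
import numpy as np

def min_cost_predict(posteriors, cost):
    """Assign each observation to the class k minimizing sum_j cost[k, j] * P(j | x).

    posteriors: (n, g) array of posterior probabilities P(j | x)
    cost[k, j]: cost of predicting class k when the true class is j
    """
    expected_cost = posteriors @ cost.T     # (n, g): expected cost of each possible prediction
    return expected_cost.argmin(axis=1)

# Example: a false negative (predict 0 when truth is 1) costs 10x a false positive
cost = np.array([[0.0, 10.0],
                 [1.0,  0.0]])
# y_hat = min_cost_predict(lda.predict_proba(X_new), cost)
```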
Singular W
Within-group covariance not invertible when p > n
Overfitting
Model memorizes training data; poor generalization
Shrink the standardized group centroids toward the overall mean by soft-thresholding:
d′_kj = sign(d_kj)(|d_kj| − Δ)₊, where d_kj is the standardized difference between the centroid of group k and the overall mean for variable j
Threshold Δ
Controls amount of shrinkage; choose via CV
Feature Selection
Variables shrunk to zero are excluded
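scikit-learn's NearestCentroid provides a related shrunken-centroid classifier through shrink_threshold; a brief sketch (the threshold value is illustrative and would normally be chosen by CV):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestCentroid

X, y = load_breast_cancer(return_X_y=True)

nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)   # soft-thresholds centroid components
print(nsc.score(X, y))
```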
Coefficients
Elements of the discriminant vector a; the weights in the linear combination
Loadings (Structure Coefficients)
Correlation between variable and discriminant scores
Recommendation: Use loadings for interpretation (less affected by multicollinearity)
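A sketch computing structure coefficients as correlations between each variable and each discriminant score (iris as a stand-in):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # discriminant scores

# loadings[i, j] = correlation between variable i and discriminant function j
loadings = np.array([[np.corrcoef(X[:, i], Z[:, j])[0, 1] for j in range(Z.shape[1])]
                     for i in range(X.shape[1])])
print(np.round(loadings, 2))
```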
Plot observations and group centroids in discriminant space:
Scatterplot
LD1 vs LD2 with group colors; shows separation
Territorial Map
Show decision boundaries in original or discriminant space
LDA and MANOVA are two sides of the same coin:
MANOVA
Tests if group means differ significantly
LDA
Finds directions that best separate groups
Key insight: Both use eigenvalues. MANOVA uses them for hypothesis testing; LDA uses eigenvectors for classification.
Wilks' Lambda measures discriminating power:
Λ = |W| / |B + W| = Π_i 1/(1 + λ_i)
Range
0 < Λ ≤ 1; smaller values indicate better separation
Testing
Transform to F-statistic for significance test
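A NumPy sketch of Wilks' Lambda computed from the scatter matrices (same B and W as in the eigenvalue sketch earlier):

```python
import numpy as np

def wilks_lambda(X, y):
    """Lambda = |W| / |B + W|; smaller values indicate stronger group separation."""
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - grand_mean, mk - grand_mean)
        W += (Xk - mk).T @ (Xk - mk)
    return np.linalg.det(W) / np.linalg.det(B + W)
```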
Forward Selection
Start with no variables; add most discriminating one at each step
Backward Elimination
Start with all variables; remove least discriminating one at each step
Stepwise
Combine forward and backward; variables can enter and leave
Selection Criterion
Wilks' Lambda, F-to-enter/remove, or partial F-test
Test the significance of adding or removing a variable with a partial F-test based on the change in Wilks' Lambda:
F = [(n − g − p) / (g − 1)] × (1 − Λ_partial) / Λ_partial, where Λ_partial = Λ(with variable) / Λ(without variable)
Decision: Add the variable if F > F_critical; remove it if F < F_critical
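For a cross-validated (rather than Wilks'-Lambda-based) version of forward selection, scikit-learn's SequentialFeatureSelector can wrap LDA; a sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)

selector = SequentialFeatureSelector(LinearDiscriminantAnalysis(),
                                     n_features_to_select=5,
                                     direction='forward',
                                     cv=5).fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected variables
```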
Using Resubstitution Error
Always optimistically biased; use CV instead
Ignoring Assumptions
Check normality and equal covariances
Too Many Variables
Overfitting when p approaches n; use regularization
Imbalanced Classes
Adjust priors or use stratified sampling
Confusing Coefficients and Loadings
Use loadings for interpretation
Stepwise Selection Overfitting
Validate selected model on independent data
Nonlinear extension using kernel trick:
Idea
Map data to high-dimensional feature space where linear separation is possible
Kernels
RBF, polynomial, sigmoid
Add a penalty term to the objective for regularization:
L1 Penalty (Lasso)
Induces sparsity; automatic feature selection
L2 Penalty (Ridge)
Shrinks coefficients; improves stability
LDA assumes normality and equal covariances; logistic regression is more flexible. LDA can be better with small samples and when assumptions hold.
Adjust prior probabilities to reflect true population proportions or use equal priors if classification costs are equal.
LDA is fairly robust to mild non-normality. For severe violations, consider logistic regression, random forests, or other non-parametric classifiers.