Analyze relationships between two sets of variables through canonical variates and correlations
Given two variable sets X (p variables) and Y (q variables), find linear combinations U = a'X and V = b'Y (the canonical variates).
Objective
Maximize ρ = Corr(U, V) = a'Σ₁₂b / √(a'Σ₁₁a · b'Σ₂₂b)
Within-set Covariances
Σ₁₁ and Σ₂₂
Between-set Covariance
Σ₁₂
Canonical correlations are square roots of eigenvalues of:
M = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁
Ordering
ρ₁ ≥ ρ₂ ≥ … ≥ ρᵣ ≥ 0, where r = min(p, q)
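The eigenvalue recipe above can be sketched directly in a few lines of numpy; the simulated data and the shared latent factor are illustrative assumptions, not part of the text:

```python
# Canonical correlations as square roots of the eigenvalues of
# S11^-1 S12 S22^-1 S21, estimated from simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 3, 2
Z = rng.standard_normal((n, 1))          # shared latent factor (assumed)
X = Z + rng.standard_normal((n, p))      # set 1: p variables
Y = Z + rng.standard_normal((n, q))      # set 2: q variables

S = np.cov(np.hstack([X, Y]), rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)   # S11^-1 S12 S22^-1 S21
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
rho = np.sqrt(np.clip(eigvals[:min(p, q)], 0.0, 1.0))       # r = min(p, q) correlations
print(rho)
```

Only the first min(p, q) eigenvalues are non-zero; the remainder are numerical noise, hence the clip.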
Wilks' Lambda
Λ = ∏ᵢ₌₁ʳ (1 − ρᵢ²)
Test sequentially: first test whether any correlations are significant, then whether the remaining ones are, and so on.
Canonical Loadings
Correlations between original variables and their own canonical variates. Interpret like factor loadings.
Cross-Loadings
Correlations between variables and the other set's canonical variates. Show direct relationships.
Redundancy Index
Proportion of variance in one set explained by canonical variates of the other set. More interpretable than canonical correlation alone.
Redundancy for set 1 given set 2:
Rd(1|2) = Σᵢ₌₁ʳ [ (1/p) Σⱼ L²ⱼᵢ ] · ρᵢ²
(1/p) Σⱼ L²ⱼᵢ — squared loadings (variance extracted by variate i)
ρᵢ² — squared canonical correlation (shared variance)
Steps to compute canonical correlations:
1. Compute the covariance blocks Σ₁₁, Σ₂₂, and Σ₁₂ from the data
2. Form M = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁
3. Find the eigenvalues λ₁ ≥ … ≥ λᵣ of M
4. Take ρᵢ = √λᵢ
Sample Canonical Correlations
Biased upward, especially with small n or many variables
Shrinkage Adjustment
Use adjusted or cross-validated estimates for interpretation
Multiple Regression
When one set has a single variable, ρ₁ = R (the multiple correlation)
Hotelling's T²
When one set is a group indicator
MANOVA
One set is group membership coded as dummies
Discriminant Analysis
Canonical variates become discriminant functions
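The first of these equivalences is easy to verify numerically; a sketch (data simulated for illustration) comparing the multiple correlation R with ρ₁ when q = 1:

```python
# With a single variable in set 2, the first canonical correlation
# equals the multiple correlation R from regressing y on X.
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.standard_normal(n)

# Multiple correlation: corr(y, fitted values) from least squares
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
R = np.corrcoef(y, X1 @ beta)[0, 1]

# First canonical correlation with q = 1: rho1^2 = s12' S11^-1 s12 / s22
S = np.cov(np.column_stack([X, y]), rowvar=False)
S11, s12, s22 = S[:p, :p], S[:p, p], S[p, p]
rho1 = np.sqrt(s12 @ np.linalg.solve(S11, s12) / s22)

print(round(R, 4), round(rho1, 4))   # identical up to rounding
```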
Linear Relationships
CCA only captures linear associations between sets
Multivariate Normality
Required for significance tests
Sample Size
n should be much larger than p + q
No Multicollinearity
Within-set covariance matrices must be invertible
Set 1: Academic measures (Math, Reading). Set 2: Motivation measures (Interest, Effort).
Results
ρ₁ = 0.72
Test
Wilks' Λ significant, p < 0.001
Interpretation: The first canonical correlation is strong (0.72), suggesting the academic and motivation measures share a substantial linear relationship.
Test
Chi-Square Approximation
χ² = −[n − 1 − (p + q + 1)/2] ln Λ, with df = pq (Bartlett's approximation)
Interpretation
Small Λ (close to 0) → reject H₀ → significant relationship
Test remaining correlations after removing the first k:
Λₖ = ∏ᵢ₌ₖ₊₁ʳ (1 − ρᵢ²), df = (p − k)(q − k)
Procedure: Test all r correlations, then the 2nd through rth, and so on. Stop when a test is not significant.
Correlations between original variables and canonical variates:
Structure Coefficients
Correlations Corr(Xⱼ, Uᵢ) and Corr(Yⱼ, Vᵢ)
Use
Identify which variables contribute most to each dimension
Rule of thumb: Focus on loadings with |r| > 0.3 or 0.4 for interpretation.
Proportion of variance in one set explained by the other:
Rd = Σᵢ ρᵢ² · (average squared loading for variate i)
Interpretation
Average variance explained across canonical dimensions
Note
Redundancy is asymmetric: Rd(Y|X) ≠ Rd(X|Y)
Linearity
CCA detects only linear relationships
Multivariate Normality
Required for hypothesis tests; less critical for descriptive use
No Multicollinearity
Variables within sets should not be too highly correlated
Sample Size
n should be at least 10× total number of variables
Psychology
Relating personality traits to behavioral outcomes
Ecology
Species composition vs environmental variables
Education
Academic performance vs motivation/study habits
Neuroimaging
Brain activity patterns vs behavioral measures
Multiple Regression
CCA with q=1 equals multiple regression (R = ρ₁)
PCA
PCA analyzes a single variable set, maximizing variance within it; CCA relates two sets, maximizing correlation between them
Discriminant Analysis
LDA is CCA with one set being group indicators
PLS
Partial Least Squares maximizes covariance instead of correlation
R
cancor(), CCA package, vegan::cca()
Python
sklearn.cross_decomposition.CCA
SAS
PROC CANCORR
SPSS
Use MANOVA syntax or macros
Sample Size
n should exceed 10(p+q); unstable with small samples
Variable Selection
Include theoretically relevant variables; avoid redundancy
Outliers
Check multivariate outliers; can strongly influence results
Multicollinearity
High collinearity within sets causes instability
For high-dimensional data or when n < p+q:
Ridge CCA
Add penalty to within-set covariance matrices
Sparse CCA
L1 penalty for variable selection (LASSO-type)
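A minimal sketch of the ridge variant: the penalty c·I is added to the within-set covariance blocks before inversion (the penalty value and data are illustrative assumptions):

```python
# Ridge-regularized CCA: stabilize S11^-1 and S22^-1 by adding c * I.
import numpy as np

def ridge_cca_correlations(X, Y, c=0.1):
    n, p = X.shape
    q = Y.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    S11 = S[:p, :p] + c * np.eye(p)                 # penalized within-set blocks
    S22 = S[p:, p:] + c * np.eye(q)
    S12, S21 = S[:p, p:], S[p:, :p]
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(eig[:min(p, q)], 0.0, None))

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 10))
X = np.hstack([X, X[:, :2] + 1e-6 * rng.standard_normal((50, 2))])  # near-duplicate columns
Y = X[:, :3] + rng.standard_normal((50, 3))
print(ridge_cca_correlations(X, Y, c=0.5).round(2))
```

Without the penalty, the near-duplicate columns make S₁₁ effectively singular; the ridge term keeps the solve well-conditioned.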
Capture nonlinear relationships via kernel trick:
Idea
Map variables to high-dimensional space where relationships are linear
Kernels
Polynomial, RBF (Gaussian), sigmoid
Neural network-based approach for complex relationships:
Learn nonlinear transformations of X and Y that maximize correlation; useful for multi-modal data (e.g., image + text)
Research Question: How do cognitive abilities relate to academic achievement?
Set X (Cognitive)
Verbal reasoning, Spatial ability, Processing speed, Working memory (p=4)
Set Y (Academic)
Math score, Reading score, Science score (q=3)
Results (n=200)
ρ₁ (p < 0.001): General cognitive ability ↔ Overall achievement
ρ₂ (p < 0.01): Verbal vs Spatial ↔ Reading vs Math/Science
ρ₃ (p > 0.05): Not significant
Interpretation
Two meaningful dimensions: (1) General ability-achievement link explains most shared variance; (2) Specific verbal-spatial pattern relates to reading vs STEM performance
Over-interpretation
Don't interpret all canonical dimensions; focus on significant ones
Ignoring Loadings
Coefficients alone can be misleading; examine structure coefficients
Small Sample
CCA unstable with n < 10(p+q); correlations can be spuriously high
Confusing with Regression
CCA is symmetric; neither set is "dependent"
Essential information to include:
Canonical Correlations
Report values and significance tests
Redundancy Analysis
Variance explained in each variable set
Structure Coefficients
Canonical loadings for interpretation
Sample Size
Report n, p, q, and ratio
Scenario: Examine relationship between academic skills and achievement
Set 1 (Skills): Reading comprehension, Math reasoning, Verbal ability
Set 2 (Achievement): GPA, Test scores, Assignment grades
First Canonical Pair
Overall academic ability
Redundancy
Skills explain 45% of achievement variance
Interpretation
Strong skills-achievement relationship
Multiple Regression: Multiple predictors → single outcome
CCA: Multiple predictors ↔ multiple outcomes (symmetric)
Use CCA when: Multiple DVs all equally important
MANOVA: Categorical IVs → multiple continuous DVs
CCA: Continuous variables in both sets
Use CCA when: All variables continuous
PCA: Reduce single set of variables
CCA: Find relationships between two sets
Use CCA when: Two distinct variable sets to relate
SEM: Test specific structural model with latent variables
CCA: Exploratory symmetric relationships
Use SEM when: Theory-driven confirmatory analysis
1. Linearity
Relationships should be linear; check scatter plots
2. Multivariate Normality
Use Mardia's test or Q-Q plots
3. Homoscedasticity
Constant variance across canonical variates
4. No Multicollinearity
Check VIF; avoid near-perfect correlations
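Point 4 can be checked in a few lines of numpy: the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix (the data and the collinear column below are illustrative):

```python
# Variance inflation factors: VIF_j = 1 / (1 - R^2_j) = diag(R^-1).
import numpy as np

def vif(X):
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.standard_normal(200)   # near-duplicate of column 0
v = vif(X)
print(v.round(1))   # columns 0 and 3 show inflated values
```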
The first canonical correlation maximizes:
ρ = a'Σ₁₂b / √(a'Σ₁₁a · b'Σ₂₂b)
Subject to normalization constraints:
a'Σ₁₁a = 1 and b'Σ₂₂b = 1
This reduces to finding eigenvectors of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁ or equivalently Σ₂₂⁻¹Σ₂₁Σ₁₁⁻¹Σ₁₂
Using Lagrange multipliers with the normalization constraints:
L(a, b) = a'Σ₁₂b − (λ/2)(a'Σ₁₁a − 1) − (μ/2)(b'Σ₂₂b − 1)
Taking derivatives and setting to zero yields:
Σ₁₂b = λΣ₁₁a and Σ₂₁a = μΣ₂₂b
Key Result
At the optimum, λ = μ = ρ (the canonical correlation)
Solution
Combine the two stationarity equations to get the eigenvalue problem
The canonical weight vectors satisfy:
Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁ a = ρ² a and Σ₂₂⁻¹Σ₂₁Σ₁₁⁻¹Σ₁₂ b = ρ² b
Note: Both matrices have the same non-zero eigenvalues ρ₁² ≥ ρ₂² ≥ … ≥ ρᵣ², where r = min(p, q). The canonical correlations are the square roots of these eigenvalues.
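This equivalence of spectra is quick to verify numerically (simulated covariances, illustrative sizes):

```python
# The p x p and q x q eigenproblems share the same r = min(p, q)
# non-zero eigenvalues.
import numpy as np

rng = np.random.default_rng(8)
n, p, q = 300, 4, 2
A = rng.standard_normal((n, p + q))
S = np.cov(A, rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)   # p x p
M2 = np.linalg.solve(S22, S21) @ np.linalg.solve(S11, S12)   # q x q

e1 = np.sort(np.linalg.eigvals(M1).real)[::-1][:min(p, q)]
e2 = np.sort(np.linalg.eigvals(M2).real)[::-1][:min(p, q)]
print(np.allclose(e1, e2))   # True: same non-zero spectrum
```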
Consider two variable sets with correlation structure:
Set 1: X₁, X₂ (p=2)
Set 2: Y₁, Y₂ (q=2)
Correlation matrix blocks:
R₁₁ (within set 1)
R₁₂ (between sets)
R₂₂ (within set 2)
Step 1: Compute the matrix product:
M = R₁₁⁻¹R₁₂R₂₂⁻¹R₂₁
Step 2: Find eigenvalues of M:
λ₁ = 0.49, λ₂ = 0.09
Step 3: Canonical correlations:
ρ₁ = √0.49 = 0.70, ρ₂ = √0.09 = 0.30
First Dimension
ρ₁ = 0.70 captures main relationship between sets
Second Dimension
ρ₂ = 0.30 captures residual relationship
Test overall relationship with n = 100:
Λ = (1 − 0.49)(1 − 0.09) = 0.51 × 0.91 ≈ 0.46
χ² = −[n − 1 − (p + q + 1)/2] ln Λ = −96.5 × ln(0.46) ≈ 75.0
Degrees of Freedom
df = pq = 2 × 2 = 4
Decision
χ² = 75.0, p < 0.001 → Reject H₀
Conclusion: Significant relationship between the two variable sets exists.
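The arithmetic above can be replayed with scipy; note the unrounded Λ = 0.4641 gives χ² ≈ 74.1, which matches the 75.0 reported above once Λ is rounded to 0.46 first:

```python
# Bartlett chi-square test of the full set of canonical correlations.
import numpy as np
from scipy.stats import chi2

rho = np.array([0.70, 0.30])
n, p, q = 100, 2, 2

lam = np.prod(1 - rho ** 2)                        # Wilks' Lambda
stat = -(n - 1 - (p + q + 1) / 2) * np.log(lam)    # Bartlett's approximation
df = p * q
pval = chi2.sf(stat, df)
print(round(lam, 4), round(stat, 1), df, pval < 0.001)   # 0.4641 74.1 4 True
```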
Canonical Weights (Coefficients)
Vectors that define the linear combinations
May be difficult to interpret due to multicollinearity
Canonical Loadings (Structure)
Correlations Corr(Xⱼ, Uᵢ) and Corr(Yⱼ, Vᵢ)
Preferred for interpretation; analogous to factor loadings
Best Practice: Report both weights and loadings. Use loadings with |r| > 0.3 or 0.4 for substantive interpretation of what each canonical variate represents.
Suppose first canonical pair has loadings:
U₁ Loadings (Set 1)
Verbal: 0.85, Math: 0.80, Spatial: 0.45
→ General cognitive ability
V₁ Loadings (Set 2)
GPA: 0.90, Test: 0.85, Projects: 0.70
→ Overall academic success
Interpretation: First dimension (ρ₁ = 0.75) represents the relationship between general cognitive ability and overall academic performance.
Minimum Requirement
n > p + q (matrix invertibility)
Conservative Rule
n ≥ 10(p + q) for stable estimates
For Inference
n ≥ 20(p + q) for reliable hypothesis tests
Shrinkage Issue
Sample canonical correlations overestimate population values
Power Consideration: Power depends on population canonical correlations, sample size, and number of variables. Use simulation or pilot data to plan adequate sample size.
When one set has a single variable, CCA reduces to multiple regression. The canonical correlation equals the multiple R.
Rule of thumb: n should be at least 10 times the total number of variables. CCA is sensitive to sample size, especially with many variables.
Standard CCA assumes continuous variables. For categorical data, consider correspondence analysis or use dummy coding (with caution about interpretation).
Loadings with |r| > 0.3 are considered meaningful. They show the correlation between original variables and canonical variates. Focus on structure coefficients for interpretation.
Redundancy measures the proportion of variance in one variable set explained by the canonical variates of the other set. It provides practical effect size.