Master the art and science of Bayesian statistical inference: from philosophical foundations to practical applications, learn to update beliefs with data and quantify uncertainty
Core mathematical framework underlying Bayesian inference
Rigorous derivations of fundamental Bayesian results
For a parameter θ and data x, the posterior density is proportional to the likelihood times the prior.
This formula tells us exactly how to update our beliefs (prior) with new evidence (likelihood) to form new beliefs (posterior).
By definition of conditional density for continuous random variables: π(θ|x) = f(x, θ) / m(x).
The joint density f(x, θ) can be written as likelihood times prior: f(x, θ) = L(x|θ) π(θ).
The marginal density of x is obtained by integrating out θ from the joint density: m(x) = ∫ L(x|θ) π(θ) dθ.
Substituting the numerator and denominator back into the conditional probability definition: π(θ|x) = L(x|θ) π(θ) / ∫ L(x|θ) π(θ) dθ.
Since the denominator m(x) does not depend on θ, it is a normalizing constant.
If we observe x₂ after x₁, the posterior π(θ|x₁) becomes the new prior.
If the prior is Beta(1, 1) (Uniform) and we observe 1 success in 1 trial (Binomial):
the posterior is proportional to θ¹(1 - θ)⁰ · 1 = θ,
which is the Beta(2, 1) distribution.
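A quick way to see this proportionality at work is to discretize θ on a grid and apply Bayes' theorem numerically. The sketch below (a minimal illustration assuming NumPy is available; the grid size and variable names are ours) reproduces the Uniform-prior, one-success example.

```python
import numpy as np

# Grid approximation of Bayes' theorem: posterior ∝ likelihood × prior.
theta = np.linspace(0.001, 0.999, 999)       # grid of parameter values
prior = np.ones_like(theta)                  # Uniform prior = Beta(1, 1)
likelihood = theta                           # 1 success in 1 Bernoulli trial: θ¹(1-θ)⁰
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta)   # divide by m(x)

# Compare with the exact Beta(2, 1) answer, whose density is 2θ and whose mean is 2/3.
print(np.trapz(theta * posterior, theta))    # ≈ 0.6667
```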
The posterior mean E[θ|X] minimizes the posterior expected squared error loss.
This justifies why we often use the posterior mean as a point estimate.
Consider the squared error loss function L(θ, δ) = (θ - δ)². We want to minimize the posterior expected loss (risk) ρ(δ) = E[(θ - δ)² | x] = ∫ (θ - δ)² π(θ|x) dθ.
Add and subtract the posterior mean μ(x) = E[θ|x] inside the square: (θ - δ)² = ((θ - μ(x)) + (μ(x) - δ))².
Expand the square: (A+B)² = A² + B² + 2AB.
Take expectation with respect to π(θ|x). Note that δ and μ(x) are constant w.r.t. this expectation.
The term E[θ - μ(x)|x] = E[θ|x] - μ(x) = μ(x) - μ(x) = 0. So the cross term vanishes.
The risk is Var(θ|x) + (μ(x) - δ)². To minimize this with respect to δ, we must set the second term to zero, i.e. choose δ = μ(x) = E[θ|x].
For a Normal posterior θ|x ~ N(μₙ, σₙ²),
the Bayes estimator under squared error loss is simply the posterior mean μₙ.
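To make the result concrete, the sketch below (assuming NumPy and SciPy are available) evaluates the posterior expected squared error loss on a grid of candidate estimates δ for the Beta(2, 1) posterior from the earlier example, and confirms that the minimizer coincides with the posterior mean.

```python
import numpy as np
from scipy import stats

posterior = stats.beta(2, 1)                      # posterior from the Uniform-prior example
theta = np.linspace(0.001, 0.999, 2000)
density = posterior.pdf(theta)

deltas = np.linspace(0.0, 1.0, 501)               # candidate point estimates
risk = [np.trapz((theta - d) ** 2 * density, theta) for d in deltas]

print(deltas[np.argmin(risk)], posterior.mean())  # both ≈ 2/3
```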
The posterior median minimizes the posterior expected absolute error loss.
This provides the theoretical foundation for using posterior median as a robust point estimate.
Consider the absolute error loss L(θ, δ) = |θ - δ|. The posterior expected loss is ρ(δ) = ∫ |θ - δ| π(θ|x) dθ.
Separate the integral at δ: ρ(δ) = ∫_{θ<δ} (δ - θ) π(θ|x) dθ + ∫_{θ>δ} (θ - δ) π(θ|x) dθ.
Use the Leibniz rule for differentiation under the integral sign: ρ'(δ) = F(δ|x) - (1 - F(δ|x)) = 2F(δ|x) - 1, where F(·|x) is the posterior CDF.
The minimum occurs when the derivative equals zero: 2F(δ|x) - 1 = 0.
F(δ|x) = 0.5 is exactly the definition of the median, so the minimizer is the posterior median.
The second derivative is 2π(δ|x) > 0, confirming this is a minimum.
For a skewed posterior distribution, the median may differ significantly from the mean.
The median is more robust to outliers in the posterior, making it a preferred point estimate when the posterior is asymmetric.
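The sketch below (assuming NumPy and SciPy) uses a right-skewed Gamma(2, 1) density as a stand-in posterior: minimizing the posterior expected absolute loss over a grid of estimates recovers the posterior median (≈ 1.68), which lies below the mean (2.0).

```python
import numpy as np
from scipy import stats

posterior = stats.gamma(a=2, scale=1.0)           # a right-skewed stand-in posterior
theta = np.linspace(0.0, 20.0, 4000)
density = posterior.pdf(theta)

deltas = np.linspace(0.5, 4.0, 701)               # candidate point estimates
risk = [np.trapz(np.abs(theta - d) * density, theta) for d in deltas]

print(deltas[np.argmin(risk)])                    # ≈ 1.68, the posterior median
print(posterior.median(), posterior.mean())       # ≈ 1.678 vs 2.0
```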
If π(θ) belongs to a conjugate family for likelihood L(x|θ), then the posterior π(θ|x) belongs to the same family.
Conjugate families provide closed-form posterior distributions, enabling exact Bayesian inference.
A family F is conjugate for the likelihood L(x|θ) if, whenever the prior π(θ) belongs to F, the posterior π(θ|x) also belongs to F.
Many conjugate pairs arise from exponential families. The likelihood has the form L(x|θ) = h(x) exp{η(θ)·T(x) - A(θ)}.
The conjugate prior has a form matching the sufficient statistics: π(θ) ∝ exp{η(θ)·τ - n₀A(θ)}.
Multiply likelihood and prior: π(θ|x) ∝ exp{η(θ)·(τ + T(x)) - (n₀ + 1)A(θ)}.
The posterior has the same functional form with updated hyperparameters τ → τ + T(x) and n₀ → n₀ + 1.
For Binomial data with a Beta(α, β) prior, the posterior is Beta(α + x, β + n - x).
For Poisson data with a Gamma(α, β) prior,
the posterior is Gamma(α + Σxᵢ, β + n).
The Gamma family is conjugate to the Poisson likelihood.
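Conjugacy reduces posterior computation to a hyperparameter update. The helper functions below are a minimal sketch (the function names are ours, not from any particular library) of the two pairs just mentioned; the same two updates reappear in the worked problems further down.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Beta(alpha, beta) prior + Binomial data  ->  Beta posterior."""
    return alpha + successes, beta + trials - successes

def gamma_poisson_update(alpha, beta, total_count, n):
    """Gamma(alpha, beta) prior (rate parameterization) + Poisson data  ->  Gamma posterior."""
    return alpha + total_count, beta + n

print(beta_binomial_update(2, 2, successes=7, trials=10))   # (9, 5)
print(gamma_poisson_update(6, 2, total_count=25, n=10))     # (31, 12)
```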
Under regularity conditions, the posterior distribution concentrates on the true parameter value as sample size increases.
Bayesian inference is asymptotically consistent: with enough data, the posterior concentrates at the true value.
Let X₁, ..., Xₙ be i.i.d. from f(x|θ₀). We want to show the posterior concentrates at θ₀.
For any neighborhood U of θ₀, we need the posterior probability Π(θ ∈ U | X₁, ..., Xₙ) → 1 as n → ∞.
The key is the log-likelihood ratio: for θ ≠ θ₀, by the law of large numbers, (1/n) Σᵢ log[f(Xᵢ|θ)/f(Xᵢ|θ₀)] → -KL(f(·|θ₀) ‖ f(·|θ)).
The Kullback-Leibler divergence KL(f(·|θ₀) ‖ f(·|θ)) = E[log f(X|θ₀)/f(X|θ)] (expectation under θ₀) is strictly positive for θ ≠ θ₀, so this limit is negative.
The posterior mass outside U therefore shrinks to zero, exponentially fast under the regularity conditions.
The posterior concentrates on arbitrarily small neighborhoods of θ₀.
For Normal N(μ, σ²) data with any bounded prior on μ that puts positive density near the true value,
the posterior for μ concentrates at the true value as n → ∞.
This demonstrates Bayesian consistency.
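A small simulation makes the concentration visible. The sketch below (NumPy and SciPy assumed; the true mean, grid, and neighborhood half-width 0.2 are arbitrary illustrative choices) computes a grid posterior for μ under a flat prior and tracks the posterior mass of a fixed neighborhood of the true value as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma = 1.0, 1.0
mu = np.linspace(-4, 6, 2001)                     # grid for the parameter
near = np.abs(mu - true_mu) < 0.2                 # fixed neighborhood U of the true value

for n in (10, 100, 1000):
    x = rng.normal(true_mu, sigma, size=n)
    loglik = stats.norm.logpdf(x[:, None], loc=mu, scale=sigma).sum(axis=0)
    post = np.exp(loglik - loglik.max())          # flat prior; rescale for numerical stability
    post /= np.trapz(post, mu)
    print(n, round(np.trapz(post[near], mu[near]), 4))   # posterior mass of U approaches 1
```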
Problem:
A coin has unknown probability p of heads. We use a Beta(2, 2) prior (symmetric, centered at 0.5). After observing 7 heads in 10 flips, find the posterior distribution and the Bayesian estimate of p.
Solution:
The Beta(2, 2) prior has density proportional to p(1 - p), and the Binomial likelihood of 7 heads in 10 flips is proportional to p⁷(1 - p)³. Multiplying, the posterior is proportional to p⁸(1 - p)⁴, which is the kernel of Beta(2 + 7, 2 + 3) = Beta(9, 5). The Bayes estimate under squared error loss is the posterior mean 9/14 ≈ 0.643.
Key Insight:
The Beta prior naturally updates to a Beta posterior. The posterior mean (0.643) is between the prior mean (0.5) and the sample proportion (0.7), reflecting the combination of prior belief and data evidence.
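A quick check of this calculation, assuming SciPy is available:

```python
from scipy import stats

posterior = stats.beta(9, 5)              # Beta(2 + 7, 2 + 3)
print(posterior.mean())                   # 9/14 ≈ 0.643, the Bayes estimate under squared loss
print(posterior.interval(0.95))           # central 95% credible interval for p
```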
Problem:
Data X₁, ..., Xₙ ~ N(μ, σ²) with σ² known. Use the prior μ ~ N(10, τ₀²). Given n observations with sample mean x̄ = 12, find the posterior distribution of μ.
Solution:
With a Normal prior and a Normal likelihood (σ² known), the posterior is Normal: μ | x ~ N(μₙ, σₙ²) with σₙ² = 1/(1/τ₀² + n/σ²) and μₙ = σₙ²(10/τ₀² + n·x̄/σ²), a precision-weighted average of the prior mean and the sample mean. For the values in this problem this evaluates to μₙ ≈ 11.96 and σₙ² ≈ 0.157.
Key Insight:
The posterior mean (11.96) is a weighted average of the prior mean (10) and sample mean (12), with weights proportional to precisions. The posterior variance (0.157) is smaller than both the prior variance and the sampling variance of the sample mean, reflecting the pooling of information from both sources.
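The precision-weighted update can be packaged as a small function. The sketch below is illustrative only: the prior mean 10 and sample mean 12 come from the problem, while τ₀², σ², and n were not given, so the values used in the call are hypothetical placeholders chosen merely to be consistent with the posterior mean ≈ 11.96 and variance ≈ 0.157 quoted above.

```python
def normal_posterior(mu0, tau0_sq, sigma_sq, n, xbar):
    """Posterior of mu for N(mu, sigma_sq) data (sigma_sq known) and N(mu0, tau0_sq) prior."""
    prior_precision = 1.0 / tau0_sq
    data_precision = n / sigma_sq
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu0 + data_precision * xbar)
    return post_mean, post_var

# tau0_sq, sigma_sq and n below are hypothetical placeholders, not values from the problem.
print(normal_posterior(mu0=10, tau0_sq=8, sigma_sq=4, n=25, xbar=12))   # ≈ (11.96, 0.157)
```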
Step-by-step solutions to advanced Bayesian inference problems
Problem:
A factory records the number of defects per day. Historical data suggests an average of 3 defects/day. We use Gamma(6, 2) as prior for λ. After 10 days with total 25 defects, find the posterior distribution and Bayes estimate.
Solution:
Identify the model
Poisson likelihood with Gamma prior forms a conjugate pair: Xᵢ | λ ~ Poisson(λ) with λ ~ Gamma(α, β).
Prior parameters
The prior Gamma(6, 2) has mean α/β = 3 and variance α/β² = 1.5, matching the historical average of 3 defects/day.
Likelihood function
For n = 10 days with Σxᵢ = 25 defects, the likelihood is L(λ) ∝ λ²⁵ e^(-10λ).
Apply conjugacy
Posterior is Gamma(α + Σxᵢ, β + n) = Gamma(6 + 25, 2 + 10) = Gamma(31, 12).
Posterior moments
Calculate posterior mean and variance: E[λ|x] = 31/12 ≈ 2.583 and Var(λ|x) = 31/144 ≈ 0.215.
95% credible interval
Using the 2.5% and 97.5% quantiles of Gamma(31, 12), the 95% equal-tail credible interval for λ is approximately (1.76, 3.57).
Key Insight:
The posterior mean (2.583) is between the prior mean (3) and the sample mean (25/10 = 2.5), but closer to the sample mean because n=10 provides substantial data. The Gamma-Poisson conjugacy provides closed-form solutions.
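The posterior summaries and the credible interval from the solution above can be verified numerically; the sketch below assumes SciPy, whose Gamma distribution is parameterized by shape and scale = 1/rate.

```python
from scipy import stats

posterior = stats.gamma(a=31, scale=1 / 12)        # Gamma(31, 12) in shape/rate terms
print(posterior.mean(), posterior.var())           # ≈ 2.583 and ≈ 0.215
print(posterior.interval(0.95))                    # equal-tail 95% credible interval, ≈ (1.76, 3.57)
```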
Problem:
After observing 7 heads in 10 coin flips with Beta(2,2) prior, compute the predictive probability of getting exactly 2 heads in the next 3 flips.
Solution:
Find posterior
With Beta(2,2) prior and 7 heads in 10 flips, the posterior is Beta(2 + 7, 2 + 3) = Beta(9, 5).
Set up predictive distribution
The posterior predictive for Y = number of heads in 3 future flips: P(Y = y | data) = ∫ C(3, y) p^y (1 - p)^(3-y) π(p | data) dp.
Use Beta-Binomial formula
For a Beta(a, b) posterior and a Binomial(m, p) prediction: P(Y = y) = C(m, y) B(a + y, b + m - y) / B(a, b), where B(·,·) is the Beta function.
Substitute values
With a = 9, b = 5, m = 3, y = 2: P(Y = 2) = C(3, 2) B(11, 6) / B(9, 5).
Calculate
Using Gamma function properties, B(11, 6)/B(9, 5) = (10 · 9 · 5)/(16 · 15 · 14) = 450/3360 ≈ 0.134, so P(Y = 2) = 3 × 0.134 ≈ 0.402.
Interpretation
The probability of exactly 2 heads in 3 future flips is about 40.2%.
Key Insight:
The posterior predictive distribution integrates over parameter uncertainty, giving more realistic predictions than plug-in estimates. The Beta-Binomial model provides closed-form solutions for this common scenario.
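The Beta-Binomial formula is straightforward to evaluate directly; the sketch below (SciPy assumed, helper name ours) reproduces the ≈ 0.402 figure and checks that the predictive probabilities sum to one.

```python
from math import comb
import numpy as np
from scipy.special import betaln

def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) for y successes in m future trials, given a Beta(a, b) posterior."""
    return comb(m, y) * np.exp(betaln(a + y, b + m - y) - betaln(a, b))

print(beta_binomial_pmf(2, m=3, a=9, b=5))                    # ≈ 0.402
print(sum(beta_binomial_pmf(y, 3, 9, 5) for y in range(4)))   # ≈ 1.0
```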
Problem:
For a Beta(8, 4) posterior distribution, find the 95% HPD credible interval and compare it to the equal-tail interval.
Solution:
Understand HPD
HPD is the shortest interval with a given posterior probability. For a unimodal posterior it is the set {θ : π(θ|x) ≥ k}, with the threshold k chosen so that the set has the required probability.
Equal-tail interval
Using Beta(8,4) quantiles at 2.5% and 97.5%, the equal-tail 95% interval is approximately (0.390, 0.891).
HPD construction
For Beta(8,4), the mode is at (8-1)/(8+4-2) = 7/10 = 0.7. HPD is centered near mode.
Find HPD numerically
HPD bounds satisfy π(θ_L) = π(θ_U) and ∫π(θ)dθ = 0.95 over [θ_L, θ_U]; solving numerically gives approximately (0.41, 0.91).
Compare interval lengths
Calculate the width of each interval:
Conclusion
The HPD interval is shorter (width ≈ 0.495 vs ≈ 0.501 for the equal-tail interval): by construction, it is the shortest interval with 95% posterior coverage.
Key Insight:
For asymmetric distributions like Beta(8,4), HPD intervals are shorter than equal-tail intervals. HPD intervals include all points with highest posterior density, making them optimal for reporting uncertainty.
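For a unimodal posterior, the HPD interval can be found by searching over all intervals that carry 95% posterior mass and keeping the shortest one. The sketch below (NumPy and SciPy assumed) does this for Beta(8, 4) and prints both intervals and their widths.

```python
import numpy as np
from scipy import stats

posterior = stats.beta(8, 4)

# Equal-tail interval: 2.5% and 97.5% quantiles.
lo_et, hi_et = posterior.ppf([0.025, 0.975])

# HPD: among intervals [ppf(t), ppf(t + 0.95)], take the shortest (valid for unimodal densities).
ts = np.linspace(1e-6, 0.05 - 1e-6, 2001)
widths = posterior.ppf(ts + 0.95) - posterior.ppf(ts)
t_best = ts[np.argmin(widths)]
lo_hpd, hi_hpd = posterior.ppf([t_best, t_best + 0.95])

print((lo_et, hi_et), hi_et - lo_et)      # equal-tail interval and its width
print((lo_hpd, hi_hpd), hi_hpd - lo_hpd)  # HPD interval: slightly narrower
```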
Problem:
Derive the Jeffreys prior for the Bernoulli parameter p and show it is Beta(1/2, 1/2).
Solution:
Jeffreys prior formula
Jeffreys prior is proportional to the square root of the Fisher information: π_J(p) ∝ √I(p).
Bernoulli log-likelihood
For X ~ Bernoulli(p): log L(x|p) = x log p + (1 - x) log(1 - p).
First derivative
Compute the score function: ∂ log L/∂p = x/p - (1 - x)/(1 - p).
Second derivative
Compute the second derivative: ∂² log L/∂p² = -x/p² - (1 - x)/(1 - p)².
Fisher Information
Take the expectation (using E[X] = p): I(p) = -E[∂² log L/∂p²] = p/p² + (1 - p)/(1 - p)² = 1/p + 1/(1 - p) = 1/(p(1 - p)).
Jeffreys prior
Take the square root: π_J(p) ∝ 1/√(p(1 - p)) = p^(-1/2)(1 - p)^(-1/2), which is the kernel of Beta(1/2, 1/2).
Key Insight:
Jeffreys prior Beta(1/2, 1/2) is a reference prior that is invariant under reparametrization. It places more weight near 0 and 1 than the uniform prior, reflecting that the Fisher information, and hence the data's ability to distinguish nearby parameter values, is largest near the boundaries.
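The Fisher-information computation can be checked symbolically. The sketch below assumes SymPy and uses the fact that the Hessian is linear in x, so taking the expectation amounts to substituting x = p.

```python
import sympy as sp

p, x = sp.symbols('p x', positive=True)
loglik = x * sp.log(p) + (1 - x) * sp.log(1 - p)    # Bernoulli log-likelihood

score = sp.diff(loglik, p)                          # x/p - (1 - x)/(1 - p)
hessian = sp.diff(loglik, p, 2)
fisher = sp.simplify(-hessian.subs(x, p))           # E[X] = p and the Hessian is linear in x

print(fisher)                                       # an expression equivalent to 1/(p*(1 - p))
# Jeffreys prior ∝ sqrt(fisher) = p**(-1/2) * (1 - p)**(-1/2), the Beta(1/2, 1/2) kernel.
```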
Problem:
Given data from N(μ, σ²) with both unknown, use conjugate Normal-Inverse-Gamma prior. With prior μ|σ² ~ N(0, σ²/κ₀) and σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2), derive the posterior.
Solution:
Joint prior specification
The Normal-Inverse-Gamma prior is conjugate for (μ, σ²): μ | σ² ~ N(μ₀, σ²/κ₀) and σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2).
Set prior hyperparameters
Choose weakly informative prior hyperparameters: κ₀ = 1, μ₀ = 0, ν₀ = 2, σ₀² = 1.
Posterior parameters for μ
The posterior for μ|σ², data is Normal with updated parameters: μ | σ², x ~ N(μₙ, σ²/κₙ), where κₙ = κ₀ + n and μₙ = (κ₀μ₀ + n·x̄)/κₙ.
Posterior parameters for σ²
The marginal posterior for σ² is Inverse-Gamma: σ² | x ~ Inv-Gamma(νₙ/2, νₙσₙ²/2), where νₙ = ν₀ + n and νₙσₙ² = ν₀σ₀² + (n - 1)s² + κ₀n(x̄ - μ₀)²/κₙ.
Example calculation
With n = 20, x̄ = 5, s² = 4 and prior κ₀ = 1, μ₀ = 0, ν₀ = 2, σ₀² = 1: κₙ = 21, μₙ = (1·0 + 20·5)/21 ≈ 4.76, νₙ = 22, and νₙσₙ² = 2 + 76 + 500/21 ≈ 101.8, so σ² | x ~ Inv-Gamma(11, 50.9).
Marginal posterior for μ
Integrating out σ², the marginal posterior for μ is Student-t: μ | x follows a t distribution with νₙ = 22 degrees of freedom, location μₙ ≈ 4.76, and squared scale σₙ²/κₙ ≈ 0.22.
Key Insight:
The Normal-Inverse-Gamma conjugate prior allows joint inference on mean and variance. The posterior mean shrinks the sample mean toward the prior mean, with the degree of shrinkage depending on κ₀ and n.
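The update rules from the solution can be collected into one function. The sketch below follows the parameterization used above (sample variance s² with divisor n - 1) and reproduces the numbers for n = 20, x̄ = 5, s² = 4.

```python
def nig_posterior(mu0, kappa0, nu0, sigma0_sq, n, xbar, s_sq):
    """Posterior hyperparameters for the Normal-Inverse-Gamma conjugate model."""
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    nu_sigma_sq = nu0 * sigma0_sq + (n - 1) * s_sq + kappa0 * n * (xbar - mu0) ** 2 / kappa_n
    return mu_n, kappa_n, nu_n, nu_sigma_sq / nu_n   # (mu_n, kappa_n, nu_n, sigma_n^2)

# Example from above: weakly informative prior, 20 observations.
print(nig_posterior(mu0=0, kappa0=1, nu0=2, sigma0_sq=1, n=20, xbar=5, s_sq=4))
# ≈ (4.76, 21, 22, 4.63): mu | data is centered at 4.76; sigma^2 | data ~ Inv-Gamma(11, 50.9)
```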
Problem:
In a meta-analysis of 8 clinical trials, observed effect sizes are: 0.5, 0.8, 0.3, 1.2, 0.6, 0.4, 0.9, 0.7 with known within-study variance σ² = 0.1. Use empirical Bayes to estimate the true effects and the between-study variance τ².
Solution:
Hierarchical model
Assume yᵢ|θᵢ ~ N(θᵢ, σ²) and θᵢ ~ N(μ, τ²).
Marginal distribution
Marginally, yᵢ ~ N(μ, σ² + τ²).
Estimate μ and τ² from data
Use method of moments or ML: the grand mean gives μ̂ = ȳ = 0.675, and the sample variance of the yᵢ estimates σ² + τ².
Calculate τ²
The sample variance of the observed effects is about 0.085. Subtracting the within-study variance σ² = 0.1 would give a negative value, so the moment estimate of τ² truncates at zero; for illustration the example carries forward τ̂² = 0.05.
Shrinkage estimator
The empirical Bayes estimate for θᵢ shrinks yᵢ toward μ̂: θ̂ᵢ = B·yᵢ + (1 - B)·μ̂, where B = τ̂²/(τ̂² + σ²) is the weight placed on the observed effect.
Calculate shrinkage
With τ̂² = 0.05 and σ² = 0.1, the weight on each observation is B = 0.05/(0.15) = 1/3.
Example shrinkage
For y₄ = 1.2 (the largest observed effect): θ̂₄ = (1/3)(1.2) + (2/3)(0.675) = 0.85.
Key Insight:
Empirical Bayes estimates the hyperparameters (μ, τ²) from the data itself, then uses these to construct a "pseudo-posterior" for each θᵢ. This provides adaptive shrinkage: extreme observations are pulled toward the overall mean.
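The whole empirical Bayes pipeline fits in a few lines. The sketch below (NumPy assumed) recomputes μ̂ and the moment estimate of τ² for these data; since the moment estimate truncates to zero here, the script carries the illustrative τ² = 0.05 from the worked example purely to reproduce the shrinkage arithmetic.

```python
import numpy as np

y = np.array([0.5, 0.8, 0.3, 1.2, 0.6, 0.4, 0.9, 0.7])   # observed effect sizes
sigma_sq = 0.1                                            # known within-study variance

mu_hat = y.mean()                                         # 0.675
tau_sq_mom = max(y.var(ddof=1) - sigma_sq, 0.0)           # moment estimate, truncated at 0
tau_sq = 0.05                                             # illustrative value from the example

B = tau_sq / (tau_sq + sigma_sq)                          # weight on each observation = 1/3
theta_hat = B * y + (1 - B) * mu_hat                      # shrinkage estimates
print(mu_hat, tau_sq_mom)
print(np.round(theta_hat, 3))                             # estimate for y4 = 1.2 is ≈ 0.85
```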
Test your understanding with 10 multiple-choice questions
Understanding the fundamental philosophical and practical differences
| Aspect | Classical | Bayesian | Bayesian Advantage |
|---|---|---|---|
| Parameter Nature | Fixed unknown constant | Random variable with distribution | Natural uncertainty representation |
| Information Used | Sample data only | Prior knowledge + sample data | Incorporates domain expertise |
| Interval Interpretation | 95% of intervals contain parameter | 95% probability parameter in interval | Direct probability statement |
| Small Sample Performance | May have poor coverage | Stabilized by prior information | Better finite-sample properties |
| Sequential Analysis | Requires stopping rules | Natural updating framework | Flexible data collection |
Common questions about Bayesian statistics and inference