
Bayesian Statistics & Inference

Master the art and science of Bayesian statistical inference: from philosophical foundations to practical applications, learn to update beliefs with data and quantify uncertainty

Mathematical Foundations

Core mathematical framework underlying Bayesian inference

Bayes' Theorem - The Heart of Bayesian Inference
Posterior = (Likelihood × Prior) / Evidence
\pi(\theta|\tilde{x}) = \frac{p(\tilde{x}|\theta)\,\pi(\theta)}{p(\tilde{x})}

Key Components:

π(θ|x̃): Posterior distribution - updated beliefs about θ
p(x̃|θ): Likelihood function - probability of data given θ
π(θ): Prior distribution - initial beliefs about θ
p(x̃): Marginal likelihood - normalizing constant
Beta-Binomial Conjugacy Example
Classic example of conjugate prior updating
\text{Beta}(a,b) + \text{Binomial}(n,x) \rightarrow \text{Beta}(a+x,\, b+n-x)

Key Components:

Prior: θ ~ Beta(a,b) represents beliefs about success probability
Data: x successes in n trials from Binomial(n,θ)
Posterior: θ|data ~ Beta(a+x, b+n-x)
Interpretation: add successes to a, failures to b
Bayesian Credible Interval
Direct probability statement about parameter location
P(\theta_L \leq \theta \leq \theta_U \mid \text{data}) = 1-\alpha

Key Components:

1-α probability that θ lies in the interval given the data
Different from confidence intervals (frequency interpretation)
Can be computed using posterior quantiles
Natural for decision-making and risk assessment
Posterior Predictive Distribution
Distribution of future observations accounting for parameter uncertainty
p(z|\tilde{x}) = \int p(z|\theta)\,\pi(\theta|\tilde{x})\,d\theta

Key Components:

z: future observation to be predicted
Integration over all possible parameter values
Weighted by posterior probability of each θ
Includes both aleatory and epistemic uncertainty
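
As a quick illustration of this integral, the following Python sketch approximates the posterior predictive by Monte Carlo: draw θ from the posterior, then draw a future observation given θ. It assumes the Beta(9, 5) posterior used in the worked examples below; the sample size, seed, and variable names are illustrative choices, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Draws from an assumed Beta(9, 5) posterior pi(theta | data)
theta = rng.beta(9, 5, size=n_draws)
# For each posterior draw, simulate a future observation z: successes in 3 new trials
z = rng.binomial(n=3, p=theta)

# Empirical posterior predictive P(Z = k | data), k = 0..3
for k in range(4):
    print(k, round(float((z == k).mean()), 3))   # k = 2 comes out near 0.40
```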

Core Theorem Proofs

Rigorous derivations of fundamental Bayesian results

Bayes' Theorem (Continuous Case)
The Foundation of Bayesian Inference

For a parameter θ and data x, the posterior density is proportional to the likelihood times the prior.

Theorem Statement

\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta')\,\pi(\theta')\,d\theta'}

This formula tells us exactly how to update our beliefs (prior) with new evidence (likelihood) to form new beliefs (posterior).

Proof Steps

1. Definition of Conditional Probability

   By definition of conditional density for continuous random variables:

   \pi(\theta|x) = \frac{f(x, \theta)}{m(x)}

2. Joint Density Decomposition

   The joint density f(x, θ) can be written as likelihood times prior:

   f(x, \theta) = f(x|\theta)\,\pi(\theta)

3. Marginal Density (Evidence)

   The marginal density of x is obtained by integrating out θ from the joint density:

   m(x) = \int f(x, \theta')\,d\theta' = \int f(x|\theta')\,\pi(\theta')\,d\theta'

4. Substitute and Combine

   Substituting the numerator and denominator back into the conditional probability definition:

   \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta')\,\pi(\theta')\,d\theta'}

5. Proportionality Form

   Since the denominator m(x) does not depend on θ, it is a normalizing constant.

   \pi(\theta|x) \propto f(x|\theta)\,\pi(\theta)

6. Sequential Updating

   If we observe x₂ after x₁, the posterior π(θ|x₁) becomes the new prior.

   \pi(\theta|x_1, x_2) \propto f(x_2|\theta)\,\pi(\theta|x_1) \propto f(x_2|\theta)\,f(x_1|\theta)\,\pi(\theta)

Example Application

If the prior π(θ) is Beta(1, 1) (uniform) and we observe 1 success in 1 trial (Binomial):

The posterior is proportional to:

\theta^1 (1-\theta)^0 \cdot 1 = \theta

which is Beta(2, 1).
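
The little sketch below checks this example numerically with a grid approximation of Bayes' theorem; the grid size is an arbitrary choice, and the comparison against scipy's Beta(2, 1) density is only a sanity check.

```python
import numpy as np
from scipy.stats import beta

theta = np.linspace(0.0005, 0.9995, 1000)
d_theta = theta[1] - theta[0]

prior = beta.pdf(theta, 1, 1)                # Beta(1, 1): uniform prior
likelihood = theta ** 1 * (1 - theta) ** 0   # one success in one Bernoulli trial

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * d_theta)  # normalize on the grid

# Should be close to zero: the grid posterior matches Beta(2, 1) up to grid error
print(np.max(np.abs(posterior - beta.pdf(theta, 2, 1))))
```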

Optimality of Posterior Mean
Bayes Estimator under Squared Error Loss

The posterior mean E[θ|X] minimizes the posterior expected squared error loss.

Theorem Statement

\hat{\theta}_{\text{Bayes}} = \arg\min_{\delta} E[(\theta - \delta)^2 \mid X] = E[\theta|X]

This justifies why we often use the posterior mean as a point estimate.

Proof Steps

1. Define Loss Function

   Consider the squared error loss function L(θ, δ) = (θ - δ)². We want to minimize the posterior expected loss (risk).

   R(\delta|x) = E[(\theta - \delta)^2 \mid x] = \int (\theta - \delta)^2\,\pi(\theta|x)\,d\theta

2. Expand the Quadratic

   Add and subtract the posterior mean μ(x) = E[θ|x] inside the square.

   (\theta - \delta)^2 = \big((\theta - \mu(x)) + (\mu(x) - \delta)\big)^2

3. Expand Terms

   Expand the square: (A+B)² = A² + B² + 2AB.

   (\theta - \mu(x))^2 + (\mu(x) - \delta)^2 + 2(\theta - \mu(x))(\mu(x) - \delta)

4. Take Expectation

   Take expectation with respect to π(θ|x). Note that δ and μ(x) are constant with respect to this expectation.

   E[(\theta - \delta)^2|x] = E[(\theta - \mu(x))^2|x] + (\mu(x) - \delta)^2 + 2(\mu(x) - \delta)\,E[\theta - \mu(x)|x]

5. Analyze Cross Term

   The term E[θ - μ(x)|x] = E[θ|x] - μ(x) = μ(x) - μ(x) = 0, so the cross term vanishes.

   2(\mu(x) - \delta) \cdot 0 = 0

6. Minimize

   The risk is Var(θ|x) + (μ(x) - δ)². To minimize this with respect to δ, we must set the second term to zero.

   (\mu(x) - \delta)^2 = 0 \implies \delta = \mu(x) = E[\theta|x]

Example Application

For a Normal posterior N(μₙ, σₙ²):

The Bayes estimator under squared error loss is simply μₙ.

\hat{\theta}_{\text{Bayes}} = E[\theta \mid X] = \mu_n
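
As a numerical cross-check, the sketch below estimates the posterior risk E[(θ − δ)² | x] over a grid of candidate estimates δ and confirms the minimizer sits at the posterior mean. The Beta(9, 5) posterior is borrowed from the worked examples on this page; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy.stats import beta

posterior = beta(9, 5)
theta = posterior.rvs(size=200_000, random_state=1)   # posterior draws

deltas = np.linspace(0.3, 0.9, 601)                   # candidate point estimates
risk = np.array([np.mean((theta - d) ** 2) for d in deltas])

print(deltas[risk.argmin()])   # minimizing delta
print(posterior.mean())        # posterior mean 9/14 ~ 0.643 -- they agree
```
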
Optimality of Posterior Median
Bayes Estimator under Absolute Loss

The posterior median minimizes the posterior expected absolute error loss.

Theorem Statement

\hat{\theta}_{\text{median}} = \arg\min_{\delta} E[\,|\theta - \delta|\,\mid X] = \text{median}(\theta \mid X)

This provides the theoretical foundation for using posterior median as a robust point estimate.

Proof Steps

1. Define Absolute Loss

   Consider the absolute error loss L(θ, δ) = |θ - δ|. The posterior expected loss is:

   R(\delta|x) = E[\,|\theta - \delta|\,\mid x] = \int |\theta - \delta|\,\pi(\theta|x)\,d\theta

2. Split the Integral

   Separate the integral at δ:

   R(\delta|x) = \int_{-\infty}^{\delta} (\delta - \theta)\,\pi(\theta|x)\,d\theta + \int_{\delta}^{\infty} (\theta - \delta)\,\pi(\theta|x)\,d\theta

3. Differentiate with Respect to δ

   Use the Leibniz rule for differentiation under the integral sign:

   \frac{dR}{d\delta} = \int_{-\infty}^{\delta} \pi(\theta|x)\,d\theta - \int_{\delta}^{\infty} \pi(\theta|x)\,d\theta = F(\delta|x) - (1 - F(\delta|x))

4. Set Derivative to Zero

   The minimum occurs when the derivative equals zero:

   \frac{dR}{d\delta} = 0 \implies 2F(\delta|x) - 1 = 0 \implies F(\delta|x) = 0.5

5. Identify the Median

   F(δ|x) = 0.5 is the definition of the median:

   \delta^* = \text{median}(\theta|x) = F^{-1}(0.5\,|\,x)

6. Verify Minimum

   The second derivative is 2π(δ|x) > 0, confirming this is a minimum:

   \frac{d^2R}{d\delta^2} = 2\pi(\delta|x) > 0 \implies \text{minimum}

Example Application

For a skewed posterior distribution:

The median may differ significantly from the mean.

The median is more robust to outliers in the posterior, making it a preferred point estimate when the posterior is asymmetric.

\hat{\theta}_{\text{median}} = F^{-1}(0.5 \mid X)
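
The same kind of numerical check works for absolute loss: the sketch below scans candidate estimates δ and confirms that the minimizer of E[|θ − δ| | x] is the posterior median (again using the Beta(9, 5) posterior from the examples; all numerical settings are arbitrary).

```python
import numpy as np
from scipy.stats import beta

posterior = beta(9, 5)
theta = posterior.rvs(size=200_000, random_state=2)

deltas = np.linspace(0.3, 0.9, 601)
risk = np.array([np.mean(np.abs(theta - d)) for d in deltas])

print(deltas[risk.argmin()])     # minimizing delta
print(posterior.median())        # posterior median (~0.65) -- they agree
```
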
Conjugate Prior Closure Property
Foundation for Analytical Bayesian Inference

If π(θ) belongs to a conjugate family for likelihood L(x|θ), then the posterior π(θ|x) belongs to the same family.

Theorem Statement

\pi(\theta) \in \mathcal{F} \text{ and conjugacy} \implies \pi(\theta|x) \in \mathcal{F}

Conjugate families provide closed-form posterior distributions, enabling exact Bayesian inference.

Proof Steps

1. Definition of Conjugacy

   A family F is conjugate for the likelihood L(x|θ) if:

   \pi(\theta) \in \mathcal{F} \implies \pi(\theta|x) = \frac{L(x|\theta)\,\pi(\theta)}{\int L(x|\theta')\,\pi(\theta')\,d\theta'} \in \mathcal{F}

2. Exponential Family Representation

   Many conjugate pairs arise from exponential families. The likelihood has the form:

   L(x|\theta) = h(x) \exp\{\eta(\theta)^T T(x) - A(\theta)\}

3. Natural Conjugate Prior

   The conjugate prior has a form matching the sufficient statistics:

   \pi(\theta|\tau, \nu) \propto \exp\{\eta(\theta)^T \tau - \nu A(\theta)\}

4. Posterior Calculation

   Multiply the likelihood and the prior:

   \pi(\theta|x) \propto \exp\{\eta(\theta)^T (\tau + T(x)) - (\nu + 1) A(\theta)\}

5. Identify Updated Parameters

   The posterior has the same functional form with updated hyperparameters:

   \tau_n = \tau + T(x), \quad \nu_n = \nu + 1

6. Beta-Binomial Example

   For Binomial data with a Beta prior:

   \text{Beta}(a,b) + \text{Binom}(n,x) \to \text{Beta}(a+x,\, b+n-x)

Example Application

For Poisson(λ) data with a Gamma(α, β) prior:

The posterior is:

\text{Gamma}\left(\alpha + \sum_{i=1}^n x_i,\ \beta + n\right)

The Gamma family is conjugate to the Poisson likelihood.
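
A short numerical sketch of this closure property: compute the posterior on a grid directly from Bayes' theorem and compare it with the conjugate Gamma update. The Poisson counts below are invented for illustration (they total 25 over 10 days, matching the Gamma-Poisson example later on this page).

```python
import numpy as np
from scipy.stats import gamma, poisson

alpha0, beta0 = 6.0, 2.0                       # Gamma prior (shape, rate)
x = np.array([2, 4, 1, 3, 5, 2, 3, 1, 2, 2])   # hypothetical Poisson counts, n = 10

lam = np.linspace(0.01, 10, 2000)
d_lam = lam[1] - lam[0]

log_post = gamma.logpdf(lam, alpha0, scale=1 / beta0) + poisson.logpmf(x[:, None], lam).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * d_lam                     # normalize the grid posterior

conjugate = gamma.pdf(lam, alpha0 + x.sum(), scale=1 / (beta0 + len(x)))  # Gamma(31, 12)
print(np.max(np.abs(post - conjugate)))        # ~0 up to grid error
```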

Posterior Consistency (Doob's Theorem)
Asymptotic Justification for Bayesian Methods

Under regularity conditions, the posterior distribution concentrates on the true parameter value as sample size increases.

Theorem Statement

\pi(|\theta - \theta_0| > \epsilon \mid X_1, \ldots, X_n) \xrightarrow{a.s.} 0 \text{ as } n \to \infty

Bayesian inference is asymptotically consistent: with enough data, the posterior concentrates at the true value.

Proof Steps

1. Setup

   Let X₁, ..., Xₙ be i.i.d. from f(x|θ₀). We want to show the posterior concentrates at θ₀.

   X_1, \ldots, X_n \stackrel{iid}{\sim} f(x|\theta_0)

2. Posterior Concentration

   For any neighborhood U of θ₀, we need:

   \pi(\theta \in U^c \mid X_1, \ldots, X_n) \to 0 \text{ a.s. under } P_{\theta_0}

3. Likelihood Ratio Analysis

   The key is the log-likelihood ratio: for θ ≠ θ₀, by the law of large numbers:

   \frac{1}{n} \log \frac{L_n(\theta)}{L_n(\theta_0)} \to -D_{KL}(\theta_0 \| \theta) < 0 \text{ a.s.}

4. KL Divergence Separation

   The Kullback-Leibler divergence is positive for θ ≠ θ₀:

   D_{KL}(\theta_0 \| \theta) = E_{\theta_0}\left[\log \frac{f(X|\theta_0)}{f(X|\theta)}\right] > 0

5. Posterior Ratio Bound

   The posterior mass outside U shrinks exponentially:

   \pi(U^c \mid X_{1:n}) \leq \frac{\pi(U^c)}{\pi(U)} \cdot \sup_{\theta \in U^c} \frac{L_n(\theta)}{L_n(\theta_0)} \to 0

6. Conclusion

   The posterior concentrates on arbitrarily small neighborhoods of θ₀:

   \pi(\theta \in B_\epsilon(\theta_0) \mid X_{1:n}) \to 1 \text{ a.s.}

Example Application

For Normal data with any prior that places positive density around the true mean μ₀:

The posterior for μ concentrates at the true value as n → ∞:

\pi\left(|\mu - \mu_0| > \epsilon \mid X_1, \ldots, X_n\right) \xrightarrow{a.s.} 0

This demonstrates Bayesian consistency.
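
The simulation sketch below illustrates this consistency for a Normal mean with known variance: as n grows, the posterior probability of {|μ − μ₀| > ε} shrinks toward zero. The true parameter values, the prior, and ε are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu_true, sigma, eps = 2.0, 1.0, 0.25
mu0, tau0 = 0.0, 5.0                      # diffuse Normal prior on mu

for n in [10, 100, 1_000, 10_000]:
    x = rng.normal(mu_true, sigma, size=n)
    prec_n = 1 / tau0**2 + n / sigma**2                     # posterior precision
    mu_n = (mu0 / tau0**2 + x.sum() / sigma**2) / prec_n    # posterior mean
    sd_n = prec_n ** -0.5
    # Posterior mass outside the eps-ball around the true mean
    outside = norm.cdf(mu_true - eps, mu_n, sd_n) + norm.sf(mu_true + eps, mu_n, sd_n)
    print(n, round(float(outside), 6))
```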

Example: Beta-Binomial Conjugate Prior Updating

Problem:

A coin has unknown probability p of heads. We use a Beta(2, 2) prior (symmetric, centered at 0.5). After observing 7 heads in 10 flips, find the posterior distribution and the Bayesian estimate of p.

Solution:

  1. Prior distribution:
     \pi(p) = \text{Beta}(a=2, b=2) = \frac{p^{2-1}(1-p)^{2-1}}{B(2,2)} = \frac{p(1-p)}{B(2,2)}
     where B(2,2) = \frac{\Gamma(2)\Gamma(2)}{\Gamma(4)} = \frac{1! \cdot 1!}{3!} = \frac{1}{6}
  2. Likelihood function: For x = 7 heads in n = 10 flips:
     L(p) = \binom{10}{7} p^7 (1-p)^3
  3. Posterior distribution: Using Bayes' theorem:
     \pi(p|\text{data}) \propto L(p)\,\pi(p) \propto p^7(1-p)^3 \cdot p(1-p) = p^8(1-p)^4
     This is proportional to Beta(9, 5):
     \pi(p|\text{data}) = \text{Beta}(a+x,\, b+n-x) = \text{Beta}(2+7,\, 2+10-7) = \text{Beta}(9, 5)
  4. Posterior mean (Bayesian estimate):
     E[p|\text{data}] = \frac{a+x}{a+b+n} = \frac{9}{9+5} = \frac{9}{14} \approx 0.643
  5. Posterior variance:
     \text{Var}(p|\text{data}) = \frac{(a+x)(b+n-x)}{(a+b+n)^2(a+b+n+1)} = \frac{9 \times 5}{14^2 \times 15} = \frac{45}{2940} \approx 0.0153
  6. 95% credible interval: Using Beta(9, 5) quantiles:
     P(0.39 \leq p \leq 0.86 \mid \text{data}) \approx 0.95

Key Insight:

The Beta prior naturally updates to a Beta posterior. The posterior mean (0.643) is between the prior mean (0.5) and the sample proportion (0.7), reflecting the combination of prior belief and data evidence.
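
The calculation above is easy to reproduce with scipy.stats; the sketch below simply verifies the posterior summaries, with the credible interval taken as the equal-tail interval of Beta(9, 5).

```python
from scipy.stats import beta

a, b, n, x = 2, 2, 10, 7
posterior = beta(a + x, b + n - x)      # Beta(9, 5)

print(posterior.mean())                 # 9/14 ~ 0.643
print(posterior.var())                  # ~0.0153
print(posterior.interval(0.95))         # 95% equal-tail credible interval
```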

Example: Normal-Normal Conjugate Prior for Mean

Problem:

Data X₁, ..., Xₙ ~ N(μ, σ²) with known σ² = 4. Use the prior μ ~ N(μ₀ = 10, τ₀² = 9). Given n = 25 observations with x̄ = 12, find the posterior distribution of μ.

Solution:

  1. Prior parameters:
     \mu_0 = 10, \quad \tau_0^2 = 9, \quad \tau_0 = 3
  2. Sample information:
     n = 25, \quad \bar{x} = 12, \quad \sigma^2 = 4, \quad \sigma = 2
  3. Posterior precision:
     \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} = \frac{1}{9} + \frac{25}{4} = \frac{4 + 225}{36} = \frac{229}{36}
     Therefore: \tau_n^2 = \frac{36}{229} \approx 0.157
  4. Posterior mean:
     \mu_n = \frac{\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} = \frac{\frac{10}{9} + 75}{\frac{229}{36}} = \frac{685/9}{229/36} = \frac{685 \times 4}{229} = \frac{2740}{229} \approx 11.96
  5. Posterior distribution:
     \mu \mid \text{data} \sim N(\mu_n \approx 11.96,\ \tau_n^2 \approx 0.157)
  6. 95% credible interval:
     \mu_n \pm 1.96\,\tau_n = 11.96 \pm 1.96 \times 0.396 = [11.18, 12.74]

Key Insight:

The posterior mean (11.96) is a weighted average of the prior mean (10) and sample mean (12), with weights proportional to precisions. The posterior variance (0.157) is smaller than both prior and sample variances, reflecting the combination of information.
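
The Normal-Normal update above can be verified in a few lines of Python; the sketch simply recomputes the posterior precision, mean, and 95% credible interval from the stated prior and sample summaries.

```python
import numpy as np
from scipy.stats import norm

mu0, tau0_sq = 10.0, 9.0                # prior mean and variance
sigma_sq, n, xbar = 4.0, 25, 12.0       # known variance, sample size, sample mean

prec_n = 1 / tau0_sq + n / sigma_sq                      # posterior precision 229/36
tau_n_sq = 1 / prec_n                                    # ~0.157
mu_n = (mu0 / tau0_sq + n * xbar / sigma_sq) / prec_n    # ~11.96

print(mu_n, tau_n_sq)
print(norm.interval(0.95, loc=mu_n, scale=np.sqrt(tau_n_sq)))   # ~[11.19, 12.74]
```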

Advanced Worked Examples

Step-by-step solutions to advanced Bayesian inference problems

Gamma-Poisson Conjugate Prior

Problem:

A factory records the number of defects per day. Historical data suggests an average of 3 defects/day. We use Gamma(6, 2) as prior for λ. After 10 days with total 25 defects, find the posterior distribution and Bayes estimate.

Solution:

  1. Identify the model

     Poisson likelihood with a Gamma prior forms a conjugate pair:

     X_i \sim \text{Poisson}(\lambda), \quad \lambda \sim \text{Gamma}(\alpha, \beta)

  2. Prior parameters

     The prior Gamma(6, 2) has mean α/β = 3 and variance α/β² = 1.5:

     \pi(\lambda) = \text{Gamma}(\alpha=6, \beta=2): \quad E[\lambda] = 3, \quad \text{Var}(\lambda) = 1.5

  3. Likelihood function

     For n = 10 days with Σxᵢ = 25 defects:

     L(\lambda) = \prod_{i=1}^{10} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{25} e^{-10\lambda}

  4. Apply conjugacy

     The posterior is Gamma(α + Σxᵢ, β + n):

     \pi(\lambda|\text{data}) = \text{Gamma}(6 + 25,\ 2 + 10) = \text{Gamma}(31, 12)

  5. Posterior moments

     Calculate the posterior mean and variance:

     E[\lambda|\text{data}] = \frac{31}{12} \approx 2.583, \quad \text{Var}(\lambda|\text{data}) = \frac{31}{144} \approx 0.215

  6. 95% credible interval

     Using Gamma(31, 12) quantiles:

     P(1.77 \leq \lambda \leq 3.58 \mid \text{data}) = 0.95

Key Insight:

The posterior mean (2.583) is between the prior mean (3) and the sample mean (25/10 = 2.5), but closer to the sample mean because n=10 provides substantial data. The Gamma-Poisson conjugacy provides closed-form solutions.
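
The same numbers can be checked directly with scipy.stats; note that scipy parameterizes the Gamma distribution by shape and scale, so the rate β enters as scale = 1/β.

```python
from scipy.stats import gamma

alpha0, beta0 = 6, 2          # prior hyperparameters (shape, rate)
n, total = 10, 25             # 10 days, 25 defects in total

posterior = gamma(alpha0 + total, scale=1 / (beta0 + n))   # Gamma(shape=31, rate=12)
print(posterior.mean(), posterior.var())                   # ~2.583, ~0.215
print(posterior.interval(0.95))                            # 95% equal-tail credible interval
```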

Posterior Predictive Distribution

Problem:

After observing 7 heads in 10 coin flips with Beta(2,2) prior, compute the predictive probability of getting exactly 2 heads in the next 3 flips.

Solution:

  1. Find the posterior

     With a Beta(2, 2) prior and 7 heads in 10 flips:

     \pi(p|\text{data}) = \text{Beta}(2+7,\ 2+3) = \text{Beta}(9, 5)

  2. Set up the predictive distribution

     The posterior predictive for Y = number of heads in 3 future flips:

     P(Y=y|\text{data}) = \int_0^1 \binom{3}{y} p^y (1-p)^{3-y}\, \pi(p|\text{data})\, dp

  3. Use the Beta-Binomial formula

     For a Beta(a, b) posterior and a Binomial(m, p) prediction:

     P(Y=y) = \binom{m}{y} \frac{B(a+y, b+m-y)}{B(a,b)}

  4. Substitute values

     With a = 9, b = 5, m = 3, y = 2:

     P(Y=2) = \binom{3}{2} \frac{B(11, 6)}{B(9, 5)} = 3 \cdot \frac{\Gamma(11)\Gamma(6)/\Gamma(17)}{\Gamma(9)\Gamma(5)/\Gamma(14)}

  5. Calculate

     Using Gamma function properties:

     P(Y=2) = 3 \cdot \frac{10! \cdot 5! \cdot 13!}{16! \cdot 8! \cdot 4!} = 3 \cdot \frac{10 \cdot 9 \cdot 5}{16 \cdot 15 \cdot 14} = \frac{1350}{3360} \approx 0.402

  6. Interpretation

     The probability of exactly 2 heads in 3 future flips is about 40.2%:

     P(Y=2|\text{data}) \approx 0.402

Key Insight:

The posterior predictive distribution integrates over parameter uncertainty, giving more realistic predictions than plug-in estimates. The Beta-Binomial model provides closed-form solutions for this common scenario.
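
For completeness, here is a small sketch of the exact Beta-Binomial predictive probability, computed on the log scale with scipy's betaln for numerical stability; the helper function name is just an illustrative choice.

```python
from math import comb
import numpy as np
from scipy.special import betaln

def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) when p ~ Beta(a, b) and Y | p ~ Binomial(m, p)."""
    return comb(m, y) * np.exp(betaln(a + y, b + m - y) - betaln(a, b))

a, b, m = 9, 5, 3             # Beta(9, 5) posterior, 3 future flips
for y in range(m + 1):
    print(y, round(beta_binomial_pmf(y, m, a, b), 4))   # y = 2 gives ~0.402
```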

Highest Posterior Density (HPD) Interval

Problem:

For a Beta(8, 4) posterior distribution, find the 95% HPD credible interval and compare it to the equal-tail interval.

Solution:

  1. Understand the HPD interval

     The HPD interval is the shortest interval with the given posterior probability. For unimodal distributions:

     \text{HPD}: \{\theta : \pi(\theta|x) \geq c\} \text{ where } P(\theta \in \text{HPD}) = 0.95

  2. Equal-tail interval

     Using Beta(8, 4) quantiles at 2.5% and 97.5%:

     \text{Equal-tail: } [F^{-1}(0.025), F^{-1}(0.975)] = [0.432, 0.876]

  3. HPD construction

     For Beta(8, 4), the mode is at (8-1)/(8+4-2) = 7/10 = 0.7, and the HPD interval is centered near the mode:

     \text{Mode} = \frac{\alpha - 1}{\alpha + \beta - 2} = \frac{7}{10} = 0.7

  4. Find the HPD numerically

     The HPD bounds satisfy π(θ_L|x) = π(θ_U|x) and ∫ π(θ|x) dθ = 0.95 over the interval:

     \text{HPD: } [0.458, 0.889]

  5. Compare interval lengths

     Calculate the width of each interval:

     \text{Equal-tail width: } 0.876 - 0.432 = 0.444, \quad \text{HPD width: } 0.889 - 0.458 = 0.431

  6. Conclusion

     The HPD interval is shorter (0.431 vs 0.444): it is the shortest interval with 95% posterior coverage.

     \text{HPD width} < \text{Equal-tail width} \implies \text{HPD is optimal}

Key Insight:

For asymmetric distributions like Beta(8,4), HPD intervals are shorter than equal-tail intervals. HPD intervals include all points with highest posterior density, making them optimal for reporting uncertainty.
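
In practice both intervals are computed numerically. The sketch below recomputes the equal-tail interval from Beta quantiles and approximates the HPD interval as the shortest interval containing 95% of posterior draws; the sample size, seed, and function name are arbitrary choices.

```python
import numpy as np
from scipy.stats import beta

post = beta(8, 4)
draws = post.rvs(size=500_000, random_state=4)

def hpd_from_samples(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (approximate HPD)."""
    s = np.sort(samples)
    k = int(np.floor(mass * len(s)))
    widths = s[k:] - s[:-k]          # widths of all windows holding k+1 sorted draws
    i = int(np.argmin(widths))
    return s[i], s[i + k]

lo, hi = post.interval(0.95)         # equal-tail interval
h_lo, h_hi = hpd_from_samples(draws)
print("equal-tail:", (lo, hi), "width", hi - lo)
print("HPD:       ", (h_lo, h_hi), "width", h_hi - h_lo)   # HPD is the shorter one
```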

Jeffreys Prior Construction

Problem:

Derive the Jeffreys prior for the Bernoulli parameter p and show it is Beta(1/2, 1/2).

Solution:

  1. Jeffreys prior formula

     The Jeffreys prior is proportional to the square root of the Fisher information:

     \pi_J(\theta) \propto \sqrt{I(\theta)} \text{ where } I(\theta) = E\left[-\frac{\partial^2}{\partial \theta^2} \log f(X|\theta)\right]

  2. Bernoulli log-likelihood

     For X ~ Bernoulli(p):

     \log f(x|p) = x \log p + (1-x) \log(1-p)

  3. First derivative

     Compute the score function:

     \frac{\partial}{\partial p} \log f = \frac{x}{p} - \frac{1-x}{1-p}

  4. Second derivative

     Differentiate the score once more:

     \frac{\partial^2}{\partial p^2} \log f = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}

  5. Fisher information

     Take the expectation (using E[X] = p):

     I(p) = E\left[\frac{X}{p^2} + \frac{1-X}{(1-p)^2}\right] = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}

  6. Jeffreys prior

     Take the square root:

     \pi_J(p) \propto \sqrt{\frac{1}{p(1-p)}} = p^{-1/2}(1-p)^{-1/2} \propto \text{Beta}(1/2, 1/2)

Key Insight:

Jeffreys prior Beta(1/2, 1/2) is a reference prior that is invariant under reparametrization. It places more weight near 0 and 1 than the uniform prior, reflecting that extreme probabilities may be more common in practice.
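
A quick Monte Carlo check of the Fisher information behind this derivation: the average squared score at a given p should match the closed form 1/(p(1−p)). The values of p and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

for p in [0.1, 0.3, 0.5, 0.7]:
    x = rng.binomial(1, p, size=1_000_000)       # Bernoulli(p) draws
    score = x / p - (1 - x) / (1 - p)            # d/dp log f(x | p)
    print(p, round(float(np.mean(score ** 2)), 3), round(1 / (p * (1 - p)), 3))
```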

Bayesian Inference for Normal Mean and Variance

Problem:

Given data from N(μ, σ²) with both unknown, use conjugate Normal-Inverse-Gamma prior. With prior μ|σ² ~ N(0, σ²/κ₀) and σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2), derive the posterior.

Solution:

  1. Joint prior specification

     The Normal-Inverse-Gamma prior is conjugate for (μ, σ²):

     \pi(\mu, \sigma^2) = \pi(\mu|\sigma^2)\,\pi(\sigma^2) = N(\mu_0, \sigma^2/\kappa_0) \cdot \text{Inv-Gamma}(\nu_0/2,\ \nu_0\sigma_0^2/2)

  2. Set prior hyperparameters

     Choose weakly informative priors: κ₀ = 1, μ₀ = 0, ν₀ = 2, σ₀² = 1:

     \kappa_0 = 1, \quad \mu_0 = 0, \quad \nu_0 = 2, \quad \sigma_0^2 = 1

  3. Posterior parameters for μ

     The posterior for μ | σ², data is Normal with updated parameters:

     \mu|\sigma^2, \text{data} \sim N\left(\frac{\kappa_0 \mu_0 + n\bar{x}}{\kappa_0 + n},\ \frac{\sigma^2}{\kappa_0 + n}\right)

  4. Posterior parameters for σ²

     The marginal posterior for σ² is Inverse-Gamma:

     \sigma^2|\text{data} \sim \text{Inv-Gamma}\left(\frac{\nu_0 + n}{2},\ \frac{\nu_0 \sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{x} - \mu_0)^2}{2}\right)

  5. Example calculation

     With n = 20, x̄ = 5, s² = 4 and prior κ₀ = 1, μ₀ = 0, ν₀ = 2, σ₀² = 1:

     \mu_n = \frac{0 + 20(5)}{21} \approx 4.76, \quad \kappa_n = 21, \quad \nu_n = 22

  6. Marginal posterior for μ

     Integrating out σ², the marginal posterior for μ is Student-t:

     \mu|\text{data} \sim t_{\nu_n}\left(\mu_n,\ \frac{\text{scale}^2}{\kappa_n}\right)

Key Insight:

The Normal-Inverse-Gamma conjugate prior allows joint inference on mean and variance. The posterior mean shrinks the sample mean toward the prior mean, with the degree of shrinkage depending on κ₀ and n.
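
The update formulas above translate directly into code. The sketch below recomputes the posterior hyperparameters for the stated summaries (n = 20, x̄ = 5, s² = 4) and builds the Student-t marginal for μ; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import t

kappa0, mu0, nu0, sigma0_sq = 1.0, 0.0, 2.0, 1.0   # prior hyperparameters
n, xbar, s_sq = 20, 5.0, 4.0                       # data summaries

kappa_n = kappa0 + n                               # 21
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n         # ~4.76
nu_n = nu0 + n                                     # 22
nu_sigma_sq = nu0 * sigma0_sq + (n - 1) * s_sq + kappa0 * n / kappa_n * (xbar - mu0) ** 2
sigma_n_sq = nu_sigma_sq / nu_n                    # posterior scale for sigma^2

print(mu_n, kappa_n, nu_n, sigma_n_sq)

# Marginal posterior for mu: Student-t with nu_n degrees of freedom
marginal_mu = t(df=nu_n, loc=mu_n, scale=np.sqrt(sigma_n_sq / kappa_n))
print(marginal_mu.interval(0.95))                  # 95% credible interval for mu
```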

Empirical Bayes Estimation

Problem:

In a meta-analysis of 8 clinical trials, observed effect sizes are: 0.5, 0.8, 0.3, 1.2, 0.6, 0.4, 0.9, 0.7 with known within-study variance σ² = 0.1. Use empirical Bayes to estimate the true effects and the between-study variance τ².

Solution:

  1. Hierarchical model

     Assume yᵢ | θᵢ ~ N(θᵢ, σ²) and θᵢ ~ N(μ, τ²):

     y_i \mid \theta_i \sim N(\theta_i, \sigma^2), \quad \theta_i \sim N(\mu, \tau^2)

  2. Marginal distribution

     Marginally, yᵢ ~ N(μ, σ² + τ²):

     y_i \sim N(\mu, \sigma^2 + \tau^2)

  3. Estimate μ and τ² from the data

     Use the method of moments or maximum likelihood. The sample mean and between-study variance give:

     \hat{\mu} = \bar{y} = 0.675, \quad \hat{\tau}^2 = \max\left(0, \frac{\sum(y_i - \bar{y})^2}{n-1} - \sigma^2\right)

  4. Calculate τ²

     The sample variance of the effects is 0.085, so the moment estimate is truncated at zero; the example proceeds with a small positive working value:

     \hat{\tau}^2 = \max(0, 0.085 - 0.1) = 0, \quad \text{working value: } \hat{\tau}^2 \approx 0.05

  5. Shrinkage estimator

     The empirical Bayes estimate of θᵢ shrinks yᵢ toward μ̂:

     \hat{\theta}_i^{EB} = \frac{\tau^2}{\sigma^2 + \tau^2}\, y_i + \frac{\sigma^2}{\sigma^2 + \tau^2}\, \hat{\mu} = B\, y_i + (1-B)\, \hat{\mu}

  6. Calculate the shrinkage factor

     With τ² = 0.05 and σ² = 0.1, the shrinkage weight is B = 0.05/0.15 = 1/3:

     \hat{\theta}_i^{EB} = \frac{1}{3} y_i + \frac{2}{3}(0.675)

  7. Example shrinkage

     For y₄ = 1.2 (the largest effect): θ̂₄ = (1/3)(1.2) + (2/3)(0.675) = 0.85:

     \hat{\theta}_4^{EB} = 0.4 + 0.45 = 0.85 \text{ (shrunk from 1.2 toward 0.675)}

Key Insight:

Empirical Bayes estimates the hyperparameters (μ, τ²) from the data itself, then uses these to construct a "pseudo-posterior" for each θᵢ. This provides adaptive shrinkage: extreme observations are pulled toward the overall mean.
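
The whole shrinkage computation fits in a few lines; the sketch below reproduces it for the eight observed effects, using the working value τ² = 0.05 adopted in the example.

```python
import numpy as np

y = np.array([0.5, 0.8, 0.3, 1.2, 0.6, 0.4, 0.9, 0.7])   # observed effect sizes
sigma_sq = 0.1                                            # known within-study variance

mu_hat = y.mean()                                         # 0.675
tau_sq_mom = max(0.0, y.var(ddof=1) - sigma_sq)           # moment estimate, truncated at 0
tau_sq = 0.05                                             # working value from the example

B = tau_sq / (sigma_sq + tau_sq)                          # shrinkage weight on y_i (= 1/3)
theta_eb = B * y + (1 - B) * mu_hat                       # shrunken effect estimates

print(mu_hat, tau_sq_mom, B)
print(np.round(theta_eb, 3))                              # y4 = 1.2 shrinks to 0.85
```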

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the fundamental philosophical difference between Bayesian and Frequentist approaches?
2. In Bayes' theorem π(θ|x) ∝ L(x|θ)π(θ), what is π(θ)?
3. For a Beta(a,b) prior and Binomial(n,x) likelihood, what is the posterior distribution?
4. A 95% Bayesian credible interval means:
5. What is the posterior mean under squared error loss?
6. For a Poisson(λ) likelihood with a Gamma(α, β) prior, the posterior distribution is:
7. What does 'conjugate prior' mean?
8. The Jeffreys prior for a Bernoulli parameter p is:
9. What is the posterior predictive distribution used for?
10. Doob's Posterior Consistency Theorem states that:

Bayesian vs Classical Statistics

Understanding the fundamental philosophical and practical differences

| Aspect | Classical | Bayesian | Bayesian Advantage |
|---|---|---|---|
| Parameter nature | Fixed unknown constant | Random variable with a distribution | Natural uncertainty representation |
| Information used | Sample data only | Prior knowledge + sample data | Incorporates domain expertise |
| Interval interpretation | 95% of intervals contain the parameter | 95% probability the parameter is in the interval | Direct probability statement |
| Small-sample performance | May have poor coverage | Stabilized by prior information | Better finite-sample properties |
| Sequential analysis | Requires stopping rules | Natural updating framework | Flexible data collection |

Frequently Asked Questions

Common questions about Bayesian statistics and inference

What is the fundamental difference between Bayesian and Frequentist statistics?
The core difference lies in how they treat parameters: Bayesian statistics treats parameters θ as random variables with probability distributions, allowing probability statements about parameters (e.g., "P(θ > 0.5 | data) = 0.8"). Frequentist statistics treats parameters as fixed unknown constants, with probability statements only about data (e.g., "95% of such intervals will contain the true θ"). Bayesian naturally incorporates prior information, while frequentist relies solely on current sample data.
Key Point: Parameter randomness: Bayesian (random) vs Frequentist (fixed)
What is a Prior Distribution and how to choose it?
A prior distribution π(θ) represents our knowledge or belief about the parameter before observing data. Common choices include: (1) informative priors, based on historical data or expert knowledge; (2) weakly informative priors, which constrain θ to reasonable ranges; (3) non-informative priors, such as the uniform or Jeffreys prior, with minimal influence on the posterior. The choice should balance objectivity with incorporating genuine prior knowledge.
\pi(\theta|x) \propto L(x|\theta)\,\pi(\theta)
Why are Conjugate Priors so important?
Conjugate priors make the posterior distribution belong to the same family as the prior, yielding closed-form analytical solutions. This greatly simplifies computation, avoiding complex numerical integration. Classic examples: Beta-Binomial, Gamma-Poisson, Normal-Normal. Conjugate priors remain valuable for understanding Bayesian updating mechanisms and quick approximate inference.
Key Point: Analytical tractability + Intuitive interpretation
What is the difference between Credible Intervals and Confidence Intervals?
Credible intervals are Bayesian concepts: "P(θ ∈ [a,b] | data) = 0.95" means "given observed data, there's 95% probability θ is in this interval." Confidence intervals are frequentist: "95% confidence interval" means "95% of such intervals constructed from repeated samples will contain the true θ." Credible intervals allow direct probability statements about parameters, more aligned with intuitive interpretation.
Key Point: Direct probability interpretation
What is MCMC and why is it crucial in Bayesian analysis?
MCMC (Markov Chain Monte Carlo) generates samples from the posterior distribution π(θ|x) through random walks. When analytical solutions are unavailable (non-conjugate priors, complex models), MCMC enables approximate inference. Common algorithms: Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC). Modern tools like Stan, PyMC, JAGS make MCMC accessible.
Key Point: Enables complex Bayesian inference