5-7 Hours

Point Estimation Theory

Master parameter estimation methods and their optimality properties

Learning Objectives
What you'll master in point estimation theory
  • Master fundamental concepts of point estimation theory and evaluation criteria
  • Understand Method of Moments, Maximum Likelihood Estimation methods
  • Learn Uniformly Minimum Variance Unbiased Estimators (UMVUE) construction
  • Apply Cramér-Rao inequality and Fisher information in efficiency analysis
  • Analyze estimator properties: unbiasedness, efficiency, consistency
  • Solve practical estimation problems in statistical inference

Estimation Methods

Three fundamental approaches to parameter estimation

Method of Moments (MOM)
Equate sample moments to population moments

The Method of Moments estimates parameters by setting sample moments equal to population moments and solving for parameters.

$$\mu_k = E[X^k], \quad a_{n,k} = \frac{1}{n}\sum_{i=1}^n X_i^k$$

Population Moment

$$\mu_k = E[X^k]$$

Sample Moment

$$a_{n,k} = \frac{1}{n}\sum_{i=1}^n X_i^k$$

Estimation Equation

$$\mu_k(\theta) = a_{n,k}$$

Consistency

Consistent under regularity

When to Use:
  • Quick initial estimates when moments are easy to calculate
  • Starting values for iterative MLE algorithms
  • Distributions where likelihood is complex
Example: Exponential Distribution MOM

Problem:

Given a sample from $\text{Exp}(\lambda)$, find the Method of Moments estimator for $\lambda$.

Solution:

  1. Population first moment: $\mu_1 = E[X] = \frac{1}{\lambda}$
  2. Sample first moment: $a_{n,1} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$
  3. Set equal: $\frac{1}{\lambda} = \bar{X}$
  4. Solve for parameter:
    $$\hat{\lambda}_{\text{MOM}} = \frac{1}{\bar{X}}$$

Key Insight:

MOM is intuitive: match observed sample characteristics to theoretical population characteristics. For the exponential distribution, the sample mean estimates $1/\lambda$, so invert it to get $\hat{\lambda}$.
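The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mom_exponential_rate`, the seed, the true rate of 2.0, and the sample size are all assumptions chosen for the demonstration.

```python
import random
from statistics import fmean

def mom_exponential_rate(sample):
    """Method of Moments estimate of the rate: lambda_hat = 1 / sample mean."""
    xbar = fmean(sample)
    if xbar <= 0:
        raise ValueError("sample mean must be positive for an exponential model")
    return 1.0 / xbar

# Hypothetical check: simulate data with a known rate and try to recover it.
rng = random.Random(0)          # fixed seed so the run is reproducible
true_rate = 2.0                 # assumed value, for illustration only
sample = [rng.expovariate(true_rate) for _ in range(10_000)]
rate_hat = mom_exponential_rate(sample)
print(f"MOM estimate: {rate_hat:.3f} (true rate {true_rate})")
```

With a large simulated sample the estimate should land close to the assumed rate, which is the consistency property in action.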

Maximum Likelihood Estimation (MLE)
Find parameters that maximize the probability of observed data

MLE finds the parameter value that makes the observed data most likely. It's the gold standard for point estimation due to optimal asymptotic properties.

$$L(\theta; x) = \prod_{i=1}^n f(x_i; \theta), \quad \ell(\theta) = \log L(\theta)$$

Likelihood Function

$$L(\theta) = \prod_{i=1}^n f(x_i;\theta)$$

Log-Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i;\theta)$$

Score Function

$$S(\theta) = \frac{\partial \ell}{\partial \theta}, \quad \text{with the MLE solving } S(\hat{\theta}) = 0$$

Invariance

$g(\hat{\theta})$ is MLE of $g(\theta)$

Asymptotic Properties:
  • Consistency: $\hat{\theta}_n \xrightarrow{P} \theta_0$
  • Asymptotic Normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$
  • Efficiency: Asymptotically achieves the Cramér-Rao lower bound
Example: Normal Distribution MLE

Problem:

Find the MLE of $\mu$ and $\sigma^2$ for a sample from $N(\mu, \sigma^2)$.

Solution:

  1. Log-likelihood:
    $$\ell = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i-\mu)^2$$
  2. Differentiate w.r.t. $\mu$: $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum(x_i - \mu) = 0$
  3. Solve: $\hat{\mu} = \bar{X}$
  4. Differentiate w.r.t. $\sigma^2$ and solve: $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$

Key Insight:

The MLE for $\mu$ is unbiased, but $\hat{\sigma}^2$ is biased (it uses $n$, not $n-1$). The bias vanishes as $n \to \infty$.
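The closed-form solution is easy to compute directly. The sketch below is illustrative only: the helper names `normal_mle` and `unbiased_variance` and the eight data points are made up for the example.

```python
from statistics import fmean

def normal_mle(sample):
    """Return (mu_hat, sigma2_hat): the MLEs for a normal sample (divisor n)."""
    n = len(sample)
    mu_hat = fmean(sample)
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

def unbiased_variance(sample):
    """Sample variance with divisor n - 1, for comparison with the MLE."""
    n = len(sample)
    mu_hat = fmean(sample)
    return sum((x - mu_hat) ** 2 for x in sample) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up illustrative data
mu_hat, sigma2_hat = normal_mle(data)
s2 = unbiased_variance(data)
print(mu_hat, sigma2_hat, s2)
```

For this data set the squared deviations sum to 32, so the MLE of the variance is 32/8 while the unbiased version is 32/7. The two differ by exactly the factor $(n-1)/n$.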

Evaluation Criteria

How to judge the quality of estimators

Key Properties of Estimators

Unbiasedness

$$E[\hat{\theta}] = \theta$$

On average, estimator equals true value

Efficiency

$$\text{Var}(\hat{\theta}) \text{ is minimal}$$

Smallest variance among unbiased estimators

Consistency

$$\hat{\theta}_n \xrightarrow{P} \theta$$

Converges to true value as $n \to \infty$

Mean Squared Error

$$\text{MSE} = E[(\hat{\theta} - \theta)^2]$$

Combines bias and variance

Bias-Variance Decomposition:

$$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$
Example: Sample Variance Bias

Problem:

Compare $S_n^2 = \frac{1}{n}\sum(X_i-\bar{X})^2$ vs $S^2 = \frac{1}{n-1}\sum(X_i-\bar{X})^2$ for estimating $\sigma^2$.

Analysis:

  1. MLE estimator $S_n^2$: $E[S_n^2] = \frac{n-1}{n}\sigma^2$ (biased)
  2. Unbiased estimator $S^2$: $E[S^2] = \sigma^2$
  3. Bias of $S_n^2$: $\text{Bias} = -\frac{\sigma^2}{n} \to 0$ as $n \to \infty$
  4. The variances differ only by a constant factor: $\text{Var}(S_n^2) = \left(\frac{n-1}{n}\right)^2 \text{Var}(S^2)$
  5. For large $n$, the difference is negligible

Key Insight:

The MLE may be biased in finite samples yet asymptotically unbiased. Use $n-1$ when exact unbiasedness matters and $n$ when the MLE is wanted; both estimators are consistent.
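To see the bias-variance decomposition at work on these two estimators, the sketch below evaluates it exactly. It assumes normally distributed data, for which $\text{Var}(S^2) = 2\sigma^4/(n-1)$. That normality assumption and the function name `mse_terms` are additions for illustration and are not part of the example above.

```python
from fractions import Fraction

def mse_terms(n, sigma2=Fraction(1)):
    """Exact (variance, bias, MSE) of both variance estimators for normal data."""
    var_s2 = 2 * sigma2**2 / (n - 1)              # Var(S^2), divisor n - 1
    var_sn2 = Fraction(n - 1, n) ** 2 * var_s2    # Var(S_n^2), divisor n
    bias_sn2 = -sigma2 / n                        # Bias(S_n^2); S^2 has bias 0
    return {
        "S2": (var_s2, Fraction(0), var_s2),
        "Sn2": (var_sn2, bias_sn2, var_sn2 + bias_sn2**2),
    }

for n in (5, 20, 100):
    terms = mse_terms(n)
    print(n, float(terms["S2"][2]), float(terms["Sn2"][2]))
```

Under this assumption the biased estimator has the smaller MSE at every sample size, because its lower variance more than offsets the squared bias. The gap shrinks as $n$ grows.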

Cramér-Rao Lower Bound

Fundamental limit on estimator variance

The CRLB Theorem

For any unbiased estimator $\hat{\theta}$ of $\theta$, the variance satisfies:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$

where $I(\theta)$ is the Fisher information

Fisher Information

$$I(\theta) = E\left[\left(\frac{\partial \log f}{\partial \theta}\right)^2\right]$$

Alternative Form

$$I(\theta) = -E\left[\frac{\partial^2 \log f}{\partial \theta^2}\right]$$

Efficiency

$$e(\hat{\theta}) = \frac{1/(nI(\theta))}{\text{Var}(\hat{\theta})}$$

Efficient Estimator

Achieves CRLB: $e = 1$

Example: Fisher Information for Exponential

Problem:

Find the Fisher information and CRLB for $\lambda$ in $\text{Exp}(\lambda)$.

Solution:

  1. PDF: $f(x;\lambda) = \lambda e^{-\lambda x}$
  2. Log-density: $\log f = \log\lambda - \lambda x$
  3. Score: $\frac{\partial \log f}{\partial \lambda} = \frac{1}{\lambda} - x$
  4. Fisher information:
    $$I(\lambda) = E\left[\left(\frac{1}{\lambda} - X\right)^2\right] = \text{Var}(X) = \frac{1}{\lambda^2}$$
  5. CRLB: $\text{Var}(\hat{\lambda}) \geq \frac{\lambda^2}{n}$

Key Insight:

The estimator $\hat{\lambda} = 1/\bar{X}$ is biased in finite samples, so the CRLB for unbiased estimators does not apply to it directly. Its variance approaches $\lambda^2/n$ as $n \to \infty$, so it is asymptotically efficient. By contrast, $\bar{X}$ is unbiased for the mean $1/\lambda$ and attains the corresponding bound $1/(n\lambda^2)$ exactly.
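The Fisher information integral in step 4 can also be checked numerically. The sketch below uses a plain midpoint rule. The function name, the integration cutoff, the rate 2.0, and the sample size 50 are assumptions chosen for illustration.

```python
import math

def fisher_information_exponential(rate, steps=200_000, upper_multiple=60.0):
    """Approximate E[(1/rate - X)^2] under Exp(rate) with the midpoint rule."""
    upper = upper_multiple / rate          # tail mass beyond this point is negligible
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        score = 1.0 / rate - x             # derivative of log f(x; rate) w.r.t. rate
        density = rate * math.exp(-rate * x)
        total += score * score * density * width
    return total

rate = 2.0   # assumed value, for illustration
info = fisher_information_exponential(rate)
n = 50       # assumed sample size
crlb = 1.0 / (n * info)
print(info, crlb)
```

The numerical value of the information should agree with $1/\lambda^2$, and the resulting bound with $\lambda^2/n$, up to integration error.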

Rigorous Theorem Proofs

Step-by-step mathematical derivations of fundamental estimation theorems

Proof: Cramér-Rao Inequality
For unbiased estimators, variance is bounded below by Fisher information

Theorem Statement:

Let $\hat{\theta}(X_1,\ldots,X_n)$ be an unbiased estimator of $\theta$. Under regularity conditions, the variance satisfies:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$

where $I(\theta) = E\left[\left(\frac{\partial \log f}{\partial \theta}\right)^2\right]$ is the Fisher information.

Proof:

  1. Step 1: Define the score function $S(\theta) = \frac{\partial}{\partial \theta} \log L(\theta)$
    $$S(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \theta)$$
    Note: $E[S(\theta)] = 0$ (score has zero mean)
  2. Step 2: Since $\hat{\theta}$ is unbiased, $E[\hat{\theta}] = \int \hat{\theta}(x) L(\theta;x) dx = \theta$. Differentiate both sides with respect to $\theta$:
    $$\frac{\partial}{\partial \theta} \int \hat{\theta}(x) L(\theta;x) dx = 1$$
    $$\int \hat{\theta}(x) \frac{\partial L}{\partial \theta} dx = 1$$
    $$\int \hat{\theta}(x) \frac{\partial \log L}{\partial \theta} L(\theta;x) dx = 1$$
    Therefore: $E[\hat{\theta} \cdot S(\theta)] = 1$
  3. Step 3: Apply Cauchy-Schwarz inequality:
    $$[E[\hat{\theta} \cdot S]]^2 \leq E[\hat{\theta}^2] \cdot E[S^2]$$
    Substituting $E[\hat{\theta} \cdot S] = 1$:
    $$1 \leq E[\hat{\theta}^2] \cdot E[S^2]$$
  4. Step 4: Since $E[\hat{\theta}] = \theta$:
    $$E[\hat{\theta}^2] = \text{Var}(\hat{\theta}) + \theta^2$$
    And $E[S^2] = nI(\theta)$ (Fisher information for $n$ observations). The raw-moment bound from Step 3 therefore carries an extra $\theta^2$ term and is weaker than the claimed inequality; centering $\hat{\theta}$ removes that term.
  5. Step 5: Because $E[S] = 0$, we can work with $\text{Cov}(\hat{\theta}, S)$ instead:
    $$1 = E[\hat{\theta} \cdot S] - E[\hat{\theta}]E[S] = \text{Cov}(\hat{\theta}, S)$$
    By Cauchy-Schwarz for covariance, with $\text{Var}(S) = E[S^2] = nI(\theta)$:
    $$[\text{Cov}(\hat{\theta}, S)]^2 \leq \text{Var}(\hat{\theta}) \cdot \text{Var}(S)$$
    $$1 \leq \text{Var}(\hat{\theta}) \cdot nI(\theta)$$
  6. Step 6: Rearranging gives the Cramér-Rao bound:
    $$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)} \quad \blacksquare$$

Regularity Conditions:

  • Support of $f(x;\theta)$ does not depend on $\theta$
  • Can interchange differentiation and integration
  • Fisher information $I(\theta) > 0$ and finite
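The three ingredients of the proof ($E[S] = 0$, $\text{Cov}(\hat{\theta}, S) = 1$, and the Cauchy-Schwarz step) can be verified exactly in a small discrete model. The sketch below assumes a Bernoulli($p$) sample with the sample proportion as the unbiased estimator. The model, the function name, and the values $p = 0.3$, $n = 5$ are illustrative choices and do not appear in the proof.

```python
from itertools import product

def cramer_rao_pieces(p, n):
    """Enumerate all Bernoulli(p) samples of size n and return exact moments."""
    mean_score = cov = var_est = var_score = 0.0
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = p**k * (1 - p) ** (n - k)          # probability of this sample
        estimate = k / n                          # unbiased estimator of p
        score = (k - n * p) / (p * (1 - p))       # d/dp of the log-likelihood
        mean_score += prob * score
        cov += prob * (estimate - p) * score      # covariance, since E[score] = 0
        var_est += prob * (estimate - p) ** 2
        var_score += prob * score**2
    return mean_score, cov, var_est, var_score

mean_score, cov, var_est, var_score = cramer_rao_pieces(p=0.3, n=5)
print(mean_score, cov, var_est * var_score)
```

In this model the product $\text{Var}(\hat{\theta}) \cdot \text{Var}(S)$ equals 1 up to rounding, so the Cauchy-Schwarz step holds with equality and the sample proportion attains the bound.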
Detailed Example: Poisson MLE with Complete Derivation

Problem:

Given $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$ i.i.d., find the MLE of $\lambda$ and verify it achieves the CRLB.

Solution:

  1. Likelihood function:
    $$L(\lambda) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum x_i} e^{-n\lambda}}{\prod x_i!}$$
  2. Log-likelihood:
    $$\ell(\lambda) = \log L(\lambda) = \sum_{i=1}^n x_i \log \lambda - n\lambda - \sum \log(x_i!)$$
    $$= \left(\sum x_i\right) \log \lambda - n\lambda - \text{const}$$
  3. Score function (first derivative):
    $$S(\lambda) = \frac{\partial \ell}{\partial \lambda} = \frac{\sum x_i}{\lambda} - n$$
  4. Set score to zero:
    $$\frac{\sum x_i}{\hat{\lambda}} - n = 0$$
    $$\hat{\lambda} = \frac{\sum x_i}{n} = \bar{X}$$
  5. Verify second derivative (maximum):
    $$\frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{\sum x_i}{\lambda^2} < 0 \quad \text{(whenever } \textstyle\sum x_i > 0\text{)}$$
    Therefore $\hat{\lambda} = \bar{X}$ is indeed a maximum.
  6. Fisher Information:
    $$I(\lambda) = -E\left[\frac{\partial^2 \log f}{\partial \lambda^2}\right] = -E\left[-\frac{X}{\lambda^2}\right] = \frac{E[X]}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$$
    For $n$ observations: $nI(\lambda) = \frac{n}{\lambda}$
  7. CRLB:
    $$\text{Var}(\hat{\lambda}) \geq \frac{1}{nI(\lambda)} = \frac{\lambda}{n}$$
  8. Actual variance of MLE:
    $$\text{Var}(\bar{X}) = \frac{\text{Var}(X_1)}{n} = \frac{\lambda}{n}$$
    The MLE achieves the CRLB exactly (efficient estimator)!

Key Insight:

The sample mean $\bar{X}$ is the MLE for Poisson $\lambda$, and it is efficient (achieves the CRLB). In this model the MLE attains the theoretical lower bound on variance at every sample size; in general, the MLE attains it only asymptotically.
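When no closed form is available, the score equation is usually solved iteratively. As a sketch of that idea, the code below applies Newton's method to the Poisson score and compares the result with the sample mean. The function name, the starting value, and the counts are hypothetical. Newton's method is just one possible root-finder here, and it needs a starting value between 0 and twice the sample mean to stay positive.

```python
def poisson_mle_newton(counts, start=1.0, tol=1e-12, max_iter=100):
    """Solve the Poisson score equation sum(x)/lam - n = 0 by Newton's method."""
    n, total = len(counts), sum(counts)
    if total == 0:
        return 0.0                      # score is -n < 0 everywhere; MLE sits at the boundary
    lam = start                         # should lie in (0, 2 * mean) for convergence
    for _ in range(max_iter):
        score = total / lam - n
        score_slope = -total / lam**2   # second derivative of the log-likelihood
        step = score / score_slope
        lam -= step
        if abs(step) < tol:
            break
    return lam

counts = [3, 0, 2, 4, 1, 2, 5, 3]       # made-up illustrative counts
lam_hat = poisson_mle_newton(counts)
print(lam_hat, sum(counts) / len(counts))
```

The iteration should converge to the sample mean, matching the closed-form solution in step 4.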

Proof: MLE Asymptotic Normality
Under regularity conditions, MLE is asymptotically normal with optimal variance

Theorem Statement:

Let $\hat{\theta}_n$ be the MLE of $\theta$ based on $n$ i.i.d. observations. Under regularity conditions:

$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$$

where $I(\theta)$ is the Fisher information and $\theta_0$ is the true parameter value.

Proof:

  1. Step 1 (Score Equation): The MLE $\hat{\theta}_n$ satisfies the score equation:
    $$S_n(\hat{\theta}_n) = \sum_{i=1}^n \frac{\partial \log f(X_i; \hat{\theta}_n)}{\partial \theta} = 0$$
  2. Step 2 (Taylor Expansion): Expand the score around the true value $\theta_0$:
    $$S_n(\hat{\theta}_n) = S_n(\theta_0) + S_n'(\theta^*)(\hat{\theta}_n - \theta_0)$$
    where $\theta^*$ lies between $\hat{\theta}_n$ and $\theta_0$, and:
    $$S_n'(\theta) = \sum_{i=1}^n \frac{\partial^2 \log f(X_i; \theta)}{\partial \theta^2}$$
  3. Step 3 (Rearrange for MLE): Since $S_n(\hat{\theta}_n) = 0$:
    $$0 = S_n(\theta_0) + S_n'(\theta^*)(\hat{\theta}_n - \theta_0)$$
    Solving for $\hat{\theta}_n - \theta_0$:
    $$\hat{\theta}_n - \theta_0 = -\frac{S_n(\theta_0)}{S_n'(\theta^*)}$$
  4. Step 4 (Normalize Both Sides): Multiply by $\sqrt{n}$:
    $$\sqrt{n}(\hat{\theta}_n - \theta_0) = -\frac{S_n(\theta_0)/\sqrt{n}}{S_n'(\theta^*)/n}$$
  5. Step 5 (Apply CLT to Numerator): By the Central Limit Theorem, the score at $\theta_0$ satisfies:
    $$\frac{S_n(\theta_0)}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial \log f(X_i; \theta_0)}{\partial \theta} \xrightarrow{d} N(0, I(\theta_0))$$
    since, at $\theta = \theta_0$, $E[\partial \log f/\partial \theta] = 0$ and $\text{Var}(\partial \log f/\partial \theta) = I(\theta_0)$.
  6. Step 6 (Apply LLN to Denominator): By the Law of Large Numbers:
    $$\frac{S_n'(\theta^*)}{n} = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(X_i; \theta^*)}{\partial \theta^2}$$
    By consistency of the MLE, $\theta^* \xrightarrow{P} \theta_0$, so:
    $$\frac{S_n'(\theta^*)}{n} \xrightarrow{P} E\left[\frac{\partial^2 \log f(X; \theta_0)}{\partial \theta^2}\right] = -I(\theta_0)$$
  7. Step 7 (Combine via Slutsky's Theorem): By Slutsky's theorem:
    $$\sqrt{n}(\hat{\theta}_n - \theta_0) = -\frac{S_n(\theta_0)/\sqrt{n}}{S_n'(\theta^*)/n} \xrightarrow{d} -\frac{N(0, I(\theta_0))}{-I(\theta_0)}$$
  8. Step 8 (Conclude Asymptotic Normality): Simplifying the limiting distribution:
    $$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\left(0, \frac{I(\theta_0)}{I^2(\theta_0)}\right) = N(0, I^{-1}(\theta_0)) \quad \blacksquare$$
    This shows the MLE achieves the Cramér-Rao lower bound asymptotically.

Regularity Conditions:

  • True parameter $\theta_0$ is an interior point of parameter space
  • Likelihood is three times differentiable with respect to $\theta$
  • Fisher information $0 < I(\theta) < \infty$ for all $\theta$
  • Interchange of differentiation and integration is valid
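A small simulation can make the theorem concrete. The sketch below assumes an exponential model, where the MLE is $1/\bar{X}$ and $I(\lambda) = 1/\lambda^2$, and checks that the variance of $\sqrt{n}(\hat{\lambda}_n - \lambda_0)$ is close to $I^{-1}(\lambda_0) = \lambda_0^2$. The model, the function name, the seed, and the values $n = 200$ and 4000 replications are assumptions made for illustration.

```python
import math
import random
from statistics import fmean, pvariance

def scaled_mle_errors(rate, n, reps, seed=0):
    """Simulate sqrt(n) * (rate_hat - rate) for the exponential MLE 1 / X-bar."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    errors = []
    for _ in range(reps):
        sample_mean = fmean([rng.expovariate(rate) for _ in range(n)])
        errors.append(math.sqrt(n) * (1.0 / sample_mean - rate))
    return errors

rate = 2.0                              # assumed true parameter
errors = scaled_mle_errors(rate, n=200, reps=4000)
print("simulated variance:", pvariance(errors))
print("theoretical I^{-1}:", rate**2)   # I(rate) = 1 / rate^2 for Exp(rate)
```

The simulated variance should sit near the theoretical value, with a small upward offset because the finite-sample variance of $1/\bar{X}$ slightly exceeds its asymptotic limit.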
Rao-Blackwell Theorem
Improving estimators using sufficient statistics

Theorem Statement:

Let $\hat{\theta}$ be an unbiased estimator and $T$ a sufficient statistic. Define:

$$\hat{\theta}^* = E[\hat{\theta} \mid T]$$

Then $\hat{\theta}^*$ is also unbiased and $\text{Var}(\hat{\theta}^*) \leq \text{Var}(\hat{\theta})$.

Proof:

  1. Step 1 (Verify Unbiasedness): We first show that $\hat{\theta}^* = E[\hat{\theta} \mid T]$ is unbiased. Using the tower property of conditional expectation:
    $$E[\hat{\theta}^*] = E[E[\hat{\theta} \mid T]]$$
    By the law of iterated expectations:
    $$E[E[\hat{\theta} \mid T]] = E[\hat{\theta}]$$
    Since $\hat{\theta}$ is unbiased for $\theta$, we have $E[\hat{\theta}] = \theta$, thus:
    $$E[\hat{\theta}^*] = \theta$$
  2. Step 2 (Law of Total Variance): Recall the variance decomposition formula:
    $$\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y])$$
    Applying this to $\hat{\theta}$ conditioned on $T$:
    $$\text{Var}(\hat{\theta}) = E[\text{Var}(\hat{\theta} \mid T)] + \text{Var}(E[\hat{\theta} \mid T])$$
  3. Step 3 (Substitute Improved Estimator): Recognize that by definition:
    $$E[\hat{\theta} \mid T] = \hat{\theta}^*$$
    Substituting into the variance decomposition:
    $$\text{Var}(\hat{\theta}) = E[\text{Var}(\hat{\theta} \mid T)] + \text{Var}(\hat{\theta}^*)$$
  4. Step 4 (Non-negativity of Conditional Variance): By fundamental properties of variance, conditional variance is always non-negative:
    $$\text{Var}(\hat{\theta} \mid T) \geq 0 \quad \text{for all } T$$
    Taking expectations on both sides:
    $$E[\text{Var}(\hat{\theta} \mid T)] \geq 0$$
  5. Step 5 (Derive Variance Inequality): From Step 3, rearrange to isolate $\text{Var}(\hat{\theta}^*)$:
    $$\text{Var}(\hat{\theta}^*) = \text{Var}(\hat{\theta}) - E[\text{Var}(\hat{\theta} \mid T)]$$
    Since $E[\text{Var}(\hat{\theta} \mid T)] \geq 0$ from Step 4:
    $$\text{Var}(\hat{\theta}^*) \leq \text{Var}(\hat{\theta})$$
  6. Step 6 (Characterize Equality): Equality holds when:
    $$E[\text{Var}(\hat{\theta} \mid T)] = 0$$
    Since $\text{Var}(\hat{\theta} \mid T) \geq 0$, this requires:
    $$\text{Var}(\hat{\theta} \mid T) = 0 \quad \text{almost surely}$$
  7. Step 7 (Zero Variance Implies Constant): A random variable with zero conditional variance is constant (given the conditioning variable):
    $$\text{Var}(\hat{\theta} \mid T) = 0 \quad \Rightarrow \quad \hat{\theta} = E[\hat{\theta} \mid T] = \hat{\theta}^*$$
    This means $\hat{\theta}$ is already a function of the sufficient statistic $T$ alone.
  8. Step 8 (Conclusion): We have proven:
    $$E[\hat{\theta}^*] = \theta \quad \text{and} \quad \text{Var}(\hat{\theta}^*) \leq \text{Var}(\hat{\theta})$$
    with equality if and only if $\hat{\theta}$ is already based on $T$ alone. $\quad \blacksquare$

Practical Use:

Start with any unbiased estimator $\hat{\theta}$, condition on a sufficient statistic $T$ to get $\hat{\theta}^*$ with lower (or equal) variance. This process is called Rao-Blackwellization.
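Rao-Blackwellization can be checked exactly in a small discrete model. The sketch below assumes a Bernoulli($p$) sample, takes the crude unbiased estimator $X_1$, and conditions it on the sufficient statistic $T = \sum X_i$. The model, the function name, and the values $p = 0.3$, $n = 6$ are illustrative assumptions.

```python
from itertools import product

def rao_blackwell_bernoulli(p, n):
    """Exact variances of X_1 and of E[X_1 | T], where T is the sample total."""
    by_total = {}     # t -> [P(T = t), sum of P(sample) * X_1 over samples with T = t]
    for xs in product((0, 1), repeat=n):
        t = sum(xs)
        prob = p**t * (1 - p) ** (n - t)
        entry = by_total.setdefault(t, [0.0, 0.0])
        entry[0] += prob
        entry[1] += prob * xs[0]
    var_crude = p * (1 - p)                       # Var(X_1)
    var_improved = 0.0
    for prob_t, weighted_x1 in by_total.values():
        conditional_mean = weighted_x1 / prob_t   # E[X_1 | T = t], which equals t / n
        var_improved += prob_t * (conditional_mean - p) ** 2
    return var_crude, var_improved

var_crude, var_improved = rao_blackwell_bernoulli(p=0.3, n=6)
print(var_crude, var_improved)
```

Conditioning turns $X_1$ into the sample proportion $T/n$, and the variance drops from $p(1-p)$ to $p(1-p)/n$.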

Example: Improving Estimator via Rao-Blackwell

Problem:

For $X_1, \ldots, X_n \sim \text{Exponential}(\lambda)$ with $n \geq 3$, start with $\hat{\lambda}_1 = 1/(X_1 + X_2)$ (unbiased). Use Rao-Blackwell to improve it with sufficient statistic $T = \sum X_i$. Note that $1/X_1$ cannot serve as the starting point: $E[1/X_1] = \int_0^\infty \frac{1}{x} \lambda e^{-\lambda x} dx$ diverges, so $1/X_1$ is not unbiased.

Solution:

  1. Verify unbiasedness of initial estimator: $X_1 + X_2 \sim \Gamma(2, \lambda)$, so
    $$E\left[\frac{1}{X_1 + X_2}\right] = \int_0^\infty \frac{1}{y} \lambda^2 y e^{-\lambda y} dy = \lambda$$
  2. Identify sufficient statistic: $T = \sum_{i=1}^n X_i \sim \Gamma(n, \lambda)$
  3. Apply Rao-Blackwell:
    $$\hat{\lambda}^* = E\left[\frac{1}{X_1 + X_2} \mid T\right]$$
    Write $X_1 + X_2 = B \cdot T$ with $B = (X_1 + X_2)/T$. For exponential samples, $B \sim \text{Beta}(2, n-2)$ and $B$ is independent of $T$.
  4. Use independence:
    $$E\left[\frac{1}{X_1 + X_2} \mid T\right] = \frac{1}{T} E\left[\frac{1}{B}\right] = \frac{n-1}{T}$$
    since $E[1/B] = n - 1$ for $B \sim \text{Beta}(2, n-2)$.
  5. The improved estimator is:
    $$\hat{\lambda}^* = \frac{n-1}{T} = \frac{n-1}{\sum X_i} = \frac{n-1}{n} \cdot \frac{1}{\bar{X}}$$
    This is the MLE $1/\bar{X}$ rescaled by $(n-1)/n$ to remove its bias.
  6. Variance comparison:
    $$\text{Var}\left(\frac{1}{X_1 + X_2}\right) = \infty \quad \text{(infinite variance!)}$$
    $$\text{Var}\left(\frac{n-1}{T}\right) = \frac{\lambda^2}{n-2} \quad \text{(finite, slightly above the CRLB } \lambda^2/n\text{)}$$

Key Insight:

Rao-Blackwell transforms a crude unbiased estimator (with infinite variance!) into the bias-corrected MLE, whose variance is finite and close to the CRLB. Conditioning an unbiased estimator on a sufficient statistic never increases its variance.

Proof: Lehmann-Scheffé Theorem
Completeness + Sufficiency + Unbiasedness yields unique UMVUE

Theorem Statement:

Let $T$ be a complete sufficient statistic for $\theta$. If $\hat{\theta} = g(T)$ is an unbiased estimator based solely on $T$, then $\hat{\theta}$ is the unique UMVUE (Uniformly Minimum Variance Unbiased Estimator) of $\theta$.

Proof:

  1. Step 1 (Strategy): Suppose $\tilde{\theta}$ is any other unbiased estimator of $\theta$. We will show that $\text{Var}(\hat{\theta}) \leq \text{Var}(\tilde{\theta})$ with equality only when $\tilde{\theta} = \hat{\theta}$ almost surely.
  2. Step 2 (Apply Rao-Blackwell): By Rao-Blackwell theorem, define:
    $$\tilde{\theta}^* = E[\tilde{\theta} \mid T]$$
    Then $\tilde{\theta}^*$ is also unbiased and:
    $$\text{Var}(\tilde{\theta}^*) \leq \text{Var}(\tilde{\theta})$$
  3. Step 3 (Function of Sufficient Statistic): Since $\tilde{\theta}^* = E[\tilde{\theta} \mid T]$, it is a function of $T$ alone, say:
    $$\tilde{\theta}^* = h(T)$$
    for some function $h$.
  4. Step 4 (Both are Unbiased Functions of T): We now have two unbiased estimators based on $T$:
    $$E[\hat{\theta}] = E[g(T)] = \theta$$
    $$E[\tilde{\theta}^*] = E[h(T)] = \theta$$
  5. Step 5 (Use Completeness): Consider their difference:
    $$E[\hat{\theta} - \tilde{\theta}^*] = E[g(T) - h(T)] = \theta - \theta = 0$$
    Since $T$ is complete and $g(T) - h(T)$ is a function of $T$ with expectation zero for every $\theta$:
    $$P(g(T) - h(T) = 0) = 1$$
    Therefore: $\hat{\theta} = \tilde{\theta}^*$ almost surely.
  6. Step 6 (Conclude Minimum Variance): Since $\hat{\theta} = \tilde{\theta}^*$ and $\text{Var}(\tilde{\theta}^*) \leq \text{Var}(\tilde{\theta})$:
    $$\text{Var}(\hat{\theta}) = \text{Var}(\tilde{\theta}^*) \leq \text{Var}(\tilde{\theta})$$
    This holds for any unbiased estimator $\tilde{\theta}$, so $\hat{\theta}$ has minimum variance among all unbiased estimators.
  7. Step 7 (Uniqueness of UMVUE): If there were another UMVUE $\tilde{\theta}'$, Steps 2-5 show:
    $$E[\tilde{\theta}' \mid T] = \hat{\theta}$$
    Because $\text{Var}(\tilde{\theta}') = \text{Var}(\hat{\theta})$, equality holds in Rao-Blackwell, which forces $\tilde{\theta}' = E[\tilde{\theta}' \mid T] = \hat{\theta}$ almost surely. Thus the UMVUE is unique. $\quad \blacksquare$

Key Concepts:

  • Completeness: $E[g(T)] = 0 \text{ for all } \theta \Rightarrow P(g(T) = 0) = 1$
  • Sufficiency: $T$ contains all information about $\theta$
  • UMVUE Recipe: Find complete sufficient statistic $\to$ Find unbiased function of it
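As one illustration of the recipe, consider estimating $e^{-\lambda} = P(X = 0)$ from a Poisson sample. This example is an addition and is not taken from the proof. It uses the standard results that $T = \sum X_i \sim \text{Poisson}(n\lambda)$ is complete and sufficient, and that $\left(\frac{n-1}{n}\right)^T$ is an unbiased function of $T$. The sketch below checks the unbiasedness numerically; the function name and the values $\lambda = 1.5$, $n = 10$ are assumptions.

```python
import math

def expected_value_of_umvue(lam, n, terms=200):
    """E[((n-1)/n)^T] for T ~ Poisson(n * lam), summed term by term."""
    mean_t = n * lam
    pmf = math.exp(-mean_t)            # P(T = 0)
    total = 0.0
    for t in range(terms):
        total += pmf * ((n - 1) / n) ** t
        pmf *= mean_t / (t + 1)        # recurrence: P(T = t + 1) from P(T = t)
    return total

lam, n = 1.5, 10                       # assumed values, for illustration
print(expected_value_of_umvue(lam, n), math.exp(-lam))
```

The two printed numbers should agree up to truncation and rounding error, so by Lehmann-Scheffé this function of $T$ is the UMVUE of $e^{-\lambda}$.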

Frequently Asked Questions

Common questions about point estimation

When should I use MLE vs. Method of Moments?

Use MLE when you need optimal asymptotic properties and can compute the likelihood. Use Method of Moments for quick estimates, complex likelihoods, or as starting values for iterative MLE. MLE is generally preferred for its efficiency and invariance property, but MOM is simpler and often provides good initial estimates.

What does it mean for an estimator to be "efficient"?

An unbiased estimator is efficient if it achieves the Cramér-Rao lower bound: $\text{Var}(\hat{\theta}) = 1/(nI(\theta))$. This means no other unbiased estimator has lower variance. MLE is asymptotically efficient under regularity conditions. Efficiency matters because lower variance means more precise estimates from the same sample size.

Why is sample variance divided by n-1 instead of n?

Dividing by $n-1$ makes the estimator unbiased: $E[S^2] = \sigma^2$. We "lose one degree of freedom" because we estimate the mean $\bar{X}$ from the same data. The MLE uses $n$ and is biased, but its bias $-\sigma^2/n$ vanishes for large samples. For small samples, use $n-1$ for unbiasedness.

What's the difference between consistency and unbiasedness?

Unbiasedness ($E[\hat{\theta}] = \theta$) is a finite-sample property: on average across repeated samples of size $n$, the estimate equals the true value. Consistency ($\hat{\theta}_n \xrightarrow{P} \theta$) is an asymptotic property: as $n \to \infty$, the estimate converges to the true value. An estimator can be biased but consistent (like the MLE for $\sigma^2$).

How do I compute Fisher information?

Two equivalent methods: (1) $I(\theta) = E[(\partial \log f/\partial \theta)^2]$ - expected squared score, or (2) $I(\theta) = -E[\partial^2 \log f/\partial \theta^2]$ - negative expected Hessian. Often method (2) is easier. For $n$ i.i.d. observations, total information is $nI(\theta)$. Fisher information measures how much information the data contains about $\theta$.
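As a quick sanity check that the two forms agree, the sketch below evaluates both for a single Bernoulli($p$) observation, where the expectations are two-term sums. The Bernoulli model and the function name are assumptions chosen for brevity.

```python
def bernoulli_information(p):
    """Fisher information of one Bernoulli(p) draw, computed both ways."""
    outcomes = {1: p, 0: 1 - p}                       # x -> P(X = x)

    def score(x):
        return x / p - (1 - x) / (1 - p)              # d/dp of log f(x; p)

    def hessian(x):
        return -x / p**2 - (1 - x) / (1 - p) ** 2     # second derivative of log f

    squared_score_form = sum(prob * score(x) ** 2 for x, prob in outcomes.items())
    hessian_form = -sum(prob * hessian(x) for x, prob in outcomes.items())
    return squared_score_form, hessian_form

print(bernoulli_information(0.3))   # both entries should be close to 1 / (p * (1 - p))
```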

Can a biased estimator ever be better than an unbiased one?

Yes! By the bias-variance tradeoff, a slightly biased estimator with much lower variance can have smaller MSE: $\text{MSE} = \text{Var} + \text{Bias}^2$. Examples include ridge regression and the James-Stein estimator. However, for large samples, consistency becomes more important than finite-sample bias. MLE sacrifices exact unbiasedness for asymptotic optimality.

What is the invariance property of MLE?

If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$. This is powerful: if $\hat{\lambda} = 1/\bar{X}$ is the MLE for exponential $\lambda$, then $1/\hat{\lambda} = \bar{X}$ is the MLE for the mean $1/\lambda$. Method of Moments does not share this property in general.

How large does n need to be for asymptotic properties to hold?

There is no universal rule; it depends on the distribution and the parameter. As rough rules of thumb rather than guarantees: for normal distributions, asymptotics work well even for $n \approx 30$; skewed distributions often need $n > 100$; heavy-tailed distributions may need $n > 200$ or more. Always verify with simulation or use exact finite-sample methods (e.g., t-distribution) when $n$ is small.