
Statistical Inference & Estimation

Master the theoretical foundations of parameter estimation in time series: mean and autocovariance estimation, consistency, asymptotic distributions, and white noise diagnostics.

Estimation Theory
Asymptotic Analysis
White Noise Testing

Mean Estimation Theory

Foundations of parameter estimation and consistency analysis

Core Estimation Problem

Fundamental Principle

For AR, MA, and ARMA models, all parameters are uniquely determined by the autocovariance function. Therefore, the key to parameter identification is accurate estimation of $\gamma_k$.

Step 1

Estimate the sample mean $\bar{X}_N$

Step 2

Estimate the autocovariance $\hat{\gamma}_k$

Step 3

Identify model parameters

Consistency Theorem

Theorem (Consistency of Sample Mean)

For a stationary sequence with $\gamma_m \to 0$ as $m \to \infty$, the sample mean $\bar{X}_N$ is a consistent estimator of the population mean $\mu$.

$$\bar{X}_N = \frac{1}{N}\sum_{t=1}^N X_t \xrightarrow{P} \mu$$

Proof Outline (5 Steps)

1

Mean Square Error Decomposition

$$E(\bar{X}_N - \mu)^2 = \frac{1}{N^2}\sum_{k=1}^N\sum_{j=1}^N \gamma_{|k-j|}$$

2

Index Transformation

Set $m = k - j$ and transform the double sum to $\frac{1}{N^2}\sum_{m=-(N-1)}^{N-1}(N-|m|)\gamma_m$

3

Upper Bound

Use $\frac{N-|m|}{N^2} \leq \frac{1}{N}$ to get the bound $\frac{1}{N}\sum_{m=-(N-1)}^{N-1}|\gamma_m|$

4

Cesàro Convergence

When $\gamma_m \to 0$, the Cesàro average $\frac{1}{N}\sum|\gamma_m| \to 0$

5

Probability Convergence

Apply Chebyshev's inequality: $P(|\bar{X}_N-\mu| > \epsilon) \leq \frac{E(\bar{X}_N-\mu)^2}{\epsilon^2} \to 0$

Strong Consistency

For strictly stationary and ergodic sequences, the sample mean is strongly consistent:

$$\bar{X}_N \xrightarrow{a.s.} \mu \quad (N \to \infty)$$

This is a consequence of the ergodic theorem: time averages converge to ensemble averages.

Central Limit Theorem & Asymptotic Distribution

CLT for Linear Stationary Processes

Theorem Statement

For a linear stationary process $X_t = \mu + \sum_{k=-\infty}^{\infty}\psi_k\epsilon_{t-k}$ with:

  • $\sum \psi_k^2 < \infty$ (square-summable coefficients)
  • Spectral density $f(\lambda)$ continuous at $\lambda = 0$ with $f(0) \neq 0$

the sample mean satisfies

$$\sqrt{N}(\bar{X}_N - \mu) \xrightarrow{d} N(0,\, 2\pi f(0))$$
Asymptotic Variance Calculation

Method 1: Direct computation

$$N\,E(\bar{X}_N-\mu)^2 \to \sum_{m=-\infty}^{\infty}\gamma_m = 2\pi f(0)$$

The limiting variance captures all temporal dependence through the autocovariance sum.

Spectral Representation

Method 2: Using Wold coefficients

$$2\pi f(0) = \sigma^2\left(\sum_{k=-\infty}^{\infty}\psi_k\right)^2$$

This shows the connection between spectral density and the MA(∞) representation.

Practical Applications

Confidence Intervals

$$\mu \in \bar{X}_N \pm 1.96\sqrt{\frac{2\pi f(0)}{N}}$$

Hypothesis Testing

Test $H_0: \mu = \mu_0$ using the normal approximation

Forecast Intervals

Quantify prediction uncertainty
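
A minimal numerical sketch of the confidence-interval construction above, assuming the long-run variance $2\pi f(0)$ is estimated by a truncated sum of sample autocovariances; the simulated AR(1) data, the truncation lag M, and the function name are illustrative assumptions, not part of the original material:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a stationary AR(1) series (an assumption for this demo)
N, phi = 500, 0.5
eps = rng.normal(size=N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = phi * x[t - 1] + eps[t]

def gamma_hat(y, k):
    """Sample autocovariance at lag k with divisor N."""
    n = len(y)
    yc = y - y.mean()
    return np.dot(yc[: n - k], yc[k:]) / n

# Long-run variance 2*pi*f(0) ~ sum over m of gamma_m, truncated at lag M (M is an assumption)
M = 20
lrv = gamma_hat(x, 0) + 2 * sum(gamma_hat(x, k) for k in range(1, M + 1))

xbar = x.mean()
se = np.sqrt(lrv / N)
print(f"mean = {xbar:.4f}, 95% CI = [{xbar - 1.96 * se:.4f}, {xbar + 1.96 * se:.4f}]")
```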

Convergence Speed & Law of Iterated Logarithm

Law of Iterated Logarithm (LIL)

Theoretical Foundation

The LIL provides a more precise characterization of convergence than the CLT. While the CLT gives the rate $O(1/\sqrt{N})$, the LIL gives the exact fluctuation bounds:

$$\text{Convergence rate: } O\left(\sqrt{\frac{2\ln\ln N}{N}}\right)$$

This is a refinement of the CLT: it describes not just the limiting distribution, but the "worst-case" behavior of the sample mean.

LIL Theorem for Linear Stationary Sequences

Conditions

  1. $X_t = \mu + \sum_{k=-\infty}^{\infty}\psi_k\epsilon_{t-k}$ with $\sum_k |\psi_k| < \infty$
  2. $f(\lambda)$ continuous at $\lambda = 0$, and $E|\epsilon_t|^r < \infty$ for some $r > 2$

Result

$$\limsup_{N\to\infty} \sqrt{\frac{N}{2\ln\ln N}}\,(\bar{X}_N - \mu) = \sqrt{2\pi f(0)} \quad a.s.$$

$$\liminf_{N\to\infty} \sqrt{\frac{N}{2\ln\ln N}}\,(\bar{X}_N - \mu) = -\sqrt{2\pi f(0)} \quad a.s.$$

Interpretation

  • The sample mean fluctuates infinitely often between the bounds $\pm\sqrt{2\pi f(0)}\cdot\sqrt{\frac{2\ln\ln N}{N}}$
  • Critical: $\sqrt{N}(\bar{X}_N-\mu)$ itself does not converge almost surely; it oscillates within these precise bounds
  • Consequently, $\bar{X}_N - \mu = O\left(\sqrt{\frac{\ln\ln N}{N}}\right)$ almost surely
Practical Value

The LIL tells us the minimum sample size needed for a given estimation precision:

To achieve error tolerance ϵ\epsilon, we need approximately:

$$N \approx \frac{2\pi f(0) \cdot 2\ln\ln N}{\epsilon^2}$$
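
Because $N$ appears on both sides, this relation can be solved numerically, for example by a simple fixed-point iteration. A small sketch with illustrative values for $2\pi f(0)$ and $\epsilon$ (both assumptions):

```python
import numpy as np

def lil_sample_size(two_pi_f0, eps, n0=100.0, iters=50):
    """Fixed-point iteration for N ~ 2*pi*f(0) * 2*ln(ln N) / eps^2."""
    n = float(n0)
    for _ in range(iters):
        n = two_pi_f0 * 2.0 * np.log(np.log(n)) / eps**2
    return int(np.ceil(n))

# Illustrative values: long-run variance 2*pi*f(0) = 4 and tolerance eps = 0.1
print(lil_sample_size(4.0, 0.1))
```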

AR(2) Mean Calculation & Simulation

AR(2) Model Analysis

Model Definition

Consider an AR(2) model with complex characteristic roots:

$$A(z) = (1-\rho e^{i\theta} z)(1-\rho e^{-i\theta} z)$$

The AR(2) process is:

$$X_t = 2\rho\cos\theta \cdot X_{t-1} - \rho^2 X_{t-2} + \epsilon_t$$

where $\epsilon_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2)$

Sample Mean Relationship

Taking the average of both sides over $t = 1, \ldots, N$:

$$\bar{X}_N \approx \frac{1}{A(1)}\bar{\epsilon}_N = \frac{1}{1-2\rho\cos\theta+\rho^2}\bar{\epsilon}_N$$

Key insight: The sample mean of the AR(2) series is approximately proportional to the white noise sample mean, with proportionality constant $1/A(1)$.

Simulation Results

Two parameter configurations were simulated with M=1000 replications:

Configuration 1

  • $\rho = 1/1.1$ (close to unit root)
  • $\theta = 2.34$
  • Higher variance expected

Configuration 2

  • $\rho = 1/4$ (more stable)
  • $\theta = 2.34$
  • Lower variance expected
| N | 10 | 20 | 40 | 100 | 400 | 1000 |
|---|---|---|---|---|---|---|
| Ave($\bar{X}_N$) | -0.0055 | -0.0032 | -0.0029 | -0.0009 | -0.0008 | 0.0001 |
| Ave($\bar{\epsilon}_N$) | -0.0168 | -0.0135 | -0.0060 | -0.0037 | -0.0024 | 0.0003 |
| Std($\bar{X}_N$) | 0.1922 | 0.1068 | 0.0616 | 0.0347 | 0.0154 | 0.0102 |
| Std($\bar{\epsilon}_N$) | 0.3511 | 0.2351 | 0.1575 | 0.0967 | 0.0464 | 0.0312 |
Theoretical Validation

  • Standard deviations decrease at rate $O(1/\sqrt{N})$, confirming the CLT
  • Std($\bar{\epsilon}_N$) > Std($\bar{X}_N$) shows the AR smoothing effect
  • Sample means converge to 0 (the true mean), validating consistency
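
A sketch of this Monte Carlo experiment, assuming Configuration 2 ($\rho = 1/4$, $\theta = 2.34$) and a burn-in of 200 observations; the text does not state which configuration produced the table, so the exact numbers will differ:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

def simulate_ar2_means(rho, theta, N, M=1000, sigma=1.0, burn=200):
    """Monte Carlo sample means of X_t and eps_t for an AR(2) with complex roots."""
    a1, a2 = 2 * rho * np.cos(theta), -rho**2      # X_t = a1*X_{t-1} + a2*X_{t-2} + eps_t
    xbars, ebars = np.empty(M), np.empty(M)
    for m in range(M):
        eps = rng.normal(0.0, sigma, size=N + burn)
        x = lfilter([1.0], [1.0, -a1, -a2], eps)   # apply the AR(2) recursion
        xbars[m] = x[burn:].mean()
        ebars[m] = eps[burn:].mean()
    return xbars, ebars

# Configuration 2 (rho = 1/4, theta = 2.34) is used here as an illustrative assumption
for N in (10, 20, 40, 100, 400, 1000):
    xb, eb = simulate_ar2_means(rho=0.25, theta=2.34, N=N)
    print(f"N={N:4d}  Ave(Xbar)={xb.mean():8.4f}  Std(Xbar)={xb.std():6.4f}  "
          f"Ave(ebar)={eb.mean():8.4f}  Std(ebar)={eb.std():6.4f}")
```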

Autocovariance Function Estimation

Sample Autocovariance

Basic Definitions

Sample Autocovariance Function

$$\hat{\gamma}_k = \frac{1}{N}\sum_{j=1}^{N-k}(X_j-\bar{X}_N)(X_{j+k}-\bar{X}_N), \quad 0 \leq k \leq N-1$$

Sample Autocorrelation Function

$$\hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0}, \quad |k| \leq N-1$$
Why Divide by N instead of N-k?

Critical Reason: Dividing by N ensures positive definiteness of the sample autocovariance matrix.

  1. Dividing by N-k might seem "more unbiased" for individual lags
  2. However, it can produce non-positive-definite covariance matrices
  3. The N divisor guarantees all eigenvalues ≥ 0, which is essential for valid inference (see the computational sketch below)
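
A minimal implementation of these estimators with the $N$ divisor; the function names and the demo data are illustrative assumptions:

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat_0 .. gamma_hat_max_lag, divisor N (not N - k)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.dot(xc[: n - k], xc[k:]) / n for k in range(max_lag + 1)])

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat_k = gamma_hat_k / gamma_hat_0."""
    g = sample_acvf(x, max_lag)
    return g / g[0]

# Usage on illustrative data
rng = np.random.default_rng(1)
print(sample_acf(rng.normal(size=200), max_lag=5))
```
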
Positive Definiteness Theorem

Theorem

If sample observations $X_1, X_2, \ldots, X_N$ are not all equal, then the sample autocovariance matrix $\hat{\Gamma}_N = (\hat{\gamma}_{k-j})_{k,j=1,\ldots,N}$ is positive definite.

Proof (Constructive)

Step 1: Define $Y_j = X_j - \bar{X}_N$ (centered observations)

Step 2: Construct the matrix $A$ whose rows are successively shifted copies of $(Y_1, \ldots, Y_N)$:

"A=(0Y1Y2YN1YN00Y1YN2YN1000Y1Y20000Y1)" {"A = \begin{pmatrix} 0 & Y_1 & Y_2 & \cdots & Y_{N-1} & Y_N \\ 0 & 0 & Y_1 & \cdots & Y_{N-2} & Y_{N-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & Y_1 & Y_2 \\ 0 & 0 & 0 & \cdots & 0 & Y_1 \end{pmatrix}"}

Step 3: Show that $\hat{\Gamma}_N = \frac{1}{N}AA^T$

Step 4: Since the $Y_j$ are not all zero, $A$ has full rank $N$

Conclusion: $AA^T$ is positive definite, so $\hat{\Gamma}_N$ is positive definite
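
A quick numerical illustration (not a proof) of the theorem, assuming an arbitrary simulated series: it builds $\hat{\Gamma}_N$ from the $N$-divisor autocovariances and inspects its smallest eigenvalue.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
x = rng.normal(size=50)              # illustrative series (not all values equal)
n = len(x)
xc = x - x.mean()

# gamma_hat_k with divisor N, for k = 0 .. N-1
gamma = np.array([np.dot(xc[: n - k], xc[k:]) / n for k in range(n)])
Gamma = toeplitz(gamma)              # the matrix (gamma_hat_{|k-j|})

print(np.linalg.eigvalsh(Gamma).min())   # smallest eigenvalue: positive, as the theorem asserts
```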

Consistency Analysis
Theorem 1: Asymptotic Unbiasedness

For a stationary process:

$$\lim_{N\to\infty} E\hat{\gamma}_k = \gamma_k$$

The estimator is asymptotically unbiased, though it may have finite-sample bias.

Theorem 2: Strong Consistency

If $\{X_t\}$ is a strictly stationary and ergodic sequence:

$$\lim_{N\to\infty} \hat{\gamma}_k = \gamma_k \quad a.s.$$

$$\lim_{N\to\infty} \hat{\rho}_k = \rho_k \quad a.s.$$

Ergodicity enables time averages to replace ensemble averages with probability 1.

Asymptotic Distribution Theory

General Asymptotic Normality

Key Parameters

Fourth Moment

$$\mu_4 = E\epsilon_t^4$$

Normalized excess kurtosis

$$M_0 = \frac{1}{\sigma^2}(\mu_4 - \sigma^4)^{1/2}$$
Asymptotic Distribution

Under the spectral density condition $\int_{-\pi}^{\pi} f(\lambda)^2\, d\lambda < \infty$:

$$\sqrt{N}(\hat{\gamma}_0-\gamma_0,\, \hat{\gamma}_1-\gamma_1,\, \ldots,\, \hat{\gamma}_h-\gamma_h) \xrightarrow{d} (\xi_0, \xi_1, \ldots, \xi_h)$$

$$\sqrt{N}(\hat{\rho}_1-\rho_1,\, \hat{\rho}_2-\rho_2,\, \ldots,\, \hat{\rho}_h-\rho_h) \xrightarrow{d} (R_1, R_2, \ldots, R_h)$$

where $\xi_j$ and $R_j$ are defined through weighted sums of i.i.d. $N(0,1)$ random variables.

MA(q) Specific Case

For $m > q$ in an MA(q) model:

$$\sqrt{N}\hat{\rho}_m \xrightarrow{d} N(0,\, 1+2\rho_1^2+\cdots+2\rho_q^2)$$

This provides the basis for white noise testing and model order selection.
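
A small helper, under the assumptions above, for the widened band used to judge $\hat{\rho}_m$ beyond lag $q$; the MA(2) values in the usage line are illustrative:

```python
import numpy as np

def ma_q_band(rho, N, z=1.96):
    """Approximate 95% band for rho_hat_m, m > q, given the first q autocorrelations of an MA(q)."""
    var = 1.0 + 2.0 * np.sum(np.asarray(rho, dtype=float) ** 2)
    return z * np.sqrt(var / N)

# Illustrative: an MA(2) with rho_1 = 0.4, rho_2 = 0.2, observed with N = 200
print(ma_q_band([0.4, 0.2], N=200))
```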

AR(1) Specific Case

For an AR(1) model with autocorrelations $\rho_m = a^m$:

$$\sqrt{N}(\hat{\rho}_m - \rho_m) \xrightarrow{d} N(0, V_m)$$

$$V_m = \frac{(1+a^2)(1-a^{2m})}{1-a^2} - 2ma^{2m}$$
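
A sketch of this variance formula as a function, useful for approximate standard errors of $\hat{\rho}_m$; the coefficient $a$ and sample size $N$ below are illustrative assumptions:

```python
import numpy as np

def ar1_acf_avar(a, m):
    """Asymptotic variance V_m of sqrt(N) * (rho_hat_m - a^m) for an AR(1) with coefficient a."""
    return (1 + a**2) * (1 - a**(2 * m)) / (1 - a**2) - 2 * m * a**(2 * m)

# Illustrative: approximate standard errors of rho_hat_m for a = 0.7, N = 200
a, N = 0.7, 200
for m in (1, 2, 5, 10):
    print(m, round(np.sqrt(ar1_acf_avar(a, m) / N), 4))
```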

Worked Example: MA(1) Parameter Estimation

Complete Estimation Procedure

Problem Setup

Given an MA(1) model with $N = 100$ observations and estimated parameter $\hat{\theta} = 0.6$. Construct a 95% confidence interval for the true parameter $\theta$.

Model Specification

$$X_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2)$$
Step-by-Step Solution
1

Determine Asymptotic Variance

For MA(1), the asymptotic variance of MLE is:

$$\text{Avar}(\sqrt{N}\hat{\theta}) = \frac{1-\theta^2}{1+\theta^2+\theta^4}$$
2

Plug-in Estimation

Substitute $\hat{\theta} = 0.6$:

$$\widehat{\text{Avar}} = \frac{1-0.6^2}{1+0.6^2+0.6^4} = \frac{0.64}{1.4896} \approx 0.4297$$

$$\widehat{\text{SE}}(\hat{\theta}) = \sqrt{\frac{0.4297}{100}} \approx 0.0656$$
3

Construct Confidence Interval

Using normal approximation (95% → z = 1.96):

$$\text{CI} = 0.6 \pm 1.96 \times 0.0656 = 0.6 \pm 0.1286 = [0.471,\, 0.729]$$
4

Interpretation

With 95% confidence, the true MA parameter $\theta$ lies in $[0.471, 0.729]$. Since the interval does not contain 0, there is strong evidence that the MA component is significant.
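
The same calculation as a short script, assuming the asymptotic-variance formula quoted in Step 1; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def ma1_ci(theta_hat, N, level=0.95):
    """Confidence interval for the MA(1) parameter using the asymptotic variance from Step 1."""
    avar = (1 - theta_hat**2) / (1 + theta_hat**2 + theta_hat**4)
    se = np.sqrt(avar / N)
    z = norm.ppf(0.5 + level / 2)
    return theta_hat - z * se, theta_hat + z * se

print(ma1_ci(0.6, 100))   # roughly (0.471, 0.729), matching the worked example
```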

Alternative: Bootstrap Approach

For small samples or non-normal innovations, bootstrap confidence intervals may be more accurate:

  1. Estimate the model and obtain residuals $\hat{\epsilon}_t$
  2. Resample the residuals with replacement: $\epsilon_t^*$
  3. Generate a bootstrap series: $X_t^* = \epsilon_t^* + \hat{\theta}\epsilon_{t-1}^*$
  4. Re-estimate the model on $X_t^*$ to get $\hat{\theta}^*$
  5. Repeat B = 1000 times and use percentiles of $\{\hat{\theta}^*_b\}$ (a sketch follows below)
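
A sketch of this residual bootstrap, assuming a zero-mean MA(1) and using statsmodels' ARIMA routine for the repeated re-estimation; the seed, the simulated data, and the reduced number of replications in the usage line are illustrative assumptions:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)

def ma1_bootstrap_ci(x, B=1000, level=0.95):
    """Residual bootstrap for the MA(1) parameter (sketch; assumes a zero-mean MA(1))."""
    fit = ARIMA(x, order=(0, 0, 1), trend="n").fit()
    theta_hat = fit.params[0]
    resid = np.asarray(fit.resid)
    resid = resid - resid.mean()                 # recentre residuals before resampling
    thetas = np.empty(B)
    for b in range(B):
        e = rng.choice(resid, size=len(x) + 1, replace=True)
        xb = e[1:] + theta_hat * e[:-1]          # bootstrap series X*_t = e*_t + theta_hat * e*_{t-1}
        thetas[b] = ARIMA(xb, order=(0, 0, 1), trend="n").fit().params[0]
    lo, hi = np.percentile(thetas, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return theta_hat, (lo, hi)

# Illustrative data generated from an MA(1) with theta = 0.6; B reduced to keep the demo fast
e = rng.normal(size=201)
x = e[1:] + 0.6 * e[:-1]
print(ma1_bootstrap_ci(x, B=200))
```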

Model Diagnostics & Residual Analysis

Comprehensive procedures for assessing model adequacy

Three-Stage Diagnostic Framework
1
Residual Calculation

Compute standardized residuals:

$$e_t = \frac{X_t - \hat{X}_{t|t-1}}{\sqrt{\hat{\sigma}_t^2}}$$

where $\hat{X}_{t|t-1}$ is the one-step-ahead forecast.

2
Graphical Analysis
  • Time series plot of residuals
  • ACF/PACF of residuals
  • QQ-plot for normality
  • Histogram + density estimate
  • Residuals vs. fitted values
3
Formal Tests
  • Ljung-Box test (H₀: white noise)
  • Jarque-Bera test (H₀: normality)
  • ARCH test (H₀: homoscedasticity)
  • Runs test (H₀: randomness)
Ljung-Box Q-Statistic

Modified version of Box-Pierce with better small-sample properties:

$$Q_{LB}(m) = N(N+2)\sum_{k=1}^m \frac{\hat{\rho}_k^2}{N-k} \sim \chi^2(m-p-q)$$

The degrees of freedom are adjusted for the estimated ARMA(p, q) parameters.
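
In practice this test is usually run with a library routine. A brief sketch using statsmodels' acorr_ljungbox, where the white-noise input stands in for the residuals of a fitted model and model_df = 2 assumes an ARMA(1,1):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# Illustrative residuals: white noise stands in for the residuals of a fitted ARMA(1,1)
rng = np.random.default_rng(4)
resid = rng.normal(size=200)

# model_df = p + q = 2 adjusts the chi-square degrees of freedom for the estimated parameters
lb = acorr_ljungbox(resid, lags=[10, 15, 20], model_df=2)
print(lb[["lb_stat", "lb_pvalue"]])
```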

Decision Rules

Model is Adequate if:

  • ✓ Residual ACF within confidence bands
  • ✓ Ljung-Box p-value > 0.05
  • ✓ QQ-plot approximately linear
  • ✓ No obvious patterns in residual plot

Model Needs Revision if:

  • ✗ Multiple ACF lags significant
  • ✗ Ljung-Box p-value < 0.05
  • ✗ Heavy tails in QQ-plot
  • ✗ Systematic patterns/heteroscedasticity
Practical Workflow Example

Scenario: Quarterly sales data (N=80)

After differencing and seasonal adjustment, you fit an ARMA(1,1) model. Estimated parameters: φ=0.7, θ=0.4, σ²=2.5. Assess model adequacy.

Step 1: Compute and Plot Residuals

Generate standardized residuals e_t and create time series plot. ✓ No obvious trends or volatility clustering observed.

Step 2: ACF Analysis

Compute sample ACF up to lag 20. Confidence bands: ±1.96/√80 ≈ ±0.219. ✓ All lags within bands except lag 12 (ρ̂₁₂ = 0.23), possibly spurious.

Step 3: Ljung-Box Test

Test up to lag m=15 (adjusted df = 15-2=13):

$$Q_{LB}(15) = 80 \times 82 \sum_{k=1}^{15} \frac{\hat{\rho}_k^2}{80-k} = 16.7$$

Critical value: χ²(13, 0.95) ≈ 22.36. Since 16.7 < 22.36, fail to reject H₀. ✓

Step 4: Normality Check

QQ-plot shows good alignment with theoretical quantiles except slight heaviness in right tail. Jarque-Bera test p-value = 0.08. ✓ Acceptable at 5% level.

Conclusion

ARMA(1,1) model appears adequate. All diagnostic tests support white noise assumption for residuals. Proceed with forecasting and inference.

White Noise Testing

Diagnostic tests for model adequacy

Chi-Square (Portmanteau) Test

Test Statistic

Under the white noise null hypothesis:

$$X^2(m) = N(\hat{\rho}_1^2 + \hat{\rho}_2^2 + \cdots + \hat{\rho}_m^2) \sim \chi^2(m)$$

Reject white noise if $X^2(m) > \chi^2_{m,1-\alpha}$, where $\alpha$ is the significance level.
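
A hand-rolled version of this test, assuming simulated white-noise input for the demonstration; the function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import chi2

def portmanteau_test(x, m, alpha=0.05):
    """X^2(m) = N * sum of rho_hat_k^2, compared against the chi2(m) critical value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    rho = np.array([np.dot(xc[: n - k], xc[k:]) / n / gamma0 for k in range(1, m + 1)])
    stat = n * np.sum(rho**2)
    crit = chi2.ppf(1 - alpha, df=m)
    return stat, crit, stat > crit      # True in the last slot means reject white noise

rng = np.random.default_rng(5)
print(portmanteau_test(rng.normal(size=300), m=10))
```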

Advantages of Chi-Square Test

Joint Testing

Tests multiple lags simultaneously

Higher Power

More efficient than individual tests

Error Control

Automatic Type I error control

Parameter Selection

Choosing m: The number of lags to test

  • Typical choice: $m \leq 10$ in practice
  • Too large an m reduces test power (higher-lag autocorrelations decay quickly toward 0)
  • Too small an m may miss important dependencies
ACF Confidence Interval Method

Individual Testing

Under white noise assumption, for each lag k:

$$P(\sqrt{N}|\hat{\rho}_k| > 1.96) \approx 0.05$$

95% Confidence Interval:

$$|\hat{\rho}_k| \leq \frac{1.96}{\sqrt{N}}$$
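
A small sketch that flags the lags violating this band, assuming simulated white-noise data for illustration:

```python
import numpy as np

def acf_band_violations(x, max_lag):
    """List the lags whose sample ACF falls outside the +/- 1.96 / sqrt(N) white-noise band."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    band = 1.96 / np.sqrt(n)
    violations = []
    for k in range(1, max_lag + 1):
        rho_k = np.dot(xc[: n - k], xc[k:]) / n / gamma0
        if abs(rho_k) > band:
            violations.append((k, round(rho_k, 3)))
    return band, violations

rng = np.random.default_rng(6)
print(acf_band_violations(rng.normal(size=400), max_lag=20))
```
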
Multiple Testing Issue

When testing m lags simultaneously:

• Even if truly white noise, ~5% of lags will fall outside bounds

• For m=20 lags, expect ~1 false rejection

• Need to consider the overall pattern, not individual violations

Practical Recommendation
1
Use chi-square test for overall assessment of white noise
2
Plot ACF with confidence bands to identify problematic lags
3
Look for systematic patterns, not isolated exceedances

Practical Guidelines & Pitfalls

Common Estimation Pitfalls

Insufficient Sample Size

Asymptotic results (CLT, consistency) rely on $N \to \infty$. For $N < 50$, estimates can be heavily biased. Recommendation: Use bootstrap methods or small-sample corrections (e.g., AICc) for short series.

Ignoring Non-Stationarity

Applying standard estimation to non-stationary data (trends, unit roots) yields spurious results. Recommendation: Always perform unit root tests (ADF, KPSS) and difference the data if necessary before estimation.

Over-Parameterization

Fitting high-order ARMA models to capture noise leads to high variance and poor forecasting. Recommendation: Adhere to the principle of parsimony and use AIC/BIC for model selection.

Best Practices Checklist

Visual Inspection First

Plot the time series, ACF, and PACF before any modeling. Look for outliers, seasonality, and trends.

Residual Diagnostics

Never accept a model without checking residuals for whiteness (Ljung-Box) and normality (QQ-plot).

Compare Multiple Models

Don't stop at the first "good" model. Compare 2-3 candidates using Information Criteria and out-of-sample validation.

Report Uncertainty

Always provide confidence intervals for parameters and prediction intervals for forecasts.

Practice Quiz
10 Questions

  1. What is Maximum Likelihood Estimation (MLE)?
  2. For ARMA models, what is the relationship between MLE and OLS?
  3. What are the key asymptotic properties of MLE for ARMA models?
  4. What is the Wald test used for?
  5. What should residuals look like if the model is adequate?
  6. How do AIC and BIC differ in their penalty for model complexity?
  7. What is overfitting in time series models?
  8. What is the Likelihood Ratio (LR) test used for?
  9. What does a significant Ljung-Box Q statistic indicate?
  10. How do you construct a forecast interval for ARMA models?

Frequently Asked Questions

Why do we divide by N instead of N-k when estimating autocovariance?

Dividing by N (rather than N-k) ensures that the sample autocovariance matrix is positive definite, which is crucial for statistical inference. While N-k might seem more 'unbiased' for large k, it can lead to non-positive-definite covariance matrices, breaking the mathematical properties needed for estimation and hypothesis testing.

What is the difference between consistency and strong consistency?

Consistency means the estimator converges to the true value in probability ($\bar{X}_N \xrightarrow{P} \mu$), while strong consistency means almost sure convergence ($\bar{X}_N \to \mu$ a.s.). Strong consistency is a stronger condition that requires ergodicity of the sequence. In practice, for strictly stationary ergodic sequences, we have strong consistency, which provides stronger guarantees about estimation accuracy.

How does spectral density f(0) affect convergence rate?

The spectral density at frequency zero, f(0), determines the asymptotic variance of the sample mean. It captures the 'long-run variance' of the process. Higher f(0) means stronger long-term dependence and slower convergence. The asymptotic variance is 2πf(0)/N, so processes with f(0)=0 (which don't exist in practice) would converge infinitely fast, while those with large f(0) converge more slowly.

When should I use chi-square test vs individual ACF confidence intervals?

The chi-square test (Portmanteau test) is more powerful for detecting overall departure from white noise because it jointly tests multiple lags. Individual ACF tests are useful for identifying specific problematic lags but suffer from multiple testing issues. For model diagnostics, use both: chi-square for overall assessment and ACF plot for identifying which lags are problematic.

What does the Law of Iterated Logarithm tell us that CLT doesn't?

While CLT gives the rate O(1/√N), the LIL provides the exact bounds for fluctuations: the sample mean oscillates infinitely often between ±√(2πf(0))·√(2 ln ln N/N). This gives us the 'worst-case' behavior and shows that √N(Xbar-μ) doesn't converge but oscillates within precise bounds. It's like knowing not just the average error, but the maximum likely deviation.
