Master the fundamental principles of statistical hypothesis testing: from basic concepts and error analysis to advanced methods and real-world applications in statistical inference.
Core concepts in hypothesis testing theory
The baseline hypothesis under test, typically containing '=', '≥', or '≤', representing the status quo or no effect condition.
Mathematical:
H₀: θ ∈ Θ₀ (the parameter lies in the null parameter set Θ₀)
Example:
H₀: μ = μ₀ (population mean equals specified value)
The hypothesis that contradicts H₀, typically containing '≠', '>', or '<', representing what we're trying to detect.
Mathematical:
H₁: θ ∈ Θ₁, where Θ₁ = Θ \ Θ₀ (the complement of the null set)
Example:
H₁: μ ≠ μ₀ (two-sided), H₁: μ > μ₀ (right-sided)
The probability of rejecting H₀ when it is actually true (false positive). Controlled by significance level.
Mathematical:
α = P(reject H₀ | H₀ is true)
Example:
α = 0.05 means 5% chance of false rejection
The probability of failing to reject H₀ when H₁ is true (false negative). Related to statistical power.
Mathematical:
β = P(fail to reject H₀ | H₁ is true)
Example:
Power = 1 - β measures a test's ability to detect true effects
H₀ and H₁ must be mutually exclusive and collectively exhaustive
H₀ typically represents the current belief, no change, or no effect
H₁ represents what requires evidence to establish (burden of proof)
Choose a one-sided or two-sided test based on the research question
Structure:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀
Rejection Region:
Both tails, e.g. |Z| > z_{α/2} for a z-test
Example: Testing if the population mean differs from a specified value
Applications:
Detecting a departure from a target value in either direction
Structure:
H₀: μ ≤ μ₀ vs H₁: μ > μ₀
Rejection Region:
Right tail, e.g. Z > z_α for a z-test
Example: Testing if a new process increases efficiency
Applications:
Demonstrating an improvement or increase over a baseline
Structure:
H₀: μ ≥ μ₀ vs H₁: μ < μ₀
Rejection Region:
Left tail, e.g. Z < -z_α for a z-test
Example: Testing if a new method reduces error rate
Applications:
Demonstrating a reduction or decrease below a baseline
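The three rejection regions above differ only in where the critical value sits. Below is a minimal sketch, assuming a z-test statistic and an illustrative α = 0.05, that looks up all three critical values with scipy:

```python
# A minimal sketch of the three rejection regions for a z-test,
# using scipy.stats.norm to look up critical values (alpha = 0.05
# is an assumed illustration value).
from scipy.stats import norm

alpha = 0.05
z_two = norm.ppf(1 - alpha / 2)   # two-sided:  reject if |Z| > 1.960
z_right = norm.ppf(1 - alpha)     # right-sided: reject if  Z > 1.645
z_left = norm.ppf(alpha)          # left-sided:  reject if  Z < -1.645

print(f"two-sided:   |Z| > {z_two:.3f}")
print(f"right-sided:  Z > {z_right:.3f}")
print(f"left-sided:   Z < {z_left:.3f}")
```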
Understanding Type I and Type II errors, and optimizing test power
Optimal test construction for simple hypotheses
Control the maximum Type I error probability at level α, and among all such tests, choose the one with minimum Type II error (maximum power).
α = 0.01: Very strong evidence required
α = 0.05: Standard in most fields
Among all tests with the same significance level, choose the one with the highest power (Uniformly Most Powerful, when it exists)
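To make the power side of this trade-off concrete, here is a short sketch of the power of a right-sided z-test; the effect size δ, σ, and the sample sizes are assumed illustration values, and the formula Power = Φ(√n·δ/σ - z_α) follows from the normal distribution of X̄ under the alternative:

```python
# A sketch of the power calculation for a right-sided z-test of
# H0: mu = mu0 vs H1: mu = mu0 + delta (delta, sigma, and n are
# assumed illustration values).
import numpy as np
from scipy.stats import norm

alpha, sigma, delta = 0.05, 1.0, 0.5
z_alpha = norm.ppf(1 - alpha)

for n in (10, 30, 100):
    power = norm.cdf(np.sqrt(n) * delta / sigma - z_alpha)
    print(f"n = {n:3d}:  power = {power:.3f}")  # power grows with n
```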
State H₀ and H₁, then select an appropriate test statistic
Determine the rejection region based on the direction of H₁ and the significance level α
Compute the value of the test statistic from the sample data
Compare the statistic to the critical value, or equivalently the P-value to α, and decide
Compare test statistic to critical value
Advantages: Direct comparison, Clear decision boundary
Disadvantages: Doesn't show strength of evidence
Compare P-value to significance level
Advantages: Shows strength of evidence, More informative
Disadvantages: Can be misinterpreted
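The two methods always agree, since comparing the P-value to α and comparing the statistic to the critical value are two readings of the same inequality. A minimal sketch on simulated data (μ₀, σ, and the sample itself are assumed for illustration):

```python
# A sketch comparing the critical-value method and the P-value method
# on the same simulated one-sample z-test; both must reach the same
# decision.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, alpha = 0.0, 1.0, 0.05
x = rng.normal(loc=0.4, scale=sigma, size=50)  # true mean 0.4, so H0 is false

z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))

# Method 1: compare the statistic to the two-sided critical value.
z_crit = norm.ppf(1 - alpha / 2)
reject_critical = abs(z) > z_crit

# Method 2: compare the two-sided P-value to alpha.
p_value = 2 * (1 - norm.cdf(abs(z)))
reject_pvalue = p_value < alpha

print(f"z = {z:.3f}, critical = {z_crit:.3f}, p = {p_value:.4f}")
assert reject_critical == reject_pvalue  # the two methods always agree
```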
Reject H₀:
Strong evidence against H₀ in favor of H₁
Fail to Reject H₀:
Insufficient evidence to reject H₀ (not proof of H₀)
Common Mistakes:
Treating "fail to reject H₀" as proof that H₀ is true; reading the P-value as the probability that H₀ is true
For testing simple hypotheses H₀: θ = θ₀ vs H₁: θ = θ₁, the likelihood ratio test is the most powerful test of size α.
Define the test function φ(x) = 1 if L₁(x) > k·L₀(x) and φ(x) = 0 if L₁(x) < k·L₀(x), with k (and, if needed, randomization on the boundary L₁(x) = k·L₀(x)) chosen so that E_θ₀[φ(X)] = α. Then φ is the most powerful level-α test for H₀ vs H₁.
Consider simple hypothesis testing: H₀: θ = θ₀ versus H₁: θ = θ₁. Let X = (X₁, ..., Xₙ) be the data vector with likelihood functions L₀(x) = L(θ₀; x) and L₁(x) = L(θ₁; x).
Define the rejection region based on likelihood ratio: D_k = {x : L₁(x)/L₀(x) > k}, where k is chosen to satisfy the size constraint. This forms the basis of the likelihood ratio test.
Choose k such that the test has exactly size α: P_θ₀(X ∈ D_k) = α. Under regularity conditions, there exists such a k. This ensures the Type I error rate is controlled at level α.
Let D' be any other rejection region satisfying P_θ₀(X ∈ D') ≤ α. We need to show that P_θ₁(X ∈ D_k) ≥ P_θ₁(X ∈ D'), i.e., the LRT has maximum power among all level-α tests.
Consider the integral ∫ [1_{D_k}(x) - 1_{D'}(x)] [L₁(x) - k·L₀(x)] dx. On D_k the factor L₁(x) - k·L₀(x) is positive and 1_{D_k} - 1_{D'} ≥ 0; outside D_k the factor is ≤ 0 and 1_{D_k} - 1_{D'} ≤ 0, so the integrand is non-negative everywhere.
Expanding the integral gives P_θ₁(X ∈ D_k) - P_θ₁(X ∈ D') ≥ k[P_θ₀(X ∈ D_k) - P_θ₀(X ∈ D')] = k[α - P_θ₀(X ∈ D')] ≥ 0, so P_θ₁(X ∈ D_k) ≥ P_θ₁(X ∈ D'). The likelihood ratio test is therefore most powerful for simple hypotheses (and trivially UMP, since the alternative consists of a single point).
For testing H₀: μ = 0 vs H₁: μ = 1 in N(μ, 1), the LRT reduces to rejecting H₀ when X̄ > c, which is the most powerful test.
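A Monte Carlo sketch can make this concrete. Under an assumed setup of n = 9 i.i.d. N(μ, 1) observations, the size-α LRT based on X̄ is compared against an equally sized but deliberately wasteful test that looks only at the first observation:

```python
# Monte Carlo sketch of the Neyman-Pearson lemma for H0: mu = 0 vs
# H1: mu = 1 with N(mu, 1) data: the likelihood-ratio test (reject when
# the sample mean exceeds c) beats another test of the same size that
# ignores most of the data.  n, alpha, and reps are assumed values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, alpha, reps = 9, 0.05, 100_000
x = rng.normal(loc=1.0, scale=1.0, size=(reps, n))  # data under H1

# Most powerful test: reject when xbar > z_alpha / sqrt(n)  (size alpha).
c_mp = norm.ppf(1 - alpha) / np.sqrt(n)
power_mp = np.mean(x.mean(axis=1) > c_mp)

# A competing size-alpha test that looks only at the first observation.
power_naive = np.mean(x[:, 0] > norm.ppf(1 - alpha))

print(f"LRT power   = {power_mp:.3f}")    # ~ Phi(3 - 1.645) ~ 0.91
print(f"naive power = {power_naive:.3f}")  # ~ Phi(1 - 1.645) ~ 0.26
```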
Under regularity conditions, the generalized likelihood ratio statistic -2 log Λ converges in distribution to a chi-square distribution as the sample size approaches infinity:
-2 log Λ →ᵈ χ²(r) as n → ∞, where r = dim(Θ) - dim(Θ₀) is the difference in parameter dimensions.
Consider the generalized likelihood ratio Λ(X) = L(θ̂₀; X) / L(θ̂; X), where θ̂ is the unrestricted MLE and θ̂₀ is the MLE under H₀: θ ∈ Θ₀. The statistic ranges from 0 to 1.
Consider the log-likelihood ratio: -2 log Λ = 2[ℓ(θ̂) - ℓ(θ̂₀)], where ℓ(θ) = log L(θ; X) is the log-likelihood. This transformation is monotone and more analytically tractable.
Expand ℓ(θ̂) and ℓ(θ̂₀) around the true θ₀ (assuming H₀ is true). Using Taylor's theorem to second order, we get quadratic forms involving the score and information matrix.
By the asymptotic normality of the MLE, we have √n(θ̂ - θ₀) →ᵈ N(0, I(θ₀)⁻¹), where I(θ₀) is the Fisher information matrix. This is a fundamental result in maximum likelihood theory.
The Fisher information matrix can be consistently estimated by the observed information. By LLN, the empirical information converges to the true Fisher information: În → I(θ₀) in probability.
Combining steps 3-5 with Slutsky's theorem, -2 log Λ is asymptotically a quadratic form in a multivariate normal vector, which follows a χ²(r) distribution, where r is the difference in dimensions between the full and restricted parameter spaces.
Testing H₀: μ₁ = μ₂ = μ₃ in three normal populations, -2 log Λ approximately follows χ²(2) for large samples.
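A sketch of this example, assuming known unit variances so that -2 log Λ has a closed form (it reduces to Σ nᵢ(X̄ᵢ - X̄)²); the group sizes and means are illustration values:

```python
# A sketch of Wilks' theorem for H0: mu1 = mu2 = mu3 with N(mu_i, 1)
# data (known unit variance, so -2 log Lambda has a closed form).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
groups = [rng.normal(loc=0.0, scale=1.0, size=40) for _ in range(3)]  # H0 true

# Unrestricted MLEs are the group means; the H0 MLE is the grand mean.
all_x = np.concatenate(groups)
grand_mean = all_x.mean()
neg2_log_lambda = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# r = dim(Theta) - dim(Theta0) = 3 - 1 = 2 parameters removed by H0.
p_value = chi2.sf(neg2_log_lambda, df=2)
print(f"-2 log Lambda = {neg2_log_lambda:.3f}, p = {p_value:.3f}")
```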
Standard tests for normal populations and common parameters
Scenario:
Testing population mean with known variance
Hypotheses:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀ (or a one-sided alternative)
Assumptions:
X₁, ..., Xₙ i.i.d. N(μ, σ²) with σ² known (or n large enough for the CLT)
Test Statistic:
Z = (X̄ - μ₀) / (σ/√n) ~ N(0, 1) under H₀
Rejection Regions:
Two-sided: |Z| > z_{α/2}; right-sided: Z > z_α; left-sided: Z < -z_α
Example Application:
Testing if mean height = 170cm with σ = 5cm known
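A minimal sketch of this height example; the sample itself is simulated (an assumption for illustration), while μ₀ = 170 and σ = 5 come from the scenario above:

```python
# A sketch of the z-test from the height example: H0: mu = 170 with
# sigma = 5 known; the data are simulated illustration values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu0, sigma, alpha = 170.0, 5.0, 0.05
heights = rng.normal(loc=172.0, scale=sigma, size=25)  # true mean 172

z = (heights.mean() - mu0) / (sigma / np.sqrt(len(heights)))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided

print(f"z = {z:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```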
Scenario:
Testing population mean with unknown variance
Hypotheses:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀ (or a one-sided alternative)
Assumptions:
X₁, ..., Xₙ i.i.d. N(μ, σ²) with σ² unknown
Test Statistic:
T = (X̄ - μ₀) / (S/√n) ~ t(n-1) under H₀, where S is the sample standard deviation
Rejection Regions:
Two-sided: |T| > t_{α/2}(n-1); right-sided: T > t_α(n-1); left-sided: T < -t_α(n-1)
Example Application:
Testing if new teaching method improves test scores
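A sketch of the one-sample t-test using scipy.stats.ttest_1samp; the scores are simulated, and the baseline mean of 75 is an assumed illustration value:

```python
# A sketch of the one-sample t-test (sigma unknown); 75 stands in for
# the historical mean score, and the data are simulated illustrations.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(4)
scores = rng.normal(loc=78.0, scale=8.0, size=30)  # new-method scores

t_stat, p_value = ttest_1samp(scores, popmean=75.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # two-sided by default
```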
Scenario:
Comparing means with unknown but equal variances
Hypotheses:
H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂ (or a one-sided alternative)
Assumptions:
Independent samples from N(μ₁, σ²) and N(μ₂, σ²) with equal but unknown variance
Test Statistic:
T = (X̄ - Ȳ) / (S_w √(1/n₁ + 1/n₂)) ~ t(n₁ + n₂ - 2) under H₀
Pooled Variance:
S_w² = [(n₁ - 1)S₁² + (n₂ - 1)S₂²] / (n₁ + n₂ - 2)
Rejection Regions:
Two-sided: |T| > t_{α/2}(n₁ + n₂ - 2); one-sided analogously with t_α(n₁ + n₂ - 2)
Example Application:
Comparing test scores between two teaching methods
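A sketch of the pooled two-sample t-test via scipy.stats.ttest_ind with equal_var=True, matching the equal-variance assumption above; both score samples are simulated illustration values:

```python
# A sketch of the pooled two-sample t-test; equal_var=True selects the
# pooled-variance statistic described above.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
method_a = rng.normal(loc=75.0, scale=8.0, size=30)
method_b = rng.normal(loc=80.0, scale=8.0, size=35)

t_stat, p_value = ttest_ind(method_a, method_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```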
A general method for constructing hypothesis tests using likelihood functions
Motivation:
When optimal tests don't exist or are unknown, GLRT provides a systematic approach
Principle:
Compare maximum likelihood under full parameter space to maximum likelihood under null hypothesis constraint
Λ(x) = sup_{θ ∈ Θ₀} L(θ; x) / sup_{θ ∈ Θ} L(θ; x)
where the numerator maximizes the likelihood under the H₀ constraint and the denominator maximizes over the full parameter space; H₀ is rejected when Λ is small, equivalently when -2 log Λ is large.
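As a worked instance of this recipe, here is a sketch of the GLRT for H₀: λ = λ₀ in an Exponential(λ) model, where both maximizations have closed forms (λ̂ = 1/X̄ over the full space, λ₀ itself under H₀); λ₀ and the sample are assumed illustration values:

```python
# A sketch of the GLRT recipe for H0: lam = lam0 in an Exponential(lam)
# model, comparing -2 log Lambda to chi-square(1) per Wilks' theorem.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
lam0 = 1.0
x = rng.exponential(scale=1 / lam0, size=100)  # H0 true here

n, xbar = len(x), x.mean()
lam_hat = 1 / xbar  # unrestricted MLE

# log L(lam) = n log(lam) - lam * sum(x);  -2 log Lambda = 2[l(lam_hat) - l(lam0)]
neg2_log_lambda = 2 * (n * np.log(lam_hat / lam0) - (lam_hat - lam0) * n * xbar)
p_value = chi2.sf(neg2_log_lambda, df=1)  # r = 1 restricted parameter
print(f"-2 log Lambda = {neg2_log_lambda:.3f}, p = {p_value:.3f}")
```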
There's a one-to-one correspondence between confidence intervals and hypothesis tests at the same confidence/significance level
From acceptance regions to confidence sets
Explanation: The confidence set contains all parameter values that would not be rejected by the test
From confidence sets to acceptance regions
Explanation: Accept H₀: θ = θ₀ if and only if θ₀ lies within the confidence interval
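A sketch verifying this duality numerically for the one-sample t-test: for each candidate μ₀ (an assumed illustration grid), the test fails to reject at level α exactly when μ₀ lies inside the (1 - α) confidence interval:

```python
# A sketch of the duality: the two-sided t-test at level alpha rejects
# H0: mu = mu0 exactly when mu0 falls outside the (1 - alpha) t-based
# confidence interval.  Sample and mu0 grid are illustration values.
import numpy as np
from scipy.stats import ttest_1samp, t

rng = np.random.default_rng(7)
x, alpha = rng.normal(loc=10.0, scale=2.0, size=20), 0.05

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
half_width = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

for mu0 in (9.0, 10.0, 11.0, 12.0):
    p = ttest_1samp(x, popmean=mu0).pvalue
    inside = ci[0] <= mu0 <= ci[1]
    assert (p >= alpha) == inside  # duality: not rejected <=> inside CI
    print(f"mu0 = {mu0}: p = {p:.4f}, inside CI = {inside}")
```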
Practical applications of hypothesis testing across different domains
The null hypothesis H₀ is the hypothesis we try to challenge (usually representing "no effect" or "no difference"), while the alternative hypothesis H₁ is what we seek evidence to support. In hypothesis testing, we always start from the premise "assume H₀ is true," then see if the data provides strong enough evidence to reject it. This asymmetry reflects the "skepticism" principle in the scientific method.
This stems from the philosophical foundation of the Neyman-Pearson principle. Type I error (rejecting true H₀) usually has more serious consequences because it means we incorrectly claim to have discovered some effect. Type II error (failing to reject false H₀) merely means we haven't found sufficient evidence. In scientific research, we prefer "better to miss than to wrongly assert."
This is one of the most common misunderstandings in hypothesis testing. "Fail to reject H₀" only means the data didn't provide strong enough evidence to refute H₀, not that H₀ is necessarily true. This is like in court "insufficient evidence" ≠ "innocent." We can never prove H₀ is true, only say the data is compatible with H₀.
Key Point: Absence of evidence is not evidence of absence
This depends on your research question. If you only care whether the parameter deviates in one direction (e.g., "does the new drug improve efficacy?"), use a one-sided test. If you care whether the parameter differs from a value in either direction, use a two-sided test. Principle: decide based on the substantive research question, not the data, and fix the choice before seeing the data.
Comparison: One-sided tests have higher power but can only detect differences in one direction; two-sided tests are more conservative but can detect both directions
The P-value is the probability of observing the current data or more extreme data under the assumption that H₀ is true. α is the threshold we set beforehand. Decision rule: if P-value < α, reject H₀. Note that the P-value is not "the probability that H₀ is true" (that's a Bayesian posterior probability concept).
The key is whether the population variance is known. If the population variance σ² is known, use the z-test (also called the U-test); if σ² is unknown and must be estimated by the sample variance, use the t-test. In practice the population variance is usually unknown, so the t-test is more common. When the sample size is large (n > 30), the t-distribution approximates the normal distribution and the two tests give similar results.
This is mainly a historical convention rather than a mathematical necessity. R.A. Fisher proposed 0.05 as a "suspicious" threshold in the 1920s, and it later became the conventional standard. In practice, the choice of α should be based on the characteristics of the field and the costs of errors: medical research often uses 0.01 (stricter), while exploratory research may use 0.10 (more relaxed). Importantly, fix α before data collection and state it clearly in reports.
Historical Note: Fisher originally described 0.05 as a "convenient approximation," not an absolute standard
They have a precise duality relationship. At the same significance level, if the parameter value θ₀ falls within the (1-α) confidence interval, then we cannot reject H₀: θ = θ₀ at significance level α, and vice versa. Confidence intervals provide more information than hypothesis tests: they not only tell us whether to reject a specific value but also give the range of all plausible parameter values.