Master the fundamental principles of statistical hypothesis testing: from basic concepts and error analysis to advanced methods and real-world applications in statistical inference.
Core concepts in hypothesis testing theory
The baseline hypothesis under test, typically containing '=', '≥', or '≤', representing the status quo or no effect condition.
Mathematical:
H₀: θ ∈ Θ₀ (the parameter lies in the null parameter set Θ₀)
Example:
H₀: μ = μ₀ (population mean equals specified value)
The hypothesis that contradicts H₀, typically containing '≠', '>', or '<', representing what we're trying to detect.
Mathematical:
H₁: θ ∈ Θ₁, where Θ₁ = Θ \ Θ₀ (the complement of the null set)
Example:
H₁: μ ≠ μ₀ (two-sided), H₁: μ > μ₀ (right-sided)
The probability of rejecting H₀ when it is actually true (false positive). Controlled by significance level.
Mathematical:
α = P(reject H₀ | H₀ is true)
Example:
α = 0.05 means 5% chance of false rejection
The probability of failing to reject H₀ when H₁ is true (false negative). Related to statistical power.
Mathematical:
β = P(fail to reject H₀ | H₁ is true)
Example:
Power = 1 - β measures a test's ability to detect true effects
H₀ and H₁ must be mutually exclusive and collectively exhaustive
H₀ typically represents the current belief, no change, or no effect
H₁ represents what requires evidence to establish (burden of proof)
Choose a one-sided or two-sided test based on the research question
Structure:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀
Rejection Region:
Both tails, e.g. |Z| > z_{α/2} for a z-test
Example: Testing if the population mean differs from a specified value
Applications:
Detecting a departure from a target value in either direction
Structure:
H₀: μ ≤ μ₀ vs H₁: μ > μ₀
Rejection Region:
Right tail, e.g. Z > z_α for a z-test
Example: Testing if a new process increases efficiency
Applications:
Demonstrating an improvement or increase over a baseline
Structure:
H₀: μ ≥ μ₀ vs H₁: μ < μ₀
Rejection Region:
Left tail, e.g. Z < -z_α for a z-test
Example: Testing if a new method reduces error rate
Applications:
Demonstrating a reduction or decrease below a baseline
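The three rejection regions above differ only in where the critical value sits. Below is a minimal sketch, assuming a z-test statistic and an illustrative α = 0.05, that looks up all three critical values with scipy:

```python
# A minimal sketch of the three rejection regions for a z-test,
# using scipy.stats.norm to look up critical values (alpha = 0.05
# is an assumed illustration value).
from scipy.stats import norm

alpha = 0.05
z_two = norm.ppf(1 - alpha / 2)   # two-sided:  reject if |Z| > 1.960
z_right = norm.ppf(1 - alpha)     # right-sided: reject if  Z > 1.645
z_left = norm.ppf(alpha)          # left-sided:  reject if  Z < -1.645

print(f"two-sided:   |Z| > {z_two:.3f}")
print(f"right-sided:  Z > {z_right:.3f}")
print(f"left-sided:   Z < {z_left:.3f}")
```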
Understanding Type I and Type II errors, and optimizing test power
Optimal test construction for simple hypotheses
Control the maximum Type I error probability at level α, and among all such tests, choose the one with minimum Type II error (maximum power).
α = 0.01: Very strong evidence required
α = 0.05: Standard in most fields
Among all tests with the same significance level, choose the one with the highest power (Uniformly Most Powerful, when it exists)
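To make the power side of this trade-off concrete, here is a short sketch of the power of a right-sided z-test; the effect size δ, σ, and the sample sizes are assumed illustration values, and the formula Power = Φ(√n·δ/σ - z_α) follows from the normal distribution of X̄ under the alternative:

```python
# A sketch of the power calculation for a right-sided z-test of
# H0: mu = mu0 vs H1: mu = mu0 + delta (delta, sigma, and n are
# assumed illustration values).
import numpy as np
from scipy.stats import norm

alpha, sigma, delta = 0.05, 1.0, 0.5
z_alpha = norm.ppf(1 - alpha)

for n in (10, 30, 100):
    power = norm.cdf(np.sqrt(n) * delta / sigma - z_alpha)
    print(f"n = {n:3d}:  power = {power:.3f}")  # power grows with n
```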
State H₀ and H₁, then select an appropriate test statistic
Determine the rejection region based on the direction of H₁ and the significance level α
Compute the value of the test statistic from the sample data
Compare the statistic to the critical value, or equivalently the P-value to α, and decide
Compare test statistic to critical value
Advantages: Direct comparison, Clear decision boundary
Disadvantages: Doesn't show strength of evidence
Compare P-value to significance level
Advantages: Shows strength of evidence, More informative
Disadvantages: Can be misinterpreted
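The two methods always agree, since comparing the P-value to α and comparing the statistic to the critical value are two readings of the same inequality. A minimal sketch on simulated data (μ₀, σ, and the sample itself are assumed for illustration):

```python
# A sketch comparing the critical-value method and the P-value method
# on the same simulated one-sample z-test; both must reach the same
# decision.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, alpha = 0.0, 1.0, 0.05
x = rng.normal(loc=0.4, scale=sigma, size=50)  # true mean 0.4, so H0 is false

z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))

# Method 1: compare the statistic to the two-sided critical value.
z_crit = norm.ppf(1 - alpha / 2)
reject_critical = abs(z) > z_crit

# Method 2: compare the two-sided P-value to alpha.
p_value = 2 * (1 - norm.cdf(abs(z)))
reject_pvalue = p_value < alpha

print(f"z = {z:.3f}, critical = {z_crit:.3f}, p = {p_value:.4f}")
assert reject_critical == reject_pvalue  # the two methods always agree
```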
Reject H₀:
Strong evidence against H₀ in favor of H₁
Fail to Reject H₀:
Insufficient evidence to reject H₀ (not proof of H₀)
Common Mistakes:
Treating "fail to reject H₀" as proof that H₀ is true; reading the P-value as the probability that H₀ is true
For testing simple hypotheses H₀: θ = θ₀ vs H₁: θ = θ₁, the likelihood ratio test is the most powerful test of size α.
Define the test function φ(x) = 1 if L₁(x) > k·L₀(x) and φ(x) = 0 if L₁(x) < k·L₀(x), with k (and, if needed, randomization on the boundary L₁(x) = k·L₀(x)) chosen so that E_θ₀[φ(X)] = α. Then φ is the most powerful level-α test for H₀ vs H₁.
Consider simple hypothesis testing: H₀: θ = θ₀ versus H₁: θ = θ₁. Let X = (X₁, ..., Xₙ) be the data vector with likelihood functions L₀(x) = L(θ₀; x) and L₁(x) = L(θ₁; x).
Define the rejection region based on likelihood ratio: D_k = {x : L₁(x)/L₀(x) > k}, where k is chosen to satisfy the size constraint. This forms the basis of the likelihood ratio test.
Choose k such that the test has exactly size α: P_θ₀(X ∈ D_k) = α. Under regularity conditions, there exists such a k. This ensures the Type I error rate is controlled at level α.
Let D' be any other rejection region satisfying P_θ₀(X ∈ D') ≤ α. We need to show that P_θ₁(X ∈ D_k) ≥ P_θ₁(X ∈ D'), i.e., the LRT has maximum power among all level-α tests.
Consider the integral ∫ [1_{D_k}(x) - 1_{D'}(x)] [L₁(x) - k·L₀(x)] dx. On D_k the factor L₁(x) - k·L₀(x) is positive and 1_{D_k} - 1_{D'} ≥ 0; outside D_k the factor is ≤ 0 and 1_{D_k} - 1_{D'} ≤ 0, so the integrand is non-negative everywhere.
Expanding the integral gives P_θ₁(X ∈ D_k) - P_θ₁(X ∈ D') ≥ k[P_θ₀(X ∈ D_k) - P_θ₀(X ∈ D')] = k[α - P_θ₀(X ∈ D')] ≥ 0, so P_θ₁(X ∈ D_k) ≥ P_θ₁(X ∈ D'). The likelihood ratio test is therefore most powerful for simple hypotheses (and trivially UMP, since the alternative consists of a single point).
For testing H₀: μ = 0 vs H₁: μ = 1 in N(μ, 1), the LRT reduces to rejecting H₀ when X̄ > c, which is the most powerful test.
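A Monte Carlo sketch can make this concrete. Under an assumed setup of n = 9 i.i.d. N(μ, 1) observations, the size-α LRT based on X̄ is compared against an equally sized but deliberately wasteful test that looks only at the first observation:

```python
# Monte Carlo sketch of the Neyman-Pearson lemma for H0: mu = 0 vs
# H1: mu = 1 with N(mu, 1) data: the likelihood-ratio test (reject when
# the sample mean exceeds c) beats another test of the same size that
# ignores most of the data.  n, alpha, and reps are assumed values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, alpha, reps = 9, 0.05, 100_000
x = rng.normal(loc=1.0, scale=1.0, size=(reps, n))  # data under H1

# Most powerful test: reject when xbar > z_alpha / sqrt(n)  (size alpha).
c_mp = norm.ppf(1 - alpha) / np.sqrt(n)
power_mp = np.mean(x.mean(axis=1) > c_mp)

# A competing size-alpha test that looks only at the first observation.
power_naive = np.mean(x[:, 0] > norm.ppf(1 - alpha))

print(f"LRT power   = {power_mp:.3f}")    # ~ Phi(3 - 1.645) ~ 0.91
print(f"naive power = {power_naive:.3f}")  # ~ Phi(1 - 1.645) ~ 0.26
```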
Under regularity conditions, the generalized likelihood ratio statistic -2 log Λ converges in distribution to a chi-square distribution as the sample size approaches infinity:
-2 log Λ →ᵈ χ²(r) as n → ∞, where r = dim(Θ) - dim(Θ₀) is the difference in parameter dimensions.
Consider the generalized likelihood ratio Λ(X) = L(θ̂₀; X) / L(θ̂; X), where θ̂ is the unrestricted MLE and θ̂₀ is the MLE under H₀: θ ∈ Θ₀. The statistic ranges from 0 to 1.
Consider the log-likelihood ratio: -2 log Λ = 2[ℓ(θ̂) - ℓ(θ̂₀)], where ℓ(θ) = log L(θ; X) is the log-likelihood. This transformation is monotone and more analytically tractable.
Expand ℓ(θ̂) and ℓ(θ̂₀) around the true θ₀ (assuming H₀ is true). Using Taylor's theorem to second order, we get quadratic forms involving the score and information matrix.
By the asymptotic normality of the MLE, we have √n(θ̂ - θ₀) →ᵈ N(0, I(θ₀)⁻¹), where I(θ₀) is the Fisher information matrix. This is a fundamental result in maximum likelihood theory.
The Fisher information matrix can be consistently estimated by the observed information. By LLN, the empirical information converges to the true Fisher information: În → I(θ₀) in probability.
Combining steps 3-5 with Slutsky's theorem, -2 log Λ is asymptotically a quadratic form in a multivariate normal vector, which follows a χ²(r) distribution, where r is the difference in dimensions between the full and restricted parameter spaces.
Testing H₀: μ₁ = μ₂ = μ₃ in three normal populations, -2 log Λ approximately follows χ²(2) for large samples.
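A sketch of this example, assuming known unit variances so that -2 log Λ has a closed form (it reduces to Σ nᵢ(X̄ᵢ - X̄)²); the group sizes and means are illustration values:

```python
# A sketch of Wilks' theorem for H0: mu1 = mu2 = mu3 with N(mu_i, 1)
# data (known unit variance, so -2 log Lambda has a closed form).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
groups = [rng.normal(loc=0.0, scale=1.0, size=40) for _ in range(3)]  # H0 true

# Unrestricted MLEs are the group means; the H0 MLE is the grand mean.
all_x = np.concatenate(groups)
grand_mean = all_x.mean()
neg2_log_lambda = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# r = dim(Theta) - dim(Theta0) = 3 - 1 = 2 parameters removed by H0.
p_value = chi2.sf(neg2_log_lambda, df=2)
print(f"-2 log Lambda = {neg2_log_lambda:.3f}, p = {p_value:.3f}")
```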
Standard tests for normal populations and common parameters
Scenario:
Testing population mean with known variance
Hypotheses:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀ (or a one-sided alternative)
Assumptions:
X₁, ..., Xₙ i.i.d. N(μ, σ²) with σ² known (or n large enough for the CLT)
Test Statistic:
Z = (X̄ - μ₀) / (σ/√n) ~ N(0, 1) under H₀
Rejection Regions:
Two-sided: |Z| > z_{α/2}; right-sided: Z > z_α; left-sided: Z < -z_α
Example Application:
Testing if mean height = 170cm with σ = 5cm known
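A minimal sketch of this height example; the sample itself is simulated (an assumption for illustration), while μ₀ = 170 and σ = 5 come from the scenario above:

```python
# A sketch of the z-test from the height example: H0: mu = 170 with
# sigma = 5 known; the data are simulated illustration values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu0, sigma, alpha = 170.0, 5.0, 0.05
heights = rng.normal(loc=172.0, scale=sigma, size=25)  # true mean 172

z = (heights.mean() - mu0) / (sigma / np.sqrt(len(heights)))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided

print(f"z = {z:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```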
Scenario:
Testing population mean with unknown variance
Hypotheses:
H₀: μ = μ₀ vs H₁: μ ≠ μ₀ (or a one-sided alternative)
Assumptions:
X₁, ..., Xₙ i.i.d. N(μ, σ²) with σ² unknown
Test Statistic:
T = (X̄ - μ₀) / (S/√n) ~ t(n-1) under H₀, where S is the sample standard deviation
Rejection Regions:
Two-sided: |T| > t_{α/2}(n-1); right-sided: T > t_α(n-1); left-sided: T < -t_α(n-1)
Example Application:
Testing if new teaching method improves test scores
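A sketch of the one-sample t-test using scipy.stats.ttest_1samp; the scores are simulated, and the baseline mean of 75 is an assumed illustration value:

```python
# A sketch of the one-sample t-test (sigma unknown); 75 stands in for
# the historical mean score, and the data are simulated illustrations.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(4)
scores = rng.normal(loc=78.0, scale=8.0, size=30)  # new-method scores

t_stat, p_value = ttest_1samp(scores, popmean=75.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # two-sided by default
```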
Scenario:
Comparing means with unknown but equal variances
Hypotheses:
H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂ (or a one-sided alternative)
Assumptions:
Independent samples from N(μ₁, σ²) and N(μ₂, σ²) with equal but unknown variance
Test Statistic:
T = (X̄ - Ȳ) / (S_w √(1/n₁ + 1/n₂)) ~ t(n₁ + n₂ - 2) under H₀
Pooled Variance:
S_w² = [(n₁ - 1)S₁² + (n₂ - 1)S₂²] / (n₁ + n₂ - 2)
Rejection Regions:
Two-sided: |T| > t_{α/2}(n₁ + n₂ - 2); one-sided analogously with t_α(n₁ + n₂ - 2)
Example Application:
Comparing test scores between two teaching methods
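A sketch of the pooled two-sample t-test via scipy.stats.ttest_ind with equal_var=True, matching the equal-variance assumption above; both score samples are simulated illustration values:

```python
# A sketch of the pooled two-sample t-test; equal_var=True selects the
# pooled-variance statistic described above.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
method_a = rng.normal(loc=75.0, scale=8.0, size=30)
method_b = rng.normal(loc=80.0, scale=8.0, size=35)

t_stat, p_value = ttest_ind(method_a, method_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```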
A general method for constructing hypothesis tests using likelihood functions
Motivation:
When optimal tests don't exist or are unknown, GLRT provides a systematic approach
Principle:
Compare maximum likelihood under full parameter space to maximum likelihood under null hypothesis constraint
Λ(x) = sup_{θ ∈ Θ₀} L(θ; x) / sup_{θ ∈ Θ} L(θ; x)
where the numerator maximizes the likelihood under the H₀ constraint and the denominator maximizes over the full parameter space; H₀ is rejected when Λ is small, equivalently when -2 log Λ is large.
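As a worked instance of this recipe, here is a sketch of the GLRT for H₀: λ = λ₀ in an Exponential(λ) model, where both maximizations have closed forms (λ̂ = 1/X̄ over the full space, λ₀ itself under H₀); λ₀ and the sample are assumed illustration values:

```python
# A sketch of the GLRT recipe for H0: lam = lam0 in an Exponential(lam)
# model, comparing -2 log Lambda to chi-square(1) per Wilks' theorem.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
lam0 = 1.0
x = rng.exponential(scale=1 / lam0, size=100)  # H0 true here

n, xbar = len(x), x.mean()
lam_hat = 1 / xbar  # unrestricted MLE

# log L(lam) = n log(lam) - lam * sum(x);  -2 log Lambda = 2[l(lam_hat) - l(lam0)]
neg2_log_lambda = 2 * (n * np.log(lam_hat / lam0) - (lam_hat - lam0) * n * xbar)
p_value = chi2.sf(neg2_log_lambda, df=1)  # r = 1 restricted parameter
print(f"-2 log Lambda = {neg2_log_lambda:.3f}, p = {p_value:.3f}")
```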
There's a one-to-one correspondence between confidence intervals and hypothesis tests at the same confidence/significance level
From acceptance regions to confidence sets
Explanation: The confidence set contains all parameter values that would not be rejected by the test
From confidence sets to acceptance regions
Explanation: Accept H₀: θ = θ₀ if and only if θ₀ lies within the confidence interval
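A sketch verifying this duality numerically for the one-sample t-test: for each candidate μ₀ (an assumed illustration grid), the test fails to reject at level α exactly when μ₀ lies inside the (1 - α) confidence interval:

```python
# A sketch of the duality: the two-sided t-test at level alpha rejects
# H0: mu = mu0 exactly when mu0 falls outside the (1 - alpha) t-based
# confidence interval.  Sample and mu0 grid are illustration values.
import numpy as np
from scipy.stats import ttest_1samp, t

rng = np.random.default_rng(7)
x, alpha = rng.normal(loc=10.0, scale=2.0, size=20), 0.05

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
half_width = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

for mu0 in (9.0, 10.0, 11.0, 12.0):
    p = ttest_1samp(x, popmean=mu0).pvalue
    inside = ci[0] <= mu0 <= ci[1]
    assert (p >= alpha) == inside  # duality: not rejected <=> inside CI
    print(f"mu0 = {mu0}: p = {p:.4f}, inside CI = {inside}")
```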
Practical applications of hypothesis testing across different domains
The null hypothesis H₀ is the hypothesis we try to challenge (usually representing "no effect" or "no difference"), while the alternative hypothesis H₁ is what we seek evidence to support. In hypothesis testing, we always start from the premise "assume H₀ is true," then see if the data provides strong enough evidence to reject it. This asymmetry reflects the "skepticism" principle in the scientific method.
This stems from the philosophical foundation of the Neyman-Pearson principle. Type I error (rejecting true H₀) usually has more serious consequences because it means we incorrectly claim to have discovered some effect. Type II error (failing to reject false H₀) merely means we haven't found sufficient evidence. In scientific research, we prefer "better to miss than to wrongly assert."
This is one of the most common misunderstandings in hypothesis testing. "Fail to reject H₀" only means the data didn't provide strong enough evidence to refute H₀, not that H₀ is necessarily true. This is like in court "insufficient evidence" ≠ "innocent." We can never prove H₀ is true, only say the data is compatible with H₀.
Key Point: Absence of evidence is not evidence of absence
This depends on your research question. If you only care whether the parameter deviates in one direction (e.g., "does the new drug improve efficacy?"), use a one-sided test. If you care whether the parameter differs from a value in either direction, use a two-sided test. Principle: decide based on the substantive research question, not the data, and fix the choice before seeing the data.
Comparison: One-sided tests have higher power but can only detect differences in one direction; two-sided tests are more conservative but can detect both directions
The P-value is the probability of observing the current data or more extreme data under the assumption that H₀ is true. α is the threshold we set beforehand. Decision rule: if P-value < α, reject H₀. Note that the P-value is not "the probability that H₀ is true" (that's a Bayesian posterior probability concept).
The key is whether the population variance is known. If the population variance σ² is known, use the z-test (also called the U-test); if σ² is unknown and must be estimated by the sample variance, use the t-test. In practice the population variance is usually unknown, so the t-test is more common. When the sample size is large (n > 30), the t-distribution approximates the normal distribution and the two tests give similar results.
This is mainly a historical convention rather than a mathematical necessity. R.A. Fisher proposed 0.05 as a "suspicious" threshold in the 1920s, and it later became the conventional standard. In practice, the choice of α should be based on the characteristics of the field and the costs of errors: medical research often uses 0.01 (stricter), while exploratory research may use 0.10 (more relaxed). Importantly, fix α before data collection and state it clearly in reports.
Historical Note: Fisher originally described 0.05 as a "convenient approximation," not an absolute standard
They have a precise duality relationship. At the same significance level, if the parameter value θ₀ falls within the (1-α) confidence interval, then we cannot reject H₀: θ = θ₀ at significance level α, and vice versa. Confidence intervals provide more information than hypothesis tests: they not only tell us whether to reject a specific value but also give the range of all plausible parameter values.