Lesson 5-2: Hypothesis Testing & Significance

Learning Goals

Set up H0 and Ha with precise parameter statements.
Compute test statistics for mean and proportion scenarios.
Interpret p-values and connect them to α and decisions.
Explain Type I/II errors and their practical costs.

Assumptions

Random sampling and independence (or justified design).
Approximate normality for test statistic (CLT or conditions).
Clear definition of α (e.g., 0.05) before viewing data.

Structure of a Test

Hypotheses

Specify the parameter: mean $\mu$ or proportion $p$. Then define a null hypothesis $H_0$ and an alternative $H_a$.

$$H_0: \; p=p_0 \quad \text{vs} \quad H_a: \; p>p_0,\; p<p_0,\; \text{or}\; p\ne p_0$$

$$H_0: \; \mu=\mu_0 \quad \text{vs} \quad H_a: \; \mu>\mu_0,\; \mu<\mu_0,\; \text{or}\; \mu\ne \mu_0$$

Workflow

State the parameter and define H0 and Ha (one- or two-sided).
Check conditions and assumptions for the chosen test.
Compute test statistic and p-value.
Compare p-value with significance level α and conclude.
Interpret the decision in the real context.

Test Statistics

Proportion (Large n)

$$z = \dfrac{ \hat{p} - p_0 }{ \sqrt{ \dfrac{p_0(1-p_0)}{n} } }$$

Under $H_0$, the standard error is computed with $p_0$ in place of $p$.

Mean (Unknown SD, Large n)

$$z = \dfrac{ \bar{x} - \mu_0 }{ s/\sqrt{n} }$$

When n is large, the z approximation is often reasonable by the CLT.

p-Value

The p-value is the probability, under $H_0$, of seeing a test statistic at least as extreme as the observed one. Smaller p-values provide stronger evidence against $H_0$.
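As a rough illustration of this definition, the sketch below turns a z statistic into a p-value for each form of the alternative. The function name `p_value_from_z` and the `alternative` labels are assumptions made for this example, and the calculation relies on the standard normal approximation.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """Return the p-value of a z statistic under a standard normal null.

    `alternative` is a hypothetical label: "greater", "less" or "two-sided".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # Ha: parameter > null value
        return 1.0 - std_normal.cdf(z)
    if alternative == "less":         # Ha: parameter < null value
        return std_normal.cdf(z)
    if alternative == "two-sided":    # Ha: parameter != null value
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# A two-sided test with z = 1.96 gives a p-value of roughly 0.05.
print(round(p_value_from_z(1.96), 4))
```

"At least as extreme" depends on the direction of $H_a$, which is why the same z can give different p-values for one-sided and two-sided tests.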

Example: Proportion, One-Sided

A factory claims a defect rate of $p \le 0.03$. In a sample of 200, 10 are defective. Test $H_0: p=0.03$ vs $H_a: p>0.03$ at $\alpha=0.05$.

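$$\hat{p}=10/200=0.05, \; SE=\sqrt{ \dfrac{0.03\cdot0.97}{200} } \approx 0.01206$$

$$z= \dfrac{0.05-0.03}{0.01206} \approx 1.66$$

The one-sided p-value is about 0.049, just under $\alpha=0.05$, so the result is borderline; report the exact p-value and the context before making decisions. Note also that $np_0 = 200\cdot0.03 = 6 < 10$, so the large-sample condition is not met and the normal approximation is rough here; an exact binomial test is a safer check.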
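The arithmetic in this example can be reproduced with a short sketch. The helper names `one_proportion_z_test` and `exact_upper_tail` are hypothetical, and the exact binomial tail is an added cross-check rather than part of the z procedure. Unrounded values may differ in the last digit from hand-rounded figures.

```python
from math import comb, sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Upper-tailed z test of H0: p = p0 against Ha: p > p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)           # standard error computed under H0
    z = (p_hat - p0) / se
    p_value = 1.0 - NormalDist().cdf(z)    # upper-tail area
    return p_hat, se, z, p_value

def exact_upper_tail(successes, n, p0):
    """Exact binomial P(X >= successes) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

p_hat, se, z, p_value = one_proportion_z_test(10, 200, 0.03)
print(f"p_hat={p_hat:.3f} SE={se:.5f} z={z:.2f} p={p_value:.3f}")
print(f"exact binomial upper tail: {exact_upper_tail(10, 200, 0.03):.3f}")
```

With an expected count of only 6 defects under $H_0$, the exact tail probability (about 0.08) comes out noticeably larger than the normal-approximation p-value, which is one more reason to treat this borderline result with caution.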

Example: Mean, Two-Sided

The standard lifetime is 1,000 hours. A sample of n=36 gives $\bar{x}=980$ hours and $s=120$ hours. Test $H_0: \mu=1000$ vs $H_a: \mu \ne 1000$.

$$z= \dfrac{980-1000}{120/6} = \dfrac{-20}{20} = -1$$

The two-sided p-value is approximately 0.317. At $\alpha=0.05$, fail to reject $H_0$.
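The same calculation can be checked in code. This is a minimal sketch using the large-sample z approximation from this lesson; the function name `one_sample_mean_z_test` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_mean_z_test(x_bar, mu0, s, n):
    """Two-sided large-sample z test of H0: mu = mu0."""
    se = s / sqrt(n)                        # estimated standard error of the mean
    z = (x_bar - mu0) / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

z, p = one_sample_mean_z_test(x_bar=980, mu0=1000, s=120, n=36)
print(f"z={z:.2f}, two-sided p={p:.3f}")   # z=-1.00, two-sided p=0.317
```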

Errors and Test Power

Type I and Type II

Type I: reject $H_0$ when it is true (probability α).
Type II: fail to reject $H_0$ when it is false (probability β).
Power: $1-\beta$; increases with larger n, a larger true effect size, or a larger α.
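To make the link between power, n, and effect size concrete, here is a sketch for an upper-tailed z test of a mean. It assumes a known standard deviation `sigma`, which is a simplification, and the name `power_upper_tailed_z` and the numbers in the loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tailed_z(delta, sigma, n, alpha=0.05):
    """Power of the upper-tailed z test of H0: mu = mu0
    when the true mean is mu0 + delta (sigma assumed known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)   # reject H0 when z > z_crit
    shift = delta * sqrt(n) / sigma            # mean of z under the true mean
    return 1.0 - std_normal.cdf(z_crit - shift)

# Power grows with n for a fixed hypothetical effect of 5 units (sigma = 20).
for n in (25, 50, 100, 200):
    print(n, round(power_upper_tailed_z(delta=5, sigma=20, n=n), 3))
```

When `delta` is 0 the function returns α, the Type I error rate: the probability of rejecting when $H_0$ is true.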

Design Considerations

Choose α based on real costs of errors.
Increase n to reduce SE and improve power.
Use one-sided tests only when justified by context.

Guided Practice

Set 1: One-Sided Mean Test

Parameter μ: population mean weight. H₀: μ = 50, Hₐ: μ > 50.
Check: random sample; n = 40 > 30, so the CLT approximation is reasonable.
z = (x̄-50)/(s/√n), p-value = P(Z > z) for upper tail.
If the p-value < 0.05, reject H₀: there is evidence that the mean weight exceeds 50.

Set 2: Two-Sided Proportion Test

Parameter p: defect rate. H₀: p = 0.05, Hₐ: p ≠ 0.05.
Check: np₀≥10, n(1-p₀)≥10 with p₀=0.05.
z = (p̂-0.05)/√(0.05×0.95/n), p-value = 2P(Z > |z|).
Two-sided test: reject H₀ if the p-value < 0.05 and conclude that the defect rate differs from 5%.

Set 3: Type I & II Errors

Medical test: H₀: no disease, Hₐ: disease present.
Type I: false positive (healthy→diagnosed sick).
Type II: false negative (sick→diagnosed healthy).
α sets the Type I error rate; β is the Type II error rate, and power = 1-β.
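The claim that α fixes the Type I error rate can be checked by simulation. The sketch below repeatedly samples from a population where $H_0$ is true and counts how often a two-sided z test rejects. All settings (normal data, known `sigma`, `n=50`, 2000 repetitions, the seed) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulated_type1_rate(n=50, reps=2000, alpha=0.05, mu0=5.0, sigma=1.0, seed=0):
    """Fraction of simulated samples (drawn with H0 true) in which H0 is rejected."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]   # H0 is true here
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))        # sigma treated as known
        if abs(z) > z_crit:
            rejections += 1                                   # a false positive
    return rejections / reps

print(simulated_type1_rate())
```

The printed fraction should land near α = 0.05, up to simulation noise; every rejection counted here is a false positive because the data were generated under $H_0$.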

Set 4: P-value Interpretation

Parameter μ: mean response time. Test H₀: μ = 5.0 seconds.
Compute p-value from sample data and test statistic.
p-value = probability, assuming H₀ is true, of observing data at least as extreme as the sample.
Small p-value (< α) provides evidence against H₀.

Set 5: Power Analysis

Test design: detect 10% improvement in success rate.
Calculate required sample size for 80% power at α=0.05.
Power = P(reject H₀ | Hₐ true), depends on effect size and n.
Higher power requires a larger sample, a larger true effect, or a less stringent (larger) α.
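One common normal-approximation formula for this kind of design is sketched below. It assumes two equal-sized groups, a two-sided test, and reads "10% improvement" as a 10-percentage-point gain from a hypothetical baseline of 0.50; none of these specifics come from the exercise itself, and the function name is made up for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-proportion z test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)   # critical value for alpha
    z_beta = std_normal.inv_cdf(power)                # quantile for target power
    p_bar = (p1 + p2) / 2.0                           # average rate under H0
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical baseline of 0.50 improving to 0.60.
print(n_per_group_two_proportions(0.50, 0.60))
```

Halving the detectable difference roughly quadruples the required n, since the difference enters the formula squared.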

Set 6: Multiple Testing

Three treatments compared: family-wise error rate concern.
Bonferroni correction: use α/3 for each individual test.
Keeps the family-wise Type I error rate at or below 5%.
Trade-off: reduced power for individual comparisons.
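A minimal sketch of the Bonferroni rule for the three comparisons follows. The p-values are hypothetical and the function name `bonferroni_decisions` is an assumption for this example.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Flag which tests are rejected when each is run at level alpha / m."""
    m = len(p_values)
    per_test_alpha = alpha / m            # stricter threshold for each test
    return per_test_alpha, [p < per_test_alpha for p in p_values]

# Hypothetical p-values for the three pairwise comparisons.
per_test_alpha, rejected = bonferroni_decisions([0.010, 0.030, 0.200])
print(round(per_test_alpha, 4), rejected)   # 0.0167 [True, False, False]
```

The second comparison (p = 0.030) would be rejected at an unadjusted 0.05 but not after the correction, which is the power trade-off in action.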

Two-Sample Tests (Means & Proportions)

Proportions

$$z = \dfrac{ \hat{p}_1-\hat{p}_2 }{ \sqrt{ \hat{p}(1-\hat{p})(1/n_1+1/n_2) } }$$

Here $\hat{p} = (x_1+x_2)/(n_1+n_2)$ is the pooled estimate under $H_0: p_1=p_2$, where $x_i$ is the number of successes in sample $i$.
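A sketch of the pooled two-proportion z test is given below. The counts are hypothetical and the function name `two_proportion_z_test` is an assumption; the pooled estimate is total successes divided by total sample size.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z test of H0: p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60 of 200 successes versus 45 of 200.
z, p_value = two_proportion_z_test(60, 200, 45, 200)
print(f"z={z:.2f}, two-sided p={p_value:.3f}")
```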

Means (Large n)

$$z = \dfrac{ (\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)_0 }{ \sqrt{ s_1^2/n_1 + s_2^2/n_2 } }$$

Use the z approximation when both samples are large; small samples require a t test with Satterthwaite degrees of freedom.
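The sketch below computes the large-sample z statistic and also reports the Welch-Satterthwaite degrees of freedom that a small-sample t test would use. The summary statistics are hypothetical and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_mean_z(x1_bar, s1, n1, x2_bar, s2, n2, null_diff=0.0):
    """Large-sample z test for mu1 - mu2, plus Welch-Satterthwaite df."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # estimated variance of each sample mean
    se = sqrt(v1 + v2)
    z = ((x1_bar - x2_bar) - null_diff) / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # Degrees of freedom for a t reference distribution when samples are small.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return z, p_two_sided, df

# Hypothetical summaries: group 1 (mean 52, s 8, n 64), group 2 (mean 49, s 10, n 100).
z, p, df = two_sample_mean_z(52.0, 8.0, 64, 49.0, 10.0, 100)
print(f"z={z:.2f}, two-sided p={p:.3f}, Welch df={df:.1f}")
```

With a df this large the t and z reference distributions are nearly identical, which is why the z approximation is acceptable for large samples.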

Power & Sample Size

For a minimally interesting effect size Δ, pick n to achieve target power 1-β at significance α.

Trade-off: larger n → smaller SE → higher power.
Decide one- vs two-sided based on context before seeing data.
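For a one-sample mean test, a standard normal-approximation sample-size formula is $n \approx \left( (z_{1-\alpha/2} + z_{1-\beta})\,\sigma / \Delta \right)^2$ (with $z_{1-\alpha}$ for a one-sided test). The sketch below assumes a planning value for σ, which is not given in this lesson, and the function name `n_for_mean_test` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean_test(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Approximate n so a z test for a mean detects a shift of delta."""
    std_normal = NormalDist()
    tail = alpha / 2.0 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1.0 - tail)     # critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for target power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical planning values: detect a shift of half a standard deviation.
print(n_for_mean_test(delta=0.5, sigma=1.0))                    # 32
print(n_for_mean_test(delta=0.5, sigma=1.0, two_sided=False))   # 25
```

The one-sided design needs fewer observations, which is exactly why the direction has to be fixed before seeing the data.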

Confidence Intervals and Tests

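For two-sided α=0.05 tests, rejecting $H_0$ is equivalent to the hypothesized value lying outside the 95% CI, provided the test and the interval use the same standard error. For proportions, the test uses $p_0$ in the SE while the usual interval uses $\hat{p}$, so the two can disagree near the boundary.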
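This correspondence can be checked numerically for the mean example in this lesson (n=36, $\bar{x}=980$, $s=120$, $\mu_0=1000$). The function name `z_test_and_ci` is an assumption; the test and the interval share the same standard error $s/\sqrt{n}$, so their decisions should agree.

```python
from math import sqrt
from statistics import NormalDist

def z_test_and_ci(x_bar, s, n, mu0, alpha=0.05):
    """Return the two-sided z-test decision, the matching CI,
    and whether mu0 falls outside that CI."""
    se = s / sqrt(n)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z = (x_bar - mu0) / se
    reject = abs(z) > z_crit
    ci = (x_bar - z_crit * se, x_bar + z_crit * se)
    outside = not (ci[0] <= mu0 <= ci[1])
    return reject, ci, outside

reject, ci, outside = z_test_and_ci(980, 120, 36, 1000)
print(reject, tuple(round(b, 1) for b in ci), outside)
```

Here the 95% interval is roughly (940.8, 1019.2); it contains 1000, matching the decision not to reject $H_0$.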

Practice Bank

Bank A: One-Sample z (Proportion)

State H₀, Hₐ
Compute z and p
Conclude at α=0.05 with context

Bank B: One-Sample z (Mean, large n)

Check CLT
Compute z
Interpret two-sided p

Bank C: Two-Sample Proportions

Pooled p̂ under H₀
z statistic
Decision and CI connection

Bank D: Two-Sample Means

Large n z-approx
Discuss small-n t alternative
Report effect size

Bank E: One-Sided vs Two-Sided

When one-sided is justified
Pre-register direction
Cautions on p-hacking

Bank F: Power & Sample Size

Target Δ and β
Approximate n for 80% power
Trade-offs

Bank G: Multiple Testing

Bonferroni control
FWER vs FDR concepts
Interpretation pitfalls

Bank H: Practical Significance

Compare CI with meaningful thresholds
Distinguish statistical vs practical

Bank I: Assumptions Audit

Randomness/independence
Approximate normality
Outliers/robustness

Bank J: Reporting

Report exact p and CI
Describe effect size
State limitations

FAQ (Extended)

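Q: Is there a fundamental difference between p=0.049 and p=0.051?

No, the difference is not fundamental; reporting the exact p-value together with the effect size and a confidence interval is more informative.

Q: When should a one-sided test be used?

Only when a clear one-directional hypothesis was set before the study and an effect in the opposite direction would be meaningless.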

Mini Projects

Project A: Manufacturing Defects

Define p and target p₀
Choose α and tail direction
Collect data and report exact p with CI

Project B: Marketing Uplift

Two-sample proportion test
Compute pooled SE and z
Discuss power for given Δ

Project C: Mean Response Time

One-sample mean test
Check CLT or justify normal
Report effect size (Cohen's d)

Project D: A/B Experiment

Random assignment
Pre-register metric and α
Analyse and share reproducible report

Project E: Medical Screening

Type I/II costs
Pick α to balance risks
Include sensitivity analysis