Connect real experiments to mathematical inference. Learn templates for two-sample, paired, and proportion tests with clear interpretation of p-values and power.
Randomization, control, replication, blinding, blocking
Null/alternative, Type I/II, one- vs two-sided
Interpretation, planning for target power
Two-sample means, paired means, two proportions
Random assignment of subjects to treatment groups removes selection bias and balances confounding variables, known and unknown alike, across groups in expectation.
A control group receives no treatment or a placebo, providing a baseline for comparison to isolate the effect of the treatment.
Multiple subjects per group increase statistical power and allow estimation of within-group variability.
Single-blind: subjects don't know their treatment; double-blind: neither subjects nor researchers know assignments.
Null Hypothesis (H₀): The default assumption of no effect or no difference
Alternative Hypothesis (H₁): The claim we seek evidence for
Type I Error (α)
Rejecting H₀ when it's true.
Type II Error (β)
Failing to reject H₀ when it's false.
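Type I error can be seen directly by simulation: draw both groups from the same distribution so that H₀ is true by construction, and count how often the test rejects anyway. This is a sketch using only Python's standard library; the 2.0 cutoff is an approximation to the two-sided 5% critical value for these sample sizes.

```python
import math
import random
import statistics

random.seed(0)  # reproducible runs

trials, rejections = 2000, 0
for _ in range(trials):
    # H0 is true by construction: both samples come from N(0, 1)
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    se = math.sqrt(statistics.variance(a) / 30 + statistics.variance(b) / 30)
    t = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(t) > 2.0:  # ≈ two-sided 5% critical value here
        rejections += 1

rate = rejections / trials  # lands near alpha = 0.05
```

Every rejection in this simulation is a false positive, so the long-run rejection rate is exactly what "Type I error rate" means.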
Compare means of two independent groups:
t = (x̄₁ − x̄₂) / SE
where SE = √(s₁²/n₁ + s₂²/n₂), with degrees of freedom from the Welch–Satterthwaite approximation
Compare means of paired observations (before/after, matched pairs):
t = d̄ / (s_d / √n)
where d̄ and s_d are the mean and standard deviation of the differences, and degrees of freedom = n − 1
Compare proportions from two independent samples:
z = (p̂₁ − p̂₂) / √(p̂(1 − p̂)(1/n₁ + 1/n₂))
where p̂ = (x₁ + x₂)/(n₁ + n₂) (pooled proportion)
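All three statistics can be computed directly from summary data. A minimal standard-library sketch (function names are illustrative; the two-sample version uses the unpooled Welch standard error):

```python
import math

def two_sample_t(m1, s1, n1, m2, s2, n2):
    """Welch two-sample t statistic and its standard error."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / se, se

def paired_t(d_bar, s_d, n):
    """Paired t statistic; degrees of freedom = n - 1."""
    return d_bar / (s_d / math.sqrt(n))

def two_prop_z(p1, n1, p2, n2):
    """Two-proportion z statistic using the pooled proportion."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

Feeding in the summary numbers from the examples below reproduces their test statistics.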
P-value: The probability of observing a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis is true.
Interpretation: If p < α (typically 0.05), we reject H₀. If p ≥ α, we fail to reject H₀.
Common Misconception: The p-value is NOT the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false.
Evaluate an educational intervention using paired scores. Compute differences, check normality (roughly), and test mean difference.
Steps: 1) Compute d_i = post_i - pre_i 2) t = mean(d)/[sd(d)/sqrt(n)], df = n-1 3) Report two-sided p and 95% CI for mean(d)
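The three steps can be sketched directly from raw scores. A toy illustration with standard-library Python; `paired_test` and the sample scores are made up for the example:

```python
import math
import statistics

def paired_test(pre, post):
    """Steps from the text: differences, t statistic, df."""
    d = [b - a for a, b in zip(pre, post)]  # step 1: d_i = post_i - pre_i
    n = len(d)
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)
    t = d_bar / (s_d / math.sqrt(n))        # step 2
    return d_bar, s_d, t, n - 1             # step 3 needs a t table for p and CI

# Hypothetical before/after scores for four subjects
print(paired_test([10, 12, 9, 11], [13, 15, 10, 14]))
```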
Group A: n₁=25, x̄₁=85, s₁=12; Group B: n₂=30, x̄₂=78, s₂=15
1. SE = √(12²/25 + 15²/30) = √(5.76 + 7.5) = 3.64
2. t = (85-78)/3.64 = 1.92
3. df ≈ 53 (Welch–Satterthwaite), p-value ≈ 0.06 (two-tailed)
4. 95% CI: 7 ± 2.01×3.64 = (−0.32, 14.32); the interval includes zero, consistent with p > 0.05
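The arithmetic above can be checked in a few lines, under the same assumptions (Welch standard error, t* ≈ 2.01 from a t table):

```python
import math

n1, m1, s1 = 25, 85.0, 12.0  # Group A
n2, m2, s2 = 30, 78.0, 15.0  # Group B

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error ≈ 3.64
t = (m1 - m2) / se                        # t ≈ 1.92
# Welch–Satterthwaite degrees of freedom
df = se**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
# 95% CI with t* ≈ 2.01 from a t table
ci = (m1 - m2 - 2.01 * se, m1 - m2 + 2.01 * se)
```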
Before/after scores: d̄ = 3.2, s_d = 2.8, n = 15
1. t = 3.2/(2.8/√15) = 4.43
2. df = 14, p-value < 0.001 (two-tailed)
3. 95% CI: 3.2 ± 2.145×0.72 = (1.65, 4.75)
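The same kind of check for the paired example, using t* = 2.145 for df = 14 as in the text:

```python
import math

d_bar, s_d, n = 3.2, 2.8, 15

se = s_d / math.sqrt(n)                        # ≈ 0.72
t = d_bar / se                                  # ≈ 4.43
ci = (d_bar - 2.145 * se, d_bar + 2.145 * se)  # 95% CI, df = 14
```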
Report template:
- Design: two-sample (independent) / paired
- Assumptions: ...
- Estimate (CI): ...
- Test: statistic, df, p-value
- Effect size: Cohen's d / difference in proportions
- Practical meaning: ...
Given: Group 1: n₁=20, x̄₁=45, s₁=8; Group 2: n₂=25, x̄₂=38, s₂=10
Test: H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂
Calculate standard error: SE = √(8²/20 + 10²/25) = √(3.2 + 4) = 2.68
Calculate t-statistic: t = (45 − 38)/2.68 = 2.61
Find p-value and make decision:
df ≈ 43, p-value ≈ 0.012 (two-tailed)
Conclusion: Since p < 0.05, we reject H₀. There is significant evidence that the means differ.
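A quick check of the numbers in standard-library Python (the unpooled Welch standard error is assumed, as in the worked example above):

```python
import math

n1, m1, s1 = 20, 45.0, 8.0
n2, m2, s2 = 25, 38.0, 10.0

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # ≈ 2.68
t = (m1 - m2) / se                        # ≈ 2.61
```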
Given: Paired data: d̄ = 2.5, s_d = 1.8, n = 12
Test: H₀: μ_d = 0 vs H₁: μ_d > 0
Calculate t-statistic: t = d̄/(s_d/√n) = 2.5/(1.8/√12) = 4.81
Find p-value and make decision:
df = 11, p-value < 0.001 (one-tailed)
Conclusion: Since p < 0.001, we reject H₀. There is strong evidence that the mean difference is greater than zero.
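The paired statistic from the given summary values:

```python
import math

d_bar, s_d, n = 2.5, 1.8, 12
t = d_bar / (s_d / math.sqrt(n))  # ≈ 4.81, with df = n - 1 = 11
```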
Given: p̂₁ = 0.6 (n₁=100), p̂₂ = 0.45 (n₂=120)
Test: H₀: p₁ = p₂ vs H₁: p₁ ≠ p₂
Calculate pooled proportion: p̂ = (0.6×100 + 0.45×120)/(100 + 120) = 114/220 ≈ 0.518
Calculate z-statistic: z = (0.60 − 0.45)/√(0.518 × 0.482 × (1/100 + 1/120)) = 0.15/0.068 ≈ 2.22
Find p-value and make decision:
p-value ≈ 0.027 (two-tailed)
Conclusion: Since p < 0.05, we reject H₀. There is significant evidence that the proportions differ.
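Because the z test uses the normal distribution, its p-value needs no table: the standard normal CDF can be written with `math.erf`. A sketch reproducing the calculation:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p1, n1 = 0.60, 100
p2, n2 = 0.45, 120

p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)  # 114/220 ≈ 0.518
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                         # ≈ 2.22
p_value = 2 * (1 - norm_cdf(z))            # two-tailed
```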
Power (1 - β): The probability of correctly rejecting the null hypothesis when it's false.
Factors affecting power: effect size δ, sample size n, significance level α, and variability σ. A larger true effect, a larger sample, a higher α, or a smaller σ all increase power.
Power calculation for two-sample t-test (normal approximation):
Power ≈ Φ(δ/(σ√(2/n)) − z_{1−α/2})
where Φ is the standard normal CDF and n is the per-group sample size.
Before conducting an experiment, determine the required sample size to achieve desired power:
For two-sample t-test:
n per group = 2(z_{1−α/2} + z_{1−β})² σ² / δ²
where δ is the minimum detectable effect size, σ is the standard deviation, and the z values correspond to the desired α and β levels.
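The sample-size formula turns into code once the normal quantiles are available. The `norm_ppf` bisection helper below is an illustrative stand-in for a statistics library's inverse-CDF function:

```python
import math

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (illustrative helper)."""
    cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """n = 2(z_{1-a/2} + z_{1-b})^2 sigma^2 / delta^2, rounded up."""
    z_a = norm_ppf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_b = norm_ppf(power)          # ≈ 0.84 for power = 0.80
    return math.ceil(2 * (z_a + z_b)**2 * sigma**2 / delta**2)
```

For example, detecting δ = 5 with σ = 10 (a medium effect, d = 0.5) at α = 0.05 and 80% power requires about 63 subjects per group under this normal approximation; exact t-based planning gives a slightly larger n.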
Measures the standardized difference between two means:
d = (x̄₁ − x̄₂) / s_pooled
where s_pooled = √(((n₁−1)s₁² + (n₂−1)s₂²)/(n₁ + n₂ − 2))
- Small effect: d = 0.2
- Medium effect: d = 0.5
- Large effect: d = 0.8
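Cohen's d computed from summary statistics with the pooled standard deviation (a sketch; plugging in the earlier Group A/B numbers gives d ≈ 0.51, a medium effect):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using the pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                         / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled
```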
Report confidence intervals alongside p-values to provide information about precision and practical significance:
95% CI for difference in means:
(x̄₁ − x̄₂) ± t* × √(s₁²/n₁ + s₂²/n₂)
where t* is the critical value for the chosen confidence level and degrees of freedom.
Interpretation: If the confidence interval excludes zero, it provides evidence for a significant difference. The width of the interval indicates the precision of the estimate.
Conducting multiple tests without adjusting significance levels inflates Type I error rates. Use methods like Bonferroni correction or false discovery rate control when testing multiple hypotheses.
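A Bonferroni correction is one line: reject each H₀ only if its p-value is below α/m, where m is the number of tests (a sketch; `bonferroni` is an illustrative helper name):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test only if p < alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

With three tests the per-test threshold drops from 0.05 to about 0.0167, so a p-value of 0.03 that looks "significant" alone no longer qualifies.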
A statistically significant result (p < 0.05) doesn't guarantee practical importance. Always consider effect sizes and confidence intervals to assess practical significance.
Forming hypotheses after seeing the data (HARKing) invalidates the statistical framework. Hypotheses should be stated before data collection and analysis.
Hypothesis testing provides a systematic framework for evaluating claims using data. The process involves stating hypotheses, choosing appropriate tests, computing test statistics, finding p-values, making decisions, and reporting confidence intervals with effect sizes.
Key principles: Proper experimental design (randomization, controls, replication, blinding) is essential for valid inference. Always consider both statistical and practical significance, and be aware of common pitfalls like multiple testing and p-hacking.
Best practices: Plan sample sizes based on power analysis, report effect sizes and confidence intervals, and interpret results in the context of the research question and domain knowledge.