Lesson 3-2: Experimental Design & Hypothesis Testing

Connect real experiments to mathematical inference. Learn templates for two-sample, paired, and proportion tests with clear interpretation of p-values and power.

Learning Objectives

Design Principles

Randomization, control, replication, blinding, blocking

Hypotheses & Errors

Null/alternative, Type I/II, one- vs two-sided

p-Values & Power

Interpretation, planning for target power

Test Templates

Two-sample means, paired means, two proportions

Core Knowledge Points

Experimental Design Principles

Randomization

Random assignment of subjects to treatment groups eliminates selection bias and ensures that confounding variables are distributed randomly across groups.

Control Groups

A control group receives no treatment or a placebo, providing a baseline for comparison to isolate the effect of the treatment.

Replication

Multiple subjects per group increase statistical power and allow estimation of within-group variability.

Blinding

Single-blind: subjects don't know their treatment; double-blind: neither subjects nor researchers know assignments.

Hypothesis Testing Framework

Null and Alternative Hypotheses

Null Hypothesis (H₀): The default assumption of no effect or no difference

Alternative Hypothesis (H₁): The claim for which we seek evidence

Type I and Type II Errors

Type I Error (α)

Rejecting H₀ when it's true: P(Type I error) = α

Type II Error (β)

Failing to reject H₀ when it's false: P(Type II error) = β

Statistical Test Procedures

Two-Sample t-Test

Compare means of two independent groups:

t = \frac{\bar{x}_1 - \bar{x}_2}{SE}

where SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
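
As a minimal sketch (assuming SciPy is available), the Welch version of this test can be run directly from summary statistics; the numbers below are the ones used in Practice Problem 1 later in this lesson:

```python
# Welch two-sample t-test from summary statistics (SciPy assumed available).
from scipy import stats

# Summaries from Practice Problem 1 below
mean1, s1, n1 = 45.0, 8.0, 20
mean2, s2, n2 = 38.0, 10.0, 25

# equal_var=False requests the Welch test, matching the SE formula above
t_stat, p_value = stats.ttest_ind_from_stats(mean1, s1, n1, mean2, s2, n2,
                                             equal_var=False)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.3f}")  # t ≈ 2.61, p ≈ 0.012
```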

Paired t-Test

Compare means of paired observations (before/after, matched pairs):

t = \frac{\bar{d}}{s_d/\sqrt{n}}

where d_i = x_{i,2} - x_{i,1} and degrees of freedom = n - 1
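
A minimal sketch with SciPy; the before/after scores here are invented purely for illustration:

```python
# Paired t-test on hypothetical before/after scores (SciPy assumed available).
import numpy as np
from scipy import stats

before = np.array([72, 65, 80, 75, 68, 70, 77, 74])
after  = np.array([75, 70, 82, 78, 69, 74, 80, 76])

# Tests H0: mean(after - before) = 0 with df = n - 1
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}, df = {len(before) - 1}")
```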

Two-Proportion z-Test

Compare proportions from two independent samples:

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

where \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} (pooled proportion)
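
The same calculation from scratch, using the counts from Practice Problem 3 below (SciPy is assumed only for the normal tail probability):

```python
# Pooled two-proportion z-test computed from scratch.
from math import sqrt
from scipy.stats import norm

x1, n1 = 60, 100   # successes / trials in sample 1
x2, n2 = 54, 120   # successes / trials in sample 2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled estimate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                        # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.3f}")             # z ≈ 2.22, p ≈ 0.027
```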

P-values and Statistical Significance

P-value: The probability of observing a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis is true.

Interpretation: If p < α (typically 0.05), we reject H₀. If p ≥ α, we fail to reject H₀.

Common Misconception: The p-value is NOT the probability that the null hypothesis is true, nor the probability that the alternative is false; it is a probability about the data, computed assuming H₀ holds.

Worked Example — Two-Sample Means

  • Group A: n₁ = 40, mean = 15, s₁ = 4; Group B: n₂ = 45, mean = 12, s₂ = 3
  • SE = \sqrt{4^2/40 + 3^2/45} \approx 0.775
  • z ≈ (15 − 12)/0.775 ≈ 3.87 → two-sided p-value ≪ 0.01 → reject H₀
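
A quick numeric check of these values (SciPy assumed available):

```python
# Numeric check of the large-sample z calculation above
from math import sqrt
from scipy.stats import norm

se = sqrt(4**2 / 40 + 3**2 / 45)    # ≈ 0.775
z = (15 - 12) / se                  # ≈ 3.87
print(f"z = {z:.2f}, two-sided p = {2 * norm.sf(z):.5f}")  # p ≈ 0.0001
```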

Case Study — Paired Pre/Post

Evaluate an educational intervention using paired scores. Compute the differences, check that they are roughly normal, and test whether the mean difference is zero.

Steps:
1) Compute d_i = post_i - pre_i
2) t = mean(d)/[sd(d)/sqrt(n)], df = n-1
3) Report two-sided p and 95% CI for mean(d)
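
These three steps, sketched in Python; the pre/post arrays are placeholder data invented for illustration:

```python
# Paired pre/post analysis: differences, t-statistic, p-value, and 95% CI.
import numpy as np
from scipy import stats

pre  = np.array([61, 55, 70, 66, 58, 64, 73, 60, 68, 62])
post = np.array([66, 58, 74, 69, 63, 65, 78, 64, 71, 67])

d = post - pre                                     # step 1: differences
n = len(d)
se = d.std(ddof=1) / np.sqrt(n)
t_stat = d.mean() / se                             # step 2: t with df = n - 1
p = 2 * stats.t.sf(abs(t_stat), df=n - 1)          # two-sided p-value

half = stats.t.ppf(0.975, df=n - 1) * se           # step 3: 95% CI half-width
print(f"t = {t_stat:.2f}, p = {p:.4f}, "
      f"95% CI = ({d.mean() - half:.2f}, {d.mean() + half:.2f})")
```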

Practice Problems

  1. Compute two-sample mean test statistic and p-value with given summaries.
  2. Paired design: compute t for differences and interpret.
  3. Two proportions: compute pooled SE and z-statistic.

Detailed Examples

Example 1: Two-Sample t-test

Group A: n₁=25, x̄₁=85, s₁=12; Group B: n₂=30, x̄₂=78, s₂=15

1. SE = √(12²/25 + 15²/30) = √(5.76 + 7.5) = √13.26 ≈ 3.64

2. t = (85 − 78)/3.64 ≈ 1.92

3. df ≈ 53 (Welch approximation), p-value ≈ 0.06 (two-tailed)

4. 95% CI: 7 ± 2.01×3.64 ≈ 7 ± 7.32 = (−0.32, 14.32); the interval contains 0, consistent with p > 0.05

Example 2: Paired t-test

Before/after scores: d̄ = 3.2, s_d = 2.8, n = 15

1. t = 3.2/(2.8/√15) = 3.2/0.723 ≈ 4.43

2. df = 14, p-value < 0.001 (two-tailed)

3. 95% CI: 3.2 ± 2.145×0.72 = (1.65, 4.75)

Assumptions and Robustness

  • Independence within and between groups; approximate normality for small n.
  • Large-sample z approximations improve with n (CLT intuition).
  • Use nonparametric alternatives when assumptions fail (preview).

Effect Sizes and Reporting Template

Report template:
- Design: two-sample (independent) / paired
- Assumptions: ...
- Estimate (CI): ...
- Test: statistic, df, p-value
- Effect size: Cohen's d / difference in proportions
- Practical meaning: ...

Practice Problems with Solutions

Problem 1: Two-Sample t-Test

Given: Group 1: n₁=20, x̄₁=45, s₁=8; Group 2: n₂=25, x̄₂=38, s₂=10

Test: H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂

Solution

Step 1: Calculate standard error:

SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{8^2}{20} + \frac{10^2}{25}} = \sqrt{3.2 + 4} = \sqrt{7.2} \approx 2.68

Step 2: Calculate t-statistic:

t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{45 - 38}{2.68} = \frac{7}{2.68} \approx 2.61

Step 3: Find p-value and make decision:

df ≈ 43 (Welch approximation), p-value ≈ 0.012 (two-tailed)

Conclusion: Since p < 0.05, we reject H₀. There is significant evidence that the means differ.

Problem 2: Paired t-Test

Given: Paired data: d̄ = 2.5, s_d = 1.8, n = 12

Test: H₀: μ_d = 0 vs H₁: μ_d > 0

Solution

Step 1: Calculate t-statistic:

t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{2.5}{1.8/\sqrt{12}} = \frac{2.5}{0.52} \approx 4.81

Step 2: Find p-value and make decision:

df = 11, p-value < 0.001 (one-tailed)

Conclusion: Since p < 0.001, we reject H₀. There is strong evidence that the mean difference is greater than zero.

Problem 3: Two-Proportion z-Test

Given: p̂₁ = 0.6 (n₁=100), p̂₂ = 0.45 (n₂=120)

Test: H₀: p₁ = p₂ vs H₁: p₁ ≠ p₂

Solution

Step 1: Calculate pooled proportion:

\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{60 + 54}{100 + 120} = \frac{114}{220} \approx 0.518

Step 2: Calculate z-statistic:

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.6 - 0.45}{\sqrt{0.518 \times 0.482 \times \left(\frac{1}{100} + \frac{1}{120}\right)}} \approx 2.22

Step 3: Find p-value and make decision:

p-value ≈ 0.027 (two-tailed)

Conclusion: Since p < 0.05, we reject H₀. There is significant evidence that the proportions differ.
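
If statsmodels is installed, the same pooled test is a one-liner and should reproduce the hand calculation above:

```python
# Cross-check with statsmodels (assumed installed); uses the pooled variance.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([60, 54])     # successes in each sample
nobs   = np.array([100, 120])   # sample sizes

z, p = proportions_ztest(counts, nobs)   # two-sided by default
print(f"z = {z:.2f}, p = {p:.3f}")       # z ≈ 2.22, p ≈ 0.027
```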

Advanced Insights

  • Power analysis basics: power (1 − β) depends on effect size, variability, n, and α.
  • Multiple testing and false discovery rate (preview concepts).
  • Equivalence and non-inferiority tests for practical sameness.

Common Pitfalls

  • Interpreting the p-value as the probability the null is true.
  • Ignoring practical significance by reporting p only.
  • HARKing: forming hypotheses after results are known.

Power Analysis and Sample Size Planning

Statistical Power

Power (1 - β): The probability of correctly rejecting the null hypothesis when it's false.

Factors affecting power:

  • Effect size: Larger effects are easier to detect
  • Sample size: More data increases power
  • Significance level (α): Higher α increases power but also Type I error
  • Variability: Less variability increases power

Power calculation for two-sample t-test:

\text{Power} = 1 - \beta = P\left(\left|\frac{\bar{X}_1 - \bar{X}_2}{SE}\right| > t_{\alpha/2,\,df} \;\middle|\; \mu_1 \neq \mu_2\right)
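
A normal-approximation sketch of this calculation; the planning inputs delta, sigma, and n are invented for illustration (SciPy assumed available):

```python
# Approximate power of a two-sided, two-sample comparison (normal approximation).
from math import sqrt
from scipy.stats import norm

alpha = 0.05
delta, sigma, n = 3.0, 4.0, 30   # true difference, SD, per-group n (hypothetical)

se = sigma * sqrt(2 / n)         # SE of the difference in means
z_crit = norm.ppf(1 - alpha / 2)
# Probability the test statistic lands in either rejection region
power = norm.sf(z_crit - delta / se) + norm.cdf(-z_crit - delta / se)
print(f"power ≈ {power:.3f}")    # ≈ 0.83 for these inputs
```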

Sample Size Planning

Before conducting an experiment, determine the required sample size to achieve desired power:

For two-sample t-test:

n = \frac{2\sigma^2 (z_{\alpha/2} + z_{\beta})^2}{\delta^2}

where δ is the minimum detectable difference in means, σ is the (assumed common) standard deviation, the z values correspond to the desired α and power 1 − β, and n is the required size of each group.
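
The same formula in code (SciPy assumed available; δ and σ are hypothetical planning inputs):

```python
# Per-group sample size for a two-sample comparison at given alpha and power.
from math import ceil
from scipy.stats import norm

alpha, power = 0.05, 0.80
delta, sigma = 3.0, 4.0       # minimum detectable difference and SD (hypothetical)

z_a = norm.ppf(1 - alpha / 2)      # z_{alpha/2}
z_b = norm.ppf(power)              # z_beta, since power = 1 - beta
n = 2 * sigma**2 * (z_a + z_b)**2 / delta**2
print(f"n per group ≈ {ceil(n)}")  # round up to be conservative (≈ 28 here)
```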

Effect Sizes and Practical Significance

Cohen's d (Standardized Effect Size)

Measures the standardized difference between two means:

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}

Conventional benchmarks:

  • Small effect: d = 0.2
  • Medium effect: d = 0.5
  • Large effect: d = 0.8
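
A minimal sketch of this computation; the helper name cohens_d and the synthetic data are invented for illustration, and the pooled SD uses the common (n − 1)-weighted convention:

```python
# Cohen's d with a pooled standard deviation.
import numpy as np

def cohens_d(x1, x2):
    """Standardized mean difference using the pooled SD."""
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(x1, ddof=1) +
                        (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s_pooled

rng = np.random.default_rng(0)
group_a = rng.normal(15, 4, size=40)   # synthetic data
group_b = rng.normal(12, 3, size=45)
print(f"d = {cohens_d(group_a, group_b):.2f}")
```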

Confidence Intervals for Effect Sizes

Report confidence intervals alongside p-values to provide information about precision and practical significance:

95% CI for difference in means:

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,df} \cdot SE

Interpretation: If the confidence interval excludes zero, it provides evidence for a significant difference. The width of the interval indicates the precision of the estimate.
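
Computing this interval for the Example 1 summaries, with a Welch approximation for df (SciPy assumed available):

```python
# 95% CI for a difference in means with Welch-Satterthwaite degrees of freedom.
from scipy import stats

m1, s1, n1 = 85.0, 12.0, 25    # Example 1, Group A summaries
m2, s2, n2 = 78.0, 15.0, 30    # Example 1, Group B summaries

se2 = s1**2 / n1 + s2**2 / n2
se = se2 ** 0.5
df = se2**2 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
half = stats.t.ppf(0.975, df) * se
print(f"df ≈ {df:.1f}, 95% CI = ({m1 - m2 - half:.2f}, {m1 - m2 + half:.2f})")
# → df ≈ 52.9, CI ≈ (−0.30, 14.30): contains 0, matching p ≈ 0.06 above
```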

Common Pitfalls and Misconceptions

P-hacking and Multiple Testing

Conducting multiple tests without adjusting significance levels inflates Type I error rates. Use methods like Bonferroni correction or false discovery rate control when testing multiple hypotheses.

Confusing Statistical and Practical Significance

A statistically significant result (p < 0.05) doesn't guarantee practical importance. Always consider effect sizes and confidence intervals to assess practical significance.

Post-hoc Hypothesis Formation

Forming hypotheses after seeing the data (HARKing) invalidates the statistical framework. Hypotheses should be stated before data collection and analysis.

Real-World Applications

Medical Research

  • Clinical trials for drug efficacy
  • A/B testing for treatment protocols
  • Epidemiological studies
  • Quality control in medical devices

Business and Marketing

  • Website conversion rate optimization
  • Product feature testing
  • Customer satisfaction surveys
  • Pricing strategy experiments

Education and Psychology

  • Educational intervention studies
  • Learning method comparisons
  • Behavioral psychology experiments
  • Assessment tool validation

Technology and Engineering

  • Algorithm performance testing
  • System reliability studies
  • User interface design testing
  • Manufacturing process optimization

Summary

Hypothesis testing provides a systematic framework for evaluating claims using data. The process involves stating hypotheses, choosing appropriate tests, computing test statistics, finding p-values, making decisions, and reporting confidence intervals with effect sizes.

Key principles: Proper experimental design (randomization, controls, replication, blinding) is essential for valid inference. Always consider both statistical and practical significance, and be aware of common pitfalls like multiple testing and p-hacking.

Best practices: Plan sample sizes based on power analysis, report effect sizes and confidence intervals, and interpret results in the context of the research question and domain knowledge.