MathIsimple

P-Values Don't Mean What You Think They Mean

Even published researchers get this wrong. Here's what a p-value of 0.03 actually tells you.

February 15, 2026
Statistics
Research
Data Science
Math Basics

The Statistic That Even Researchers Get Wrong

I got through an entire statistics course thinking p < 0.05 meant "probably true." It doesn't. Not even close.

In 2016, the American Statistical Association took the unusual step of publishing an official statement on p-values, precisely because misinterpretation is rampant even among researchers who use them daily. Some believe a p-value is the probability the result is due to chance. Others think it measures the probability the hypothesis is true. Both wrong.

Here's what a p-value actually is: the probability of seeing data this extreme (or more extreme) if the null hypothesis were true. Read that again. It's not about your hypothesis being right. It's about how surprising your data would be in a world where nothing interesting is happening.

The Coin Flip That Explains Everything

Forget formulas for a minute. Grab a coin.

Your null hypothesis: this coin is fair (50/50). You flip it 20 times and get 15 heads. Suspicious? Maybe. But how suspicious?

The p-value answers: "If this coin really is fair, what's the probability of getting 15 or more heads in 20 flips?" Run the binomial math and you get p = 0.021. About a 2.1% chance.

That 0.021 doesn't mean there's a 2.1% chance the coin is fair. It means: if the coin is fair, you'd see a result this lopsided only about 2.1% of the time. The p-value assumes the null hypothesis is true and asks how weird your data looks under that assumption.

Since 0.021 < 0.05, most researchers would "reject the null" and conclude the coin is probably loaded. But "probably" is doing a lot of heavy lifting in that sentence.
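You can check that 2.1% yourself with a few lines of standard-library Python. No stats package needed; `math.comb` does the counting:

```python
from math import comb

def binom_p_value(heads, flips, p_fair=0.5):
    """P(X >= heads) in `flips` tosses of a fair coin, by direct summation."""
    return sum(comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
               for k in range(heads, flips + 1))

p = binom_p_value(15, 20)
print(round(p, 3))  # → 0.021
```

Note this is a one-sided p-value (15 *or more* heads), matching the question as posed above; a two-sided test would also count results as lopsided toward tails.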

Why 0.05? (It's More Arbitrary Than You Think)

Ronald Fisher — the statistician who popularized significance testing in the 1920s — once wrote that 0.05 was a "convenient" threshold. Not sacred. Not mathematically derived. Convenient.

It stuck. Journals adopted it. Funding agencies required it. Entire careers now hinge on whether a number lands at 0.049 or 0.051. The difference between "publishable" and "file drawer" is often a rounding error.

Some fields have started pushing back. Particle physics uses p < 0.0000003 (the "5-sigma" standard) before claiming a discovery. Genomics uses even stricter thresholds because they're testing thousands of hypotheses simultaneously. Meanwhile, psychology and social science are still arguing about whether 0.05 is too lenient.

The replication crisis — where landmark studies failed to reproduce — wasn't caused by p-values alone. But the obsession with crossing the 0.05 line created perverse incentives: p-hacking, selective reporting, and "HARKing" (hypothesizing after results are known). When the threshold becomes the goal, the science suffers.

What a P-Value of 0.03 Actually Tells You (And What It Doesn't)

Let's say you run a study testing whether a new drug lowers blood pressure. You get p = 0.03. Here's the scorecard:

| Statement | True or False? |
| --- | --- |
| "There's a 3% chance the drug doesn't work" | False |
| "There's a 97% chance the drug works" | False |
| "If the drug had no effect, we'd see data this extreme only 3% of the time" | True |
| "The drug has a large effect" | False. P-values say nothing about effect size |

That third row is the only correct interpretation. The p-value lives entirely inside the null hypothesis world. It can't tell you the probability that your alternative hypothesis is true — that requires Bayesian statistics, which is a different conversation entirely.

And here's the part that trips up even experienced researchers: a small p-value with a huge sample size might reflect a real but trivially small effect. A drug that lowers blood pressure by 0.5 mmHg could easily hit p < 0.001 with 50,000 participants. Statistically significant? Yes. Clinically meaningful? Not remotely.
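That large-sample trap is easy to demonstrate with a quick z-test under the normal approximation. The 0.5 mmHg effect and the standard deviation of 10 mmHg below are illustrative assumptions, not numbers from any real trial:

```python
from math import sqrt, erfc

def two_sample_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in group means (normal approximation)."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = mean_diff / se
    return erfc(abs(z) / sqrt(2))     # two-sided tail probability

# A 0.5 mmHg drop (assumed sd = 10 mmHg) is nowhere near significant
# with 100 people per group...
print(two_sample_p(0.5, 10, 100))     # p ≈ 0.72
# ...but the same tiny effect sails past p < 0.001 with 50,000 per group.
print(two_sample_p(0.5, 10, 50_000))
```

Nothing about the drug changed between the two calls; only the sample size did.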

Type I, Type II, and the Error Nobody Talks About

Rejecting the null when it's actually true — that's a Type I error (false positive). The p-value threshold directly controls this: at α = 0.05, you accept a 5% false positive rate.

But there's a mirror image. Failing to reject the null when it's actually false — that's a Type II error (false negative). The probability of avoiding this is called statistical power, and most studies are woefully underpowered.

A study with 80% power (the commonly recommended minimum) still misses real effects 20% of the time. Many published studies have power closer to 50%, which means they're basically coin flips for detecting true effects. Your p-value might be 0.03, but if the study only had 40 participants, the confidence you should place in that result is... limited.
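To see how fragile a 40-participant study is, here's a rough power calculation. Both the normal approximation and the assumed standardized "medium" effect of d = 0.5 are simplifying choices for illustration:

```python
from math import sqrt, erfc

def normal_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return erfc(-x / sqrt(2)) / 2

def approx_power(effect_d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sample test at alpha = 0.05, two-sided."""
    se = sqrt(2 / n_per_group)        # se of the standardized mean difference
    return normal_cdf(effect_d / se - z_crit)

# 40 participants total (20 per group), medium effect d = 0.5:
print(round(approx_power(0.5, 20), 2))   # ≈ 0.35
# Reaching the recommended 80% power takes roughly 64 per group:
print(round(approx_power(0.5, 64), 2))   # ≈ 0.81
```

A 35%-power study misses a genuinely medium-sized effect nearly two times out of three, which is exactly why a lone p = 0.03 from such a study deserves limited confidence.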

The relationship between these concepts matters more than any single number. A misleading percentage can distort your understanding of data just as badly as a misinterpreted p-value.

So What Should You Actually Look At?

P-values aren't useless. They're just one piece of a bigger picture. Here's what a responsible analysis includes:

Effect Size

How big is the difference? Cohen's d, odds ratios, correlation coefficients — these tell you whether the result matters practically, not just statistically.

Confidence Interval

A range of plausible values for the true effect. A 95% CI of [0.1, 15.2] tells you much more than "p = 0.04" alone — the effect could be tiny or huge.
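An interval like that can be reproduced with a normal-approximation CI. The estimate and standard error below are reverse-engineered to match the illustrative [0.1, 15.2] interval, not taken from a real study:

```python
from math import sqrt, erfc

def ci_and_p(estimate, se):
    """95% CI (normal approximation, critical value 1.96) and two-sided p-value."""
    z = estimate / se
    p = erfc(abs(z) / sqrt(2))
    return (estimate - 1.96 * se, estimate + 1.96 * se), p

(lo, hi), p = ci_and_p(7.65, 3.85)
print(f"CI = [{lo:.1f}, {hi:.1f}], p = {p:.3f}")
# CI = [0.1, 15.2], p = 0.047 — "significant", yet the effect could be
# anywhere from negligible to enormous.
```

Same data, two summaries: the p-value alone says "significant", while the interval honestly reports how little the study pinned down.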

The choice of which average to report can be just as misleading as cherry-picking a p-value. Statistics is full of decisions that shape the story your data tells.

Next time you read "the results were statistically significant," ask two questions: How big was the effect? And how many people were in the study? If the paper doesn't answer both, the p-value alone isn't telling you much.

Frequently Asked Questions

What does a p-value of 0.05 mean?

It means that if the null hypothesis is true (no real effect), you'd see data this extreme about 5% of the time by random chance alone. It does not mean there's a 5% chance the null hypothesis is true, or a 95% chance your result is correct.

Why is p < 0.05 the standard threshold?

Historical convention, not mathematical necessity. Ronald Fisher suggested it as a reasonable cutoff in the 1920s, and it became entrenched in academic publishing. Different fields use different thresholds — particle physics requires p < 0.0000003 for discovery claims.

What's the difference between a p-value and a confidence interval?

A p-value gives you a single yes/no decision point. A confidence interval gives you a range of plausible values for the true effect. A 95% CI that doesn't include zero corresponds to p < 0.05, but the interval also shows you how precise your estimate is and how large the effect might be.

Run Your Own Significance Test

Got a z-score or t-statistic? A couple of lines of code will tell you where it lands. No coin flipping required.
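For a z-score, the two-sided p-value is essentially one line of standard-library Python (a t-statistic with few degrees of freedom would need the t distribution instead, e.g. from scipy.stats):

```python
from math import sqrt, erfc

def p_from_z(z):
    """Two-sided p-value for a z statistic."""
    return erfc(abs(z) / sqrt(2))

print(round(p_from_z(1.96), 3))   # → 0.05, the classic threshold
print(round(p_from_z(2.58), 3))   # → 0.01
```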

Remember: statistical significance ≠ practical significance.
