Mar 29, 2026 · 7 min read

P-Hacking: Torturing Data Until It Confesses

In 2011, psychologist Daryl Bem published a peer-reviewed paper in one of psychology's most prestigious journals claiming that humans could sense the future. His experiments showed statistically significant evidence for precognition — people performed slightly better than chance at predicting where erotic images would appear on a screen before the positions had even been randomly determined. The p-values were below 0.05. The methodology looked standard. The paper passed peer review. And it was almost certainly wrong — not because Bem was dishonest, but because of a statistical trap that quietly distorts a vast portion of published science: p-hacking.

What Is a P-Value — And Why Does It Break?

A p-value is the probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true — that is, assuming there is no real effect. By convention, scientists typically treat p < 0.05 as "statistically significant": a result that would occur by chance less than 5% of the time if the null hypothesis held.
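To make the definition concrete, here is a minimal sketch (Python with NumPy; the eight-point sample is invented for illustration). It computes a p-value directly from the definition: the fraction of null-generated test statistics at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample; null hypothesis: the true mean is zero.
observed = np.array([1.2, -0.3, 0.8, 2.1, 0.4, -0.6, 1.5, 1.0])

def t_stat(x):
    """One-sample t statistic against a mean of zero."""
    return x.mean(axis=-1) / (x.std(axis=-1, ddof=1) / np.sqrt(x.shape[-1]))

t_obs = t_stat(observed)

# Simulate the null: 100,000 same-size samples with true mean zero,
# using the observed spread as the population spread.
null_samples = rng.normal(0.0, observed.std(ddof=1), size=(100_000, observed.size))
null_stats = t_stat(null_samples)

# Two-sided p-value: fraction of null statistics at least as extreme.
p_value = np.mean(np.abs(null_stats) >= abs(t_obs))
print(f"simulated p-value: {p_value:.3f}")  # roughly 0.05 for this sample
```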

The threshold sounds conservative. It isn't. Here's the problem: if you run 20 independent tests on random data, you expect one of them to cross the p < 0.05 threshold by chance alone. If you run 100 tests, you expect five false positives. The p-value threshold was designed for a world where a researcher forms a hypothesis, runs a single pre-specified test, and reports the result — whatever it is. That world bears little resemblance to how much of science is actually conducted.
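The arithmetic is easy to verify by simulation. The sketch below (Python with NumPy and SciPy, both assumed available) runs 20 comparisons per experiment on pure noise and counts how often at least one comes out "significant"; the expected rate is 1 - 0.95^20, about 64%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, tainted = 2_000, 0

for _ in range(n_experiments):
    # 20 comparisons between groups drawn from the SAME distribution,
    # so every "significant" result is a false positive by construction.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(20)
    ]
    if min(p_values) < 0.05:
        tainted += 1

print(f"experiments with at least one false positive: {tainted / n_experiments:.0%}")
# Prints roughly 64%, matching 1 - 0.95**20.
```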

P-hacking is the exploitation of researcher flexibility to keep testing until significance appears. It is rarely deliberate fraud. More often it is the product of normal scientific curiosity — trying different subgroups, different statistical tests, adding or removing covariates, collecting a few more data points — combined with a publication system that rewards positive results and a cognitive tendency to stop when the answer looks like the one you hoped for.

The Mechanics of Data Dredging

The specific techniques that produce p-hacking are mundane and individually defensible:

  • Optional stopping: Collecting data and testing repeatedly until p < 0.05 appears, then stopping. Each additional test inflates the false positive rate beyond 5%.
  • Outcome switching: Pre-registering one primary outcome, finding it non-significant, then reporting a different variable that happened to be significant as if it were the original hypothesis.
  • Subgroup fishing: Running an analysis on the full sample, finding no effect, then testing subgroups (men only, women only, under-30s, high-income participants) until a significant slice appears.
  • Covariate manipulation: Adding or removing control variables from a regression model until the target coefficient crosses the significance threshold.
  • Flexible exclusions: Re-running analyses with and without outliers, choosing the version that produces the desired result.

Simmons, Nelson, and Simonsohn demonstrated in a famous 2011 paper in Psychological Science that using just four of these researcher degrees of freedom — sample size flexibility, choice of dependent variable, inclusion of covariates, and exclusion of subjects — can inflate the false positive rate from the nominal 5% to over 60%. In other words, a researcher following no rules at all could produce "significant" results the majority of the time from pure noise.
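Optional stopping, the first item on the list, is especially easy to simulate. The sketch below (Python with NumPy and SciPy; the starting sample size and peeking schedule are arbitrary illustrative choices) tests pure noise after every five new observations and stops at the first p < 0.05, the way an impatient researcher might:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, false_positives = 2_000, 0

for _ in range(n_sims):
    data = list(rng.normal(size=10))      # start with 10 observations of pure noise
    while len(data) <= 100:
        if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
            false_positives += 1          # "significant": stop and write it up
            break
        data.extend(rng.normal(size=5))   # not significant: collect 5 more and retry

print(f"false positive rate with optional stopping: {false_positives / n_sims:.0%}")
# Well above the nominal 5%; the exact figure depends on how often you peek.
```

Each individual test is valid on its own; it is the stopping rule that corrupts the aggregate.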

The Replication Crisis

The empirical cost of p-hacking became visible in 2015, when the Open Science Collaboration published a landmark attempt to replicate 100 published psychology studies. Only 36% of the replications showed a statistically significant result in the same direction as the original. The average effect size in replications was roughly half that reported in the originals. The paper, published in Science, triggered a field-wide reckoning that spread to medicine, economics, neuroscience, and nutrition research.

Put differently, nearly two-thirds of originally "significant" findings were not significant when repeated by independent teams. In cancer biology, a replication effort by Amgen researchers found that only 6 of 53 landmark studies (11%) could be successfully replicated. The implications are not abstract: clinical guidelines, public health policy, and business practices are built on findings that may be statistical artifacts.

The problem is systemic, not individual. Publication bias — journals preferring positive results — creates pressure to produce them. Careers depend on publication counts. Funders favour novel findings. Peer reviewers rarely check raw data. The result is a literature substantially inflated with effects that are either much smaller than reported or entirely fictional.

Classic Case Studies

Beyond Bem's precognition study, some of the most cited examples of p-hacking consequences include:

  • The "ego depletion" effect — the idea that willpower is a limited resource, supported by hundreds of published studies — failed to replicate in a pre-registered multi-lab test across 23 laboratories in 2016.
  • Power posing — Amy Cuddy's claim that holding expansive body postures raises testosterone and lowers cortisol — became a global TED phenomenon. Independent replications found no effect on hormones; the original result was a statistical artifact of flexible analysis.
  • Nutritional epidemiology has been systematically shown to produce contradictory findings, in part because studies mining large food-frequency questionnaires can find "significant" associations between nearly any food and nearly any health outcome given enough variables to test.

The 5% Solution That Isn't

A common misunderstanding compounds the p-hacking problem: even a genuine p < 0.05 result does not mean there is a 95% probability the effect is real. The probability that a significant result reflects a true effect depends heavily on the prior probability that the hypothesis was correct in the first place — a calculation that most researchers and readers don't perform. This is the Base Rate Fallacy applied to statistical testing: ignoring how unlikely most hypotheses are before the data even arrive.

If only 10% of tested hypotheses are true in a field that runs many exploratory studies, and the false positive rate is 5% while statistical power is 80%, then more than a third of "significant" results in that field are false positives — even without any deliberate p-hacking. Add researcher flexibility, and the proportion climbs further. John Ioannidis made this point in a provocative 2005 paper titled "Why Most Published Research Findings Are False" — a claim that subsequent evidence has substantially supported in many fields.
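The arithmetic behind that "more than a third" claim is worth seeing explicitly. A minimal sketch, using the hypothetical numbers from the paragraph above:

```python
# Hypothetical field from the paragraph above.
prior_true = 0.10   # fraction of tested hypotheses that are actually true
alpha      = 0.05   # false positive rate (significance threshold)
power      = 0.80   # probability of detecting a true effect

true_positives  = prior_true * power          # 0.10 * 0.80 = 0.080
false_positives = (1 - prior_true) * alpha    # 0.90 * 0.05 = 0.045

share_false = false_positives / (true_positives + false_positives)
print(f"share of 'significant' results that are false: {share_false:.0%}")  # 36%
```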

Defences Against Data Dredging

The research community has developed several countermeasures, with varying adoption:

  • Pre-registration: Researchers publicly specify their hypotheses, sample sizes, and analysis plans before collecting data, making post-hoc fishing visible. Pre-registered studies show substantially smaller effect sizes on average than non-pre-registered ones.
  • Registered Reports: A publication format in which journals commit to publish the study regardless of the result, based on peer review of the protocol before data collection. This eliminates publication bias at the source.
  • Multiple comparison corrections: Statistical adjustments (Bonferroni, Benjamini-Hochberg) that lower the per-test significance threshold when many tests are conducted, keeping the overall false positive rate in check (see the sketch after this list).
  • Open data and materials: Sharing raw data and analysis code allows independent replication and makes analytical flexibility visible.
  • Bayesian methods: Replacing p-values with Bayes factors or posterior probabilities incorporates prior probability and avoids the arbitrary significance threshold that p-hacking exploits.
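As a concrete illustration of the corrections mentioned above, here is a minimal sketch of Bonferroni and Benjamini-Hochberg applied to a hypothetical batch of 20 p-values (Python with NumPy; the values are invented to show the contrast):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected.tolist()

# Hypothetical batch of 20 p-values: five small, fifteen clearly null.
p_values = [0.001, 0.004, 0.006, 0.009, 0.012] + [0.3] * 15

print(sum(bonferroni(p_values)))          # 1 -- only p = 0.001 < 0.05 / 20
print(sum(benjamini_hochberg(p_values)))  # 5 -- FDR control is less strict
```

The contrast reflects a design choice: Bonferroni guards against any false positive at all and is correspondingly harsh, while Benjamini-Hochberg tolerates a controlled fraction of false discoveries and so retains more findings.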

Why It Persists

Despite growing awareness, p-hacking persists because the incentive structures that produce it remain largely intact. Hiring committees still count publications. High-profile journals still preferentially publish surprising, positive results. Funders still want breakthroughs. Confirmation bias makes it genuinely difficult for researchers to notice when their analysis choices are following results rather than leading them. The flexibility that enables p-hacking often feels like thoroughness — checking multiple angles, being careful about outliers, controlling for obvious confounds.

The cure is structural as much as individual. Recognising p-hacking as a pattern is the first step toward demanding more from the research you encounter: Was this pre-registered? Have others replicated it? How large is the effect size, not just the significance? A p-value below 0.05 is a starting point for inquiry, not a conclusion. Treating it as the latter is precisely how data ends up confessing to things it never actually knew.

Related Concepts

P-hacking rarely acts alone. It is amplified by confirmation bias — the tendency to seek evidence that confirms prior beliefs — and by apophenia, the tendency to perceive meaningful patterns in random data. The false positives it generates feed the availability heuristic in the public sphere: dramatic study results become memorable, while quiet failed replications go unreported and unremembered. Understanding p-hacking is inseparable from understanding why ghost variables and base rates matter — together, they form the core of statistical reasoning under uncertainty.

Sources

  • Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science, 349(6251). doi:10.1126/science.aac4716
  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). "False-positive psychology." Psychological Science, 22(11), 1359–1366.
  • Ioannidis, J. P. A. (2005). "Why most published research findings are false." PLOS Medicine, 2(8), e124.
  • Gelman, A., & Loken, E. (2014). "The statistical crisis in science." American Scientist, 102(6), 460.
  • Head, M. L., et al. (2015). "The extent and consequences of p-hacking in science." PLOS Biology, 13(3), e1002106.
