Data Dredging: Torturing the Data Until It Confesses
In 2011, the science webcomic xkcd ran a strip that became a staple of methods courses. Scientists investigate whether jelly beans cause acne. The first test — "jelly beans cause acne?" — comes back negative. They try again with specific colours: green jelly beans? No. Purple? No. Brown? No. They test twenty colours in all. Green jelly beans — that's the one. p < 0.05. The next day's newspaper headline: "Green Jelly Beans Linked to Acne! 95% Confidence! Science Confirms!" The strip is funny because it is entirely accurate. When you perform twenty independent tests at a 5% significance threshold, you expect about one false positive by chance alone. The lone positive result looks like a discovery. It is noise.
The Probability Arithmetic of Multiple Testing
The logic of statistical significance testing is built on a single-test assumption: before collecting data, you formulate a hypothesis, collect data to test it, and a p-value below 0.05 means there is less than a 5% chance of observing this result (or something more extreme) if the null hypothesis were true. That 5% is your false positive rate — your probability of a spurious finding — for a single test.
But probability compounds. If you test twenty independent hypotheses on the same dataset, each at p < 0.05, the probability of obtaining at least one false positive is not 5% — it is approximately 64%. With fifty tests, it is over 90%. This is the familywise error rate, and it renders the p-value meaningless when many tests are conducted without adjustment.
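The compounding is a two-line computation. A minimal sketch, assuming the tests are independent and each run at α = 0.05:

```python
# Familywise error rate: probability of at least one false positive
# across k independent tests, each run at significance level alpha.
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

for k in (1, 20, 50):
    print(f"{k:>2} tests: P(>=1 false positive) = {familywise_error_rate(k):.2f}")
# Output:
#  1 tests: P(>=1 false positive) = 0.05
# 20 tests: P(>=1 false positive) = 0.64
# 50 tests: P(>=1 false positive) = 0.92
```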
The standard correction for multiple comparisons is the Bonferroni correction: divide the significance threshold by the number of tests. If you run twenty tests, you should require p < 0.0025 for any individual result to be considered significant. There are more powerful alternatives (the Benjamini-Hochberg procedure for controlling the false discovery rate is widely used in genomics), but all share the underlying logic: the threshold for what counts as surprising must become more stringent as more hypotheses are evaluated.
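Both corrections are a few lines of code. A minimal sketch on a hypothetical set of eight p-values (illustrative numbers, not from any real study):

```python
# Sketch: Bonferroni vs. Benjamini-Hochberg on hypothetical p-values.
p_values = [0.001, 0.008, 0.012, 0.041, 0.049, 0.2, 0.44, 0.6]
alpha = 0.05
m = len(p_values)

# Bonferroni: controls the familywise error rate by testing each
# hypothesis at alpha / m.
bonferroni = [p for p in p_values if p < alpha / m]

# Benjamini-Hochberg: controls the false discovery rate. Sort the
# p-values, find the largest rank i with p_(i) <= (i / m) * alpha,
# and reject all hypotheses up to and including that rank.
ranked = sorted(p_values)
cutoff = 0
for i, p in enumerate(ranked, start=1):
    if p <= (i / m) * alpha:
        cutoff = i
bh = ranked[:cutoff]

print("Bonferroni rejects:", bonferroni)  # [0.001] -- threshold is 0.00625
print("BH rejects:", bh)                  # [0.001, 0.008, 0.012]
```

As the output shows, Benjamini-Hochberg rejects more nulls at the same nominal level, which is why it is preferred when many true effects are expected, as in genomics.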
The Many Faces of Dredging
Subgroup Analysis
A clinical trial tests a new drug and finds no overall effect. The researchers then slice the data: does it work better in women? In older patients? In patients with a particular genetic variant? In patients from a certain geographic region? In patients who were diagnosed within six months of starting treatment? Each of these subgroup analyses is another hypothesis test. If twenty subgroups are examined and one shows a significant benefit, that finding will typically be the one reported — and the nineteen null results will be quietly omitted. The drug may be approved for that subgroup on the basis of what is, statistically, a coin flip.
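A simulation makes the point concrete. A minimal sketch, assuming twenty equally sized subgroups and a drug with zero true effect, so every "significant" subgroup is a false positive by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_subgroups, n_per_arm = 1000, 20, 50

hits = 0
for _ in range(n_trials):
    found = False
    for _ in range(n_subgroups):
        # Null trial: drug and placebo outcomes come from the same
        # distribution, so the drug does nothing in every subgroup.
        drug = rng.normal(0, 1, n_per_arm)
        placebo = rng.normal(0, 1, n_per_arm)
        if stats.ttest_ind(drug, placebo).pvalue < 0.05:
            found = True
            break
    hits += found

print(f"Trials with >=1 'significant' subgroup: {hits / n_trials:.0%}")  # ~64%
```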
The ISIS-2 trial, testing aspirin and streptokinase in heart attack patients, famously found a significant overall benefit from aspirin. A subgroup analysis was then run: did it work in patients born under Gemini or Libra? It did not. The finding was included as a deliberate demonstration that subgroup analyses, however "significant," can be meaningless. Not all researchers are as candid about the absurdity of their own subgroup fishing.
Outcome Switching
Pre-registered clinical trials specify their primary outcome — the main thing they are measuring — before data collection begins. But many trials are not pre-registered, or are pre-registered loosely, and some researchers measure a dozen potential outcomes (pain scores, quality of life, biomarkers, secondary endpoints) and report whichever ones came out significant. This is outcome switching, and it produces findings that look precisely like pre-specified primary results but are actually post-hoc selections from a menu of possibilities.
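The arithmetic of outcome switching is easy to check by Monte Carlo. A minimal sketch, assuming twelve independent outcome measures and no true effect on any of them:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_outcomes = 100_000, 12

# Under the null, each outcome's p-value is uniform on (0, 1).
# Outcome switching reports only the smallest of the twelve.
reported = rng.uniform(size=(n_sims, n_outcomes)).min(axis=1)

print(f"P(reported p < 0.05): {(reported < 0.05).mean():.0%}")  # ~46%
print(f"Median reported p:    {np.median(reported):.3f}")       # ~0.056
```

Nearly half the time, a study measuring nothing real can report a "significant" primary finding, and the typical reported p-value hovers right at the publishable threshold.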
A landmark 2011 study by Simmons, Nelson, and Simonsohn ("False-Positive Psychology") quantified how much this flexibility matters. Using simulations and a pair of deliberately absurd demonstration experiments, they showed that a handful of common undisclosed choices (collecting several dependent variables and reporting one, deciding whether to gather more data after inspecting the results, adding a covariate, dropping a condition) can push a study's false positive rate above 60%. The flexibility itself — the room to make choices that affect the result — inflates the apparent significance of findings.
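The "gather more data after inspecting the results" practice, often called optional stopping, can be simulated directly. A minimal sketch under the null, with a hypothetical but typical stopping rule (start at 20 observations, add 10 at a time, stop at the first p < .05 or at 100):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, start_n, step, max_n = 2000, 20, 10, 100

hits = 0
for _ in range(n_sims):
    x = list(rng.normal(0, 1, start_n))  # null: the true effect is zero
    while True:
        p = stats.ttest_1samp(x, 0).pvalue
        if p < 0.05:          # stop and "publish" as soon as p dips below .05
            hits += 1
            break
        if len(x) >= max_n:   # give up at the sample-size ceiling
            break
        x.extend(rng.normal(0, 1, step))  # otherwise collect more data

print(f"False positive rate with optional stopping: {hits / n_sims:.0%}")
# Well above the nominal 5%, because every peek is another chance to stop.
```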
HARKing: Hypothesising After Results Are Known
Perhaps the most culturally embedded form of data dredging is HARKing — Hypothesising After Results are Known. A researcher explores a dataset, finds a pattern, and then writes up the paper as though the hypothesis came first. The analysis section reads "we predicted that X would be associated with Y"; what actually happened was "we looked at dozens of relationships and X-Y happened to be significant." The published paper is formally indistinguishable from a genuinely confirmatory study — but it carries none of that study's evidential weight.
HARKing is not always conscious deception. Many researchers genuinely convince themselves that the pattern they found was what they were always looking for; the exploratory process gets retconned into a prediction. This is motivated reasoning operating on memory itself. The result is a scientific literature full of papers that report confirmatory findings from what were actually exploratory analyses.
The Garden of Forking Paths
Statistician Andrew Gelman and methodologist Eric Loken borrowed the phrase "garden of forking paths" from a Borges story to describe a subtle but important variant of data dredging. Even a researcher who never consciously tests multiple hypotheses faces a vast space of analytical choices: which observations to exclude as outliers, which covariates to include in the regression, whether to log-transform a variable, how to handle missing data, which time window to use, whether to run parametric or non-parametric tests. Each of these choices is a fork in the path, and the analyst's decisions are typically influenced — often unconsciously — by whether the result is "moving in the right direction."
The garden of forking paths does not require a cynical researcher deliberately p-hacking. A well-intentioned scientist making seemingly reasonable analytic decisions at each fork, guided by theoretical intuition and a desire for clean results, can easily arrive at a p < 0.05 finding that is no more reliable than the green jelly bean result. The degrees of freedom in data analysis are enormous, and they are almost never fully disclosed.
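Even a handful of forks inflates the false positive rate. A sketch that automates what an analyst might explore by hand, using two of the forks named above (the specific outlier cutoffs are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 2000

hits = 0
for _ in range(n_sims):
    # Null data: two groups drawn from the same distribution.
    a, b = rng.normal(0, 1, 40), rng.normal(0, 1, 40)
    p_vals = []
    for cut in (np.inf, 2.5, 2.0):  # fork 1: which outlier rule to apply
        x, y = a[np.abs(a) < cut], b[np.abs(b) < cut]
        p_vals.append(stats.ttest_ind(x, y).pvalue)     # fork 2a: parametric
        p_vals.append(stats.mannwhitneyu(x, y).pvalue)  # fork 2b: non-parametric
    hits += min(p_vals) < 0.05  # the analyst reports the "cleanest" fork

print(f"P(some fork reaches p < .05): {hits / n_sims:.0%}")
# Noticeably above 5%, even though the forks are highly correlated.
```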
Structural Solutions
The scientific community has developed several structural responses to data dredging:
- Pre-registration: Specifying hypotheses, sample sizes, primary outcomes, and analytic procedures before data collection removes the ability to make post-hoc choices that inflate significance. Pre-registration does not prevent exploratory analysis — it just clearly labels it as exploratory.
- Registered Reports: A publication format in which journals peer-review a study's rationale and methods before data collection and commit to publishing the results regardless of outcome. This removes the publication incentive that drives fishing.
- Open data and code: Making raw data and analytic code publicly available allows others to verify that the reported analysis is what was actually done, and to check for analytic choices that were not disclosed.
- Multiple comparison corrections: Applying Bonferroni, Benjamini-Hochberg, or similar adjustments when many tests are run, and reporting all tests conducted — not just the significant ones.
- Replication: Treating initial positive results as hypotheses to be tested in independent samples, rather than as established findings. A result that holds in a pre-registered replication is far more credible than one that originated from a dredging process.
These measures share a common logic: they fix the analytic commitments before the data are seen, so that no one can claim credit for "predicting" whatever pattern the search happened to turn up. Data dredging and publication bias are closely related problems: dredging produces inflated effects, and publication bias ensures that the inflated effects are the ones that reach the journals while the null results stay in file drawers. Together they constitute a systematic machine for manufacturing false knowledge at scale.
The Tortured Data Problem in Practice
It is worth being concrete about how common these practices are. A 2012 survey of 2,155 psychologists by Leslie John and colleagues found that the majority reported having engaged in at least one questionable research practice: 58% had decided whether to collect more data after looking at whether results were significant; 43% had reported unexpected findings as though they had been predicted; 35% had reported results from only some of the outcome measures they had collected. These are not marginal researchers — they are a majority of the field, and there is no reason to think psychology is unique in this regard.
The phrase "torture the data until it confesses" is often attributed to Ronald Coase (though the exact attribution is disputed), and it captures the dynamic perfectly: given enough analytical flexibility, almost any dataset will eventually yield a significant finding. The question is whether that finding means anything. In most cases of data dredging, the honest answer is: probably not. The confession was extracted under duress.
Sources & Further Reading
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science 22, no. 11 (2011): 1359–1366.
- Gelman, A., & Loken, E. "The Statistical Crisis in Science." American Scientist 102, no. 6 (2014): 460–465.
- John, L. K., Loewenstein, G., & Prelec, D. "Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling." Psychological Science 23, no. 5 (2012): 524–532.
- Munafò, M. R., et al. "A Manifesto for Reproducible Science." Nature Human Behaviour 1 (2017): 0021.
- Head, M. L., et al. "The Extent and Consequences of P-Hacking in Science." PLOS Biology 13, no. 3 (2015): e1002106.
- xkcd #882: Significant (Green Jelly Beans)
- Wikipedia: Data dredging