Data Dredging (Fishing Expedition)

Also Known As: fishing expedition; HARKing (Hypothesizing After Results are Known); post-hoc analysis disguised as a priori

Statistical Error ID: data_dredging

Definition

Data dredging is the practice of exhaustively searching through data for any statistically significant patterns without a prior hypothesis, then presenting discovered patterns as if they were predicted in advance. While exploratory data analysis is legitimate when labeled as such, data dredging crosses the line by disguising exploratory findings as confirmatory results. The sheer number of possible correlations in any dataset virtually guarantees that some will pass significance thresholds by chance alone.
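The claim that chance alone produces "significant" patterns can be checked with a small simulation. The sketch below is illustrative only: the function names, the variable counts, and the seed are assumptions, and the cutoff of |r| > 0.2787 is the approximate two-sided 5% critical value for a Pearson correlation with 50 observations. Every variable is independent random noise, so any pair that clears the cutoff is a false positive.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_chance_hits(n_vars=40, n_obs=50, r_crit=0.2787, seed=0):
    """Correlate every pair of pure-noise variables and count 'significant' pairs.

    r_crit is the approximate two-sided 5% critical |r| for n_obs = 50;
    it must be changed if n_obs changes.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = sum(1 for i, j in pairs if abs(pearson_r(data[i], data[j])) > r_crit)
    return hits, len(pairs)


hits, total = count_chance_hits()
print(f"{hits} of {total} noise-only pairs pass the 5% threshold")
```

With 40 variables there are 780 pairs, so roughly 5% of them (about 39) would be expected to pass by chance, even though no real relationship exists anywhere in the data.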

Examples

A researcher has access to a large health database with 500 variables. After testing all 124,750 possible pairwise correlations, they find that ice cream consumption is significantly correlated with drowning deaths. They publish this as a confirmed finding without mentioning that it was one of nearly 125,000 tests or that both variables are driven by warm weather.
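The arithmetic behind this example can be worked through directly. The sketch below assumes a conventional per-test threshold of 0.05, which the example itself does not state. The expected count of false positives further assumes that no real relationships exist, and the probability of at least one false positive assumes independent tests, which pairwise correlations only approximately are.

```python
import math

n_variables = 500
alpha = 0.05  # assumed per-test significance threshold (not given in the example)

# Number of distinct variable pairs: C(500, 2)
n_tests = math.comb(n_variables, 2)

# Expected number of false positives if every null hypothesis is true
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, treating tests as independent
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"pairwise tests: {n_tests:,}")  # 124,750
print(f"expected false positives: {expected_false_positives:,.1f}")
print(f"P(at least one false positive): {p_at_least_one:.6f}")
```

Under these assumptions, thousands of correlations would cross the threshold with no real effect behind any of them, so a single reported hit carries almost no evidential weight on its own.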

A marketing team records 300 customer attributes and tests all combinations against purchase behavior. They announce a breakthrough finding: customers who prefer blue packaging and own a pet are 40% more likely to buy. The result is almost certainly due to chance: it has no theoretical basis, and no replication was attempted.

A political scientist downloads decades of county-level data with hundreds of economic and social indicators, then runs thousands of regressions until finding that per-capita bowling alley count significantly predicts voter turnout. The finding is published as a novel discovery without acknowledging the exhaustive search that produced it.

Verification Steps

Binary (yes/no) questions an LLM must answer to identify this aspect:

1. Were the hypotheses formulated after examining the data, rather than before?
   Type: binary
2. Were many comparisons or subgroup analyses performed?
   Type: binary
3. Are exploratory findings being presented as if they were hypothesis-driven?
   Type: binary
4. Have the findings been replicated in an independent dataset?
   Type: binary

Description

Why It Works

The published result looks identical to a hypothesis-driven finding: clean data, clear statistical test, significant p-value. The reader has no way to know how many tests preceded the reported one.

How to Counter

Distinguish between exploratory and confirmatory analyses. Require replication on independent data for any dredged finding. Apply multiple comparison corrections appropriate to the number of tests actually conducted.
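As a minimal sketch of the multiple comparison corrections mentioned above, the code below implements two standard procedures: Bonferroni (controls the chance of any false positive) and Benjamini-Hochberg (controls the false discovery rate). The function names and the sample p-values are illustrative assumptions. The optional n_tests argument reflects the point that the correction must use the number of tests actually conducted, not just the number reported.

```python
def bonferroni(p_values, alpha=0.05, n_tests=None):
    """Flag p-values that survive a Bonferroni correction.

    n_tests should be the number of tests actually run; it defaults to
    len(p_values), which is only correct if every test is included.
    """
    m = n_tests if n_tests is not None else len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Made-up p-values for illustration only
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in p_vals), "significant before correction")
print(sum(bonferroni(p_vals)), "significant after Bonferroni")
print(sum(benjamini_hochberg(p_vals)), "significant after Benjamini-Hochberg")

# A lone reported p-value judged against the full search that produced it
print(bonferroni([0.0001], n_tests=124750))
```

The last line shows why the true test count matters: a p-value of 0.0001 looks strong in isolation but does not survive correction when it is one of 124,750 comparisons. Corrections of this kind complement, rather than replace, replication on independent data.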

Real-World Context

Data dredging is facilitated by big data and machine learning, where massive datasets make spurious correlations inevitable. Tyler Vigen's website 'Spurious Correlations' illustrates the absurdity of uncritical data mining by charting pairs of unrelated variables that happen to correlate strongly.