data_dredging
Data dredging is the practice of exhaustively searching through data for any statistically significant patterns without a prior hypothesis, then presenting discovered patterns as if they were predicted in advance. While exploratory data analysis is legitimate when labeled as such, data dredging crosses the line by disguising exploratory findings as confirmatory results. The sheer number of possible correlations in any dataset virtually guarantees that some will pass significance thresholds by chance alone.
A researcher has access to a large health database with 500 variables. After testing all 124,750 possible pairwise correlations, they find that ice cream consumption is significantly correlated with drowning deaths. They publish this as a confirmed finding without mentioning that it was one of nearly 125,000 tests or that both variables are driven by warm weather.
A marketing team records 300 customer attributes and tests all combinations against purchase behavior. They announce a breakthrough finding: customers who prefer blue packaging and own a pet are 40% more likely to buy. The result is almost certainly due to chance, with no theoretical basis and no replication attempt.
A political scientist downloads decades of county-level data with hundreds of economic and social indicators, then runs thousands of regressions until finding that per-capita bowling alley count significantly predicts voter turnout. The finding is published as a novel discovery without acknowledging the exhaustive search that produced it.
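The arithmetic behind the first example can be checked directly. The sketch below is illustrative only: the function name `expected_false_positives` is hypothetical, the 0.05 threshold is an assumed convention, and the calculation assumes that every tested pair is truly unrelated and that the tests are independent, which real datasets rarely satisfy exactly.

```python
import math


def expected_false_positives(n_variables: int, alpha: float = 0.05):
    """Count pairwise tests and the false positives expected by chance.

    Assumes every null hypothesis is true and the tests are independent.
    """
    # Number of distinct variable pairs: n choose 2.
    n_tests = math.comb(n_variables, 2)
    # Each true-null test passes the threshold with probability alpha.
    expected = alpha * n_tests
    # Probability that at least one test passes by chance alone.
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return n_tests, expected, p_at_least_one


n_tests, expected, p_any = expected_false_positives(500)
print(f"pairwise tests: {n_tests:,}")  # 124,750 for 500 variables
print(f"expected chance 'findings' at alpha=0.05: {expected:.1f}")
print(f"probability of at least one: {p_any:.6f}")
```

Under these assumptions, a search over 500 variables is expected to produce thousands of "significant" correlations even when no real relationship exists, so a single reported hit carries little evidential weight on its own.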
Binary (yes/no) questions an LLM must answer to identify this aspect:
Were the hypotheses formulated before or after examining the data?
Type: binary
Were many comparisons or subgroup analyses performed?
Type: binary
Are exploratory findings being presented as if they were hypothesis-driven?
Type: binary
Have the findings been replicated in an independent dataset?
Type: binary
The published result looks identical to a hypothesis-driven finding: clean data, clear statistical test, significant p-value. The reader has no way to know how many tests preceded the reported one.
Distinguish between exploratory and confirmatory analyses. Require replication on independent data for any dredged finding. Apply multiple comparison corrections appropriate to the number of tests actually conducted.
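As a rough illustration of the multiple comparison corrections mentioned above, the following sketch implements two standard procedures, Bonferroni and Benjamini-Hochberg, in plain Python. The function names and the p-values are made up for the example; in practice the list must contain the p-value of every test actually conducted, not only the ones that are reported.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery rate."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests run on the same dataset.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]
print("uncorrected:", sum(naive))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the more conservative choice because it controls the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the rejected hypotheses. Neither correction replaces replication on independent data.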
Data dredging is facilitated by big data and machine learning, where massive datasets make spurious correlations inevitable. The website 'Spurious Correlations' by Tyler Vigen illustrates the absurdity of uncritical data mining.
Using information that was not available at the point in time being analyzed.
Presenting post-hoc hypotheses as if they were formulated before seeing the data.
Splitting a single study into multiple publications to inflate publication count.