Multiple Comparisons Problem

Also Known As: Look-Elsewhere Effect, Multiple Testing Problem, Multiplicity

Discourse Mechanics ID: multiple_comparisons_problem

Definition

The statistical error of performing many tests without adjusting for the increased probability of false positives. With a significance level of 0.05 and 20 independent tests, there is roughly a 64% chance of at least one false positive even when no real effect exists (1 - 0.95^20 ≈ 0.64). Failure to correct for this inflates the apparent number of 'significant' findings.
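The 64% figure comes from the family-wise error rate for m independent tests at level alpha, which is 1 - (1 - alpha)^m when every null hypothesis is true. The following is a minimal Python sketch of that arithmetic; the function name `familywise_error_rate` is illustrative and not part of the source.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests,
    assuming every null hypothesis is true."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # P(no false positive in any test) = (1 - alpha) ** num_tests
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 20 independent tests at alpha = 0.05, as in the definition above
    print(f"{familywise_error_rate(0.05, 20):.3f}")  # prints 0.642
```

The formula relies on independence between tests. With correlated tests the exact rate differs, although it stays at or below num_tests * alpha.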

Examples

A brain imaging study tests 100,000 voxels for activation. At p < 0.05, about 5,000 voxels are expected to appear significant by chance alone even if no voxel is truly active, potentially producing spurious 'brain activation' maps.

A nutrition researcher surveys 500 participants on 80 different dietary habits and tests each one for correlation with heart disease risk. At p < 0.05, roughly four associations will appear significant purely by chance. The researcher publishes the 'finding' that eating soup three times a week reduces risk, without correcting for multiple comparisons.

A social media company's data science team runs A/B tests on 200 minor interface variations in a single month, each evaluated at p < 0.05. Statistically, about 10 of those tests will show a 'significant' effect even if none of the changes actually influence user behavior, leading the team to roll out ineffective features confidently.
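The counts in the three examples above are all the same calculation: the number of tests multiplied by the significance threshold. Here is a short Python sketch, assuming no real effect exists in any test; the function name and dictionary labels are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests that cross the threshold by chance alone,
    assuming every null hypothesis is true."""
    return num_tests * alpha


if __name__ == "__main__":
    # Test counts taken from the three examples above
    examples = {
        "brain imaging voxels": 100_000,
        "dietary habits": 80,
        "interface A/B tests": 200,
    }
    for label, num_tests in examples.items():
        count = expected_false_positives(num_tests)
        print(f"{label}: about {count:.0f} expected false positives")
```

Because expectations add up regardless of dependence, this expected count does not require the tests to be independent. Only the spread around it changes.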

Verification Steps

Binary (yes/no) questions an LLM must answer to identify this aspect:

1. Are multiple statistical tests being performed on the same dataset? (Type: binary)
2. Is the significance threshold (alpha) applied per-test rather than adjusted for the total number of tests? (Type: binary)
3. Does the number of tests substantially increase the probability of at least one false positive? (Type: binary)

Description

Why It Works

Each individual test seems legitimate at the 0.05 level. The cumulative false positive rate is counterintuitive because people think about each test in isolation rather than as part of a family of tests.
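A small simulation can make the cumulative rate less counterintuitive. The Python sketch below assumes that every null hypothesis is true and that the tests are independent with continuous test statistics, so each p-value can be drawn as a uniform random number. The function name, seed, and default parameters are illustrative choices.

```python
import random


def simulate_familywise_rate(
    num_tests: int = 20,
    alpha: float = 0.05,
    num_families: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of simulated families of tests that contain at least one
    false positive when no real effect exists."""
    rng = random.Random(seed)
    families_with_false_positive = 0
    for _ in range(num_families):
        # Under a true null hypothesis, a p-value is uniform on [0, 1),
        # so each draw falls below alpha with probability alpha.
        if any(rng.random() < alpha for _ in range(num_tests)):
            families_with_false_positive += 1
    return families_with_false_positive / num_families


if __name__ == "__main__":
    per_test = simulate_familywise_rate(num_tests=1)
    per_family = simulate_familywise_rate(num_tests=20)
    print(f"single test: {per_test:.3f}, family of 20: {per_family:.3f}")
```

Viewed one at a time, each test errs about 5% of the time. Viewed as a family of 20, most families should contain at least one error, close to the 64% computed analytically.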

How to Counter

Apply a multiple comparison correction such as Bonferroni, false discovery rate (FDR) control, or permutation testing. Pre-register hypotheses so that confirmatory analyses can be distinguished from exploratory ones.
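Two of the corrections named above can be sketched in a few lines of plain Python. Bonferroni compares each p-value to alpha divided by the number of tests. For FDR control the sketch uses the Benjamini-Hochberg step-up procedure, which is one common method and an assumption here, since the source does not name a specific one. The function names and p-values are hypothetical, and permutation testing is not covered.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure for FDR control."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected:", sum(p < 0.05 for p in p_values))                 # 5
    print("Bonferroni:", sum(bonferroni_reject(p_values)))                 # 1
    print("Benjamini-Hochberg:", sum(benjamini_hochberg_reject(p_values))) # 2
```

Bonferroni controls the chance of any false positive and is the more conservative of the two. Benjamini-Hochberg instead limits the expected share of false positives among the rejected hypotheses, so it usually retains more findings.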