multiple_comparisons_problem
The statistical error of performing many tests without adjusting for the increased probability of false positives. With a significance level of 0.05 and 20 independent tests in which every null hypothesis is true, there is about a 64% chance of at least one false positive (1 − 0.95^20 ≈ 0.64). Failure to correct for this inflates the apparent number of 'significant' findings.
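The 64% figure can be reproduced with a short calculation. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true; the function name familywise_error_rate is illustrative and does not come from the platform.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across num_tests
    independent tests, assuming every null hypothesis is true."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(f"{m:>3} tests: {familywise_error_rate(0.05, m):.3f}")
```

With alpha = 0.05 the rate climbs from 5% for a single test to roughly 64% for 20 tests, which is why a per-test threshold says little about the family of tests as a whole.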
A brain imaging study tests 100,000 voxels for activation. At p < 0.05, about 5,000 voxels are expected to appear significant by chance alone even if no voxel is truly active, potentially producing spurious 'brain activation' maps.
A nutrition researcher surveys 500 participants on 80 different dietary habits and tests each one for correlation with heart disease risk. At p < 0.05, roughly four associations are expected to appear significant purely by chance even if no habit has any real effect. The researcher publishes the 'finding' that eating soup three times a week reduces risk, without correcting for multiple comparisons.
A social media company's data science team runs A/B tests on 200 minor interface variations in a single month, each evaluated at p < 0.05. Statistically, about 10 of those tests will show a 'significant' effect even if none of the changes actually influence user behavior, leading the team to roll out ineffective features confidently.
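The expected counts in these examples (the number of tests times 0.05) can be checked with a small simulation. This sketch assumes independent tests, no real effects, and p-values that are uniform on [0, 1] under the null hypothesis; the function names, the seed, and the number of trials are illustrative choices.

```python
import random


def count_false_positives(num_tests: int, alpha: float, rng: random.Random) -> int:
    """Count tests that come out 'significant' when no effect exists."""
    # Under a true null hypothesis a valid p-value is uniform on [0, 1],
    # so each test falls below alpha with probability alpha.
    return sum(rng.random() < alpha for _ in range(num_tests))


def mean_false_positives(num_tests: int, alpha: float,
                         trials: int = 2000, seed: int = 0) -> float:
    """Average number of false positives over many simulated studies."""
    rng = random.Random(seed)
    total = sum(count_false_positives(num_tests, alpha, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # 80 dietary habits and 200 interface variations, as in the examples above
    for m in (80, 200):
        print(m, "tests:", mean_false_positives(m, 0.05))
```

The averages should land close to 4 and 10 respectively, while any single simulated study can produce noticeably more or fewer false positives.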
Binary (yes/no) questions an LLM must answer to identify this aspect:
Are multiple statistical tests being performed on the same dataset?
Is the significance threshold (alpha) applied per-test rather than adjusted for the total number of tests?
Does the number of tests substantially increase the probability of at least one false positive?
Each individual test seems legitimate at the 0.05 level. The cumulative false positive rate is counterintuitive because people think about each test in isolation rather than as part of a family of tests.
Apply a multiple comparison correction, such as Bonferroni, false discovery rate (FDR) control (for example, the Benjamini-Hochberg procedure), or permutation testing. Pre-register hypotheses to distinguish confirmatory from exploratory analysis.
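Two of these corrections can be sketched in a few lines of plain Python. The Bonferroni function compares each p-value against alpha divided by the number of tests; the second function uses the Benjamini-Hochberg step-up procedure, one common way to control the FDR. The function names and the p-values are made up for illustration; in practice an established statistics library would be the safer choice.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests (illustrative, not real data)
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

if __name__ == "__main__":
    print("uncorrected:", sum(p < 0.05 for p in p_values))
    print("Bonferroni: ", sum(bonferroni_reject(p_values)))
    print("BH (FDR):   ", sum(benjamini_hochberg_reject(p_values)))
```

On these illustrative values, five tests pass the uncorrected 0.05 threshold, one survives Bonferroni, and two survive Benjamini-Hochberg, which shows the usual trade-off: Bonferroni is stricter, while FDR control keeps more power at the cost of tolerating a controlled share of false discoveries.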
Common in genomics, neuroimaging, clinical trials with multiple endpoints, and any large-scale data analysis.