Mar 29, 2026 · 8 min read

Underpowered Study: The Silence That Isn't Absence

A doctor tests a new pain medication against a placebo in a trial of twenty patients. The result is not statistically significant. The drug is declared ineffective and shelved. But here is the problem: with only twenty patients, the trial could not have detected a modest but clinically meaningful pain reduction even if it genuinely existed. The study was underpowered — built too small to answer the question it was asked. Its negative result is not evidence of no effect. It is evidence of insufficient evidence. The difference matters enormously, and it is one of the most persistently misunderstood distinctions in all of science.

What Statistical Power Actually Means

Statistical power is the probability that a study will detect a real effect if a real effect exists. Formally, it is 1 minus the Type II error rate (β): the probability of correctly rejecting a false null hypothesis. A study with 80% power has an 80% chance of returning a statistically significant result when the effect being studied is genuine. Equivalently, it has a 20% chance of returning a false negative — of concluding "no effect" when an effect is there.
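In symbols, and as a rough normal-approximation sketch for a comparison of two group means (this approximation is standard textbook material, not a formula taken from the sources below):

$$ \text{power} \;=\; P(\text{reject } H_0 \mid H_1 \text{ true}) \;=\; 1-\beta \;\approx\; \Phi\!\left(\frac{\delta}{\sigma}\sqrt{\frac{n}{2}} \;-\; z_{1-\alpha/2}\right) $$

Here δ is the true mean difference, σ the common standard deviation, n the per-group sample size, Φ the standard normal CDF, and z_{1−α/2} the two-sided critical value. Each of the factors discussed next appears directly in this expression.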

By convention, the field has historically targeted 80% power as a minimum acceptable threshold. This means accepting a 20% risk of missing real effects — a tolerance many researchers and clinicians would find unacceptably high if they thought it through. In practice, many published studies fall well below even this modest target, with estimated power of 30–50% in some subfields, meaning that a typical individual study in those domains is at least as likely to miss a real effect as to detect it.

Power depends on three factors (a short numerical sketch follows the list):

  • Sample size: Larger samples provide more precise estimates and more power to detect effects. This is the factor researchers can most directly control.
  • Effect size: Larger effects are easier to detect. A drug that reduces blood pressure by 30 mmHg is much easier to demonstrate than one that reduces it by 3 mmHg.
  • Significance threshold (α): A more lenient threshold (e.g., α = 0.10 instead of the conventional 0.05) increases power but also increases the risk of false positives.
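To make the trade-offs concrete, here is a minimal sketch using the statsmodels power module; the effect sizes (Cohen's d) and group sizes are illustrative choices, not figures from any study mentioned here.

```python
# A rough illustration of the three levers, using statsmodels' power module.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Baseline: a "medium" effect (d = 0.5), 20 per group, alpha = 0.05.
print(analysis.power(effect_size=0.5, nobs1=20, alpha=0.05))   # ≈ 0.34

# Lever 1, sample size: the same effect with 64 per group.
print(analysis.power(effect_size=0.5, nobs1=64, alpha=0.05))   # ≈ 0.80

# Lever 2, effect size: a larger effect (d = 0.8) with the small sample.
print(analysis.power(effect_size=0.8, nobs1=20, alpha=0.05))   # ≈ 0.69

# Lever 3, significance threshold: the small study with a more lenient alpha.
print(analysis.power(effect_size=0.5, nobs1=20, alpha=0.10))   # ≈ 0.47
```

The same medium effect that a 64-per-group design detects about 80% of the time is missed roughly two-thirds of the time by a 20-per-group design.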

The Consequences of Low Power

The False Negative Flood

The most direct consequence is a literature full of false negatives: real effects that failed to reach significance not because they do not exist but because the study was too small to see them. This is a particular problem in pilot studies, in early-phase clinical trials, and in any domain where data collection is expensive — which includes almost all of medicine, psychology, and economics. A researcher who runs an underpowered study and fails to find an effect is not contributing null evidence to the scientific record in any meaningful sense; they are adding statistical noise.

Winner's Curse and Inflated Effect Sizes

Underpowered studies have a subtler and more insidious effect on the scientific literature: when they do find significance, the effect sizes they report are systematically inflated. This is known as the "winner's curse" or Type M (magnitude) error. To clear the significance threshold in a noisy, small-sample study, a result must be unusually large — it must be a lucky overestimate of the true effect. The studies that reach publication from underpowered designs therefore tend to report effect sizes substantially larger than the true value, which subsequent larger studies then fail to replicate.
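The inflation is easy to demonstrate with a short simulation. Under illustrative assumptions (a true standardized effect of d = 0.3 and twenty participants per group, numbers chosen for this sketch rather than taken from any study above), the studies that happen to cross p < 0.05 report an effect more than twice the true size:

```python
# Winner's curse in miniature: among underpowered studies that reach p < 0.05,
# the reported effect size is systematically inflated relative to the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, n_studies = 0.3, 20, 20_000   # illustrative true effect and per-group n

significant_estimates = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    result = stats.ttest_ind(treatment, control)
    if result.pvalue < 0.05:
        # Observed Cohen's d, using a pooled-standard-deviation approximation.
        pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
        significant_estimates.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"true effect: d = {true_d}")
print(f"share reaching significance (empirical power): "
      f"{len(significant_estimates) / n_studies:.2f}")        # roughly 0.15
print(f"mean effect reported by the 'winners': "
      f"{np.mean(significant_estimates):.2f}")                 # roughly 0.75-0.8
```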

This mechanism goes a long way toward explaining the replication crisis in psychology and medicine. When effect sizes in the original literature are systematically inflated, follow-up studies designed to replicate those effects at their reported magnitude will frequently fail — not because the original findings were fabricated, but because they were unlucky overestimates that cleared the significance bar in an underpowered study.

The Asymmetry of Interpretation

Perhaps the most consequential problem is how underpowered null results are interpreted. When a study finds p > 0.05, the result is routinely described as showing "no significant difference," "no evidence of an effect," or even "no effect." Each of these framings progressively misrepresents the actual finding. The correct interpretation is: "this study did not produce sufficient evidence to conclude there is an effect." That is a statement about the study's sensitivity, not about the world.

This confusion between "absence of evidence" and "evidence of absence" is one of the most common logical errors in scientific reporting, in media coverage of research, and in regulatory decision-making. A drug trial that fails to show benefit is not the same as a drug trial that demonstrates harm. A psychology experiment that fails to replicate a finding is not automatically evidence that the original finding was wrong — though it may be evidence that the effect size was smaller than originally believed.

Power Analysis: Designing to Answer

The solution to underpowered research is, in principle, straightforward: conduct a power analysis before collecting data, and collect enough of it. A prospective power analysis requires the researcher to specify:

  • The smallest effect size that would be scientifically or clinically meaningful (not the largest they hope to find).
  • The desired level of power (typically 80%, ideally higher).
  • The significance threshold to be applied (typically 0.05).

From these inputs, standard formulas or software (G*Power is a commonly used tool) calculate the required sample size. The critical word is "meaningful": a power analysis based on an optimistic effect size estimate produces an underpowered study whenever the true effect turns out to be smaller than the one assumed. The appropriate input is the minimum effect worth caring about, not the maximum effect you think you might find.
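As a sketch of the same calculation tools like G*Power perform, here is a statsmodels version; the d = 0.3 minimum meaningful effect is an illustrative choice, not a recommendation from the article:

```python
# Prospective power analysis: given the smallest effect still worth detecting,
# the target power, and alpha, solve for the required per-group sample size.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.3,   # smallest standardized effect that would still matter
    power=0.80,        # desired probability of detecting it if it is real
    alpha=0.05,        # two-sided significance threshold
)
print(math.ceil(n_per_group))   # ≈ 176 participants per group
```

Plugging in an optimistic d = 0.5 instead would suggest only about 64 participants per group, a design that is badly underpowered if the true effect is closer to 0.3.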

Post-hoc power analysis — computing power after the study has finished and the result is known — is a different matter. When applied to a non-significant result, it typically produces a tautological finding ("given that we didn't reach significance, our power was low"), because power and significance are mathematically linked when the observed effect size is used as the assumed effect. Post-hoc power is primarily useful as a red flag in reading others' research, not as a tool for drawing conclusions from one's own.
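The circularity is easy to see with a z-test: if the observed effect is reused as the assumed true effect, "observed power" becomes a pure function of the p-value, and a just-significant result always comes out at roughly 50% power. A minimal sketch (the z-test simplification is mine, chosen to keep the algebra transparent):

```python
# Post-hoc ("observed") power, with the observed effect plugged back in as the
# assumed true effect, is determined entirely by the p-value.
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    z_obs = norm.ppf(1 - p_value / 2)     # |z| implied by a two-sided p-value
    z_crit = norm.ppf(1 - alpha / 2)
    # Chance that a replicate with exactly the observed effect clears the threshold.
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

print(round(observed_power(0.05), 2))   # ≈ 0.50: a just-significant result always "had" ~50% power
print(round(observed_power(0.20), 2))   # ≈ 0.25: the larger the p-value, the lower the "observed" power
```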

The Replication Crisis and Underpowered Science

The past decade of meta-science has dramatically raised awareness of underpowering as a structural problem in academic research. Ioannidis's landmark 2005 paper "Why Most Published Research Findings Are False" showed mathematically that in fields with low prior probability, small samples, and flexible analysis methods, the majority of statistically significant findings will be false positives. The statistical power issue feeds directly into this: low-powered studies produce inflated effect estimates when they do find significance, and those effects subsequently fail to replicate.
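The arithmetic behind that claim is short. Under illustrative assumptions about the prior probability that a tested hypothesis is true, and ignoring the bias terms Ioannidis also models, the positive predictive value of a significant finding falls sharply as power drops:

```python
# Positive predictive value (PPV): the chance a statistically significant
# finding reflects a real effect, given the prior, the power, and alpha.
# All numbers below are illustrative, not estimates for any particular field.
def ppv(prior, power, alpha):
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(round(ppv(prior=0.10, power=0.20, alpha=0.05), 2))   # ≈ 0.31: most "discoveries" are false
print(round(ppv(prior=0.10, power=0.80, alpha=0.05), 2))   # ≈ 0.64: more power, higher PPV
print(round(ppv(prior=0.50, power=0.80, alpha=0.05), 2))   # ≈ 0.94: plausible hypothesis plus power
```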

Button et al. (2013) analysed statistical power in neuroscience and found a median power of around 20% — meaning that even real effects would be missed 80% of the time by typical studies in that field. The implication is stark: a negative result in an underpowered field carries almost no inferential weight. It tells you nothing about the world; it tells you only that the study was too small.
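That lack of inferential weight can be quantified as a likelihood ratio: how probable is a non-significant result if the effect is real versus if it is not? With 20% power the two probabilities are nearly equal, so the null result barely moves the posterior odds (round numbers, illustrative only):

```python
# Likelihood ratio of a non-significant result:
# P(non-significant | real effect) / P(non-significant | no effect).
alpha = 0.05

for power in (0.20, 0.90):
    lr = (1 - power) / (1 - alpha)
    print(f"power = {power:.0%}: LR = {lr:.2f}")
# power = 20%: LR ≈ 0.84 (the null result is nearly uninformative)
# power = 90%: LR ≈ 0.11 (the null result is real evidence against a meaningful effect)
```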

This connects to the p-hacking problem from the other direction: researchers under publication pressure may keep collecting data until they reach significance, or may selectively report the analyses that worked. Both practices inflate the false-positive rate. Underpowered research and flexible analysis practices are both consequences of the same underlying system: one in which statistically significant results are disproportionately publishable, creating incentives that distort the evidence base.

What Null Results Actually Tell Us

A well-powered null result — one from a study designed with adequate sample size to detect a meaningful effect if it existed — is genuinely informative. It is evidence that the true effect, if any, is smaller than the minimum detectable effect in the study design. This is valuable knowledge: it constrains the plausible effect size. An underpowered null result constrains nothing. It is a blank page.
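One concrete way to express that constraint is the design's minimum detectable effect: the smallest standardized effect the study had, say, 80% power to find. A sketch using statsmodels, with illustrative group sizes:

```python
# Minimum detectable effect (MDE): the smallest Cohen's d this design had 80%
# power to detect. A well-powered null result says the true effect is plausibly
# below this value; an underpowered null result constrains almost nothing.
from statsmodels.stats.power import TTestIndPower

for n_per_group in (10, 50, 200):
    mde = TTestIndPower().solve_power(nobs1=n_per_group, power=0.80, alpha=0.05)
    print(f"n = {n_per_group:>3} per group -> minimum detectable d ≈ {mde:.2f}")
# n =  10 -> d ≈ 1.3;  n =  50 -> d ≈ 0.57;  n = 200 -> d ≈ 0.28 (approximate)
```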

Increasingly, journals are recognising this and publishing "registered reports" — articles in which peer review occurs before data collection, judging the quality of the design rather than the significance of the result. If the study is well-designed and adequately powered, it will be published regardless of outcome. This structural change removes the publication incentive that drives underpowering, and it is one of the most promising reforms in scientific methodology to emerge from the replication crisis.

The deeper lesson is one of epistemic humility. Negative results are not neutral. They are either informative (if the study had adequate power) or uninformative (if it did not). Conflating the two is one of the most dangerous errors in scientific reasoning — it fills the literature with apparent null evidence, discourages further investigation of real effects, and creates false confidence that questions have been answered when they have barely been asked.

Sources & Further Reading

  • Cohen, J. "A Power Primer." Psychological Bulletin 112, no. 1 (1992): 155–159. (The foundational text on effect sizes and statistical power.)
  • Ioannidis, J. P. A. "Why Most Published Research Findings Are False." PLOS Medicine 2, no. 8 (2005): e124.
  • Button, K. S., et al. "Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience." Nature Reviews Neuroscience 14, no. 5 (2013): 365–376.
  • Gelman, A., & Carlin, J. "Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors." Perspectives on Psychological Science 9, no. 6 (2014): 641–651.
  • Lakens, D. "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-tests and ANOVAs." Frontiers in Psychology 4 (2013): 863.
  • Wikipedia: Statistical power
