Insensitivity to Sample Size: Why Small Numbers Lie to Us
Imagine two hospitals. The small one delivers about 15 babies per day; the large one, around 45. On any given day, roughly half of all newborns are boys. Each hospital keeps a record of every day on which more than 60% of the babies born were male. At year's end, which hospital recorded more such exceptional days? Most people guess that the two hospitals should be roughly equal. The correct answer: the small hospital, by a wide margin. This is insensitivity to sample size in its purest form, and it reveals a deep flaw in how human intuition handles uncertainty.
The Hospital Problem
The hospital scenario was introduced by Amos Tversky and Daniel Kahneman in their landmark 1974 paper "Judgment Under Uncertainty: Heuristics and Biases." It illustrates what they called the law of small numbers — a tongue-in-cheek reference to the genuine statistical law of large numbers, which states that as a sample grows larger, its statistics converge toward the true population value.
The law of large numbers implies, logically, a law of small numbers: small samples are not representative. A small hospital with 15 daily births will naturally produce far more extreme results — days with 70% boys, days with 80% girls — simply due to random variation. With only 15 observations, extreme deviations from 50% are common. With 45 observations, they are much rarer. Statistics students know this as sampling error or sampling variance. Human intuition does not naturally understand it.
In Tversky and Kahneman's experiment, participants treated the two hospitals as equally likely to produce extreme results. They were essentially applying a principle of representativeness: the distribution of a sample should look like the distribution of the population, regardless of size. This is precisely backwards. Small samples are more extreme, not less.
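The hospital arithmetic can be checked exactly rather than by intuition. A minimal sketch in Python (the function name and the 365-day scaling are illustrative, not from the original study):

```python
from math import comb

def p_extreme(n, threshold=0.6, p=0.5):
    """Probability that strictly more than a `threshold` fraction
    of n births are boys, with each birth a boy with probability p."""
    cutoff = int(n * threshold)  # need strictly more boys than this
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(cutoff + 1, n + 1))

for n in (15, 45):
    prob = p_extreme(n)
    print(f"{n} births/day: P(>60% boys) = {prob:.3f}, "
          f"about {prob * 365:.0f} days per year")
```

With 15 births per day the probability of an extreme day is about 0.15, more than twice the large hospital's, so over a year the small hospital records roughly twice as many 60%-plus days.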
Why Our Intuition Gets This Wrong
The error isn't stupidity. It's a systematic feature of how the mind models probability. When we judge whether a sample is "representative," we compare it to our mental model of the population — and we don't automatically adjust for size. A 70% proportion feels unlikely whether it comes from 10 births or 10,000. Our intuitive feel for probability is largely insensitive to the denominator.
This is closely related to the representativeness heuristic: we judge likelihood by how much a sample looks like what we expect from the population. But representativeness is a description of pattern, not of probability. A coin showing 7 heads in 10 flips doesn't look representative (we expect 5), but it is statistically quite common. A coin showing 700 heads in 1,000 flips looks equally unrepresentative — but is astronomically unlikely. The intuitive system treats both with similar suspicion, when they are profoundly different situations.
Kahneman later reflected that people tend to expect even small samples to reflect population characteristics faithfully — what he called "believing in the law of small numbers." This produces a kind of magical thinking about data: we draw conclusions from handfuls of observations with the same confidence we should reserve for properly powered studies.
The Real-World Cost
Medical Research
Early clinical findings are notorious for failing to replicate. A small pilot study of 20 patients showing a dramatic drug effect gets published, generates excitement, influences clinical practice — and then a properly powered study of 2,000 patients shows no effect at all. This isn't fraud; it's the law of small numbers at work. Small samples generate extreme results by chance, and positive extreme results get published, noticed, and acted upon. The "replication crisis" in psychology, medicine, and nutrition science is partly a crisis of insensitivity to sample size at the institutional level: researchers, reviewers, and readers all systematically underestimate how noisy small samples are.
The same dynamic operates in how patients interpret their own experience. "I took vitamin C and my cold cleared up in three days" is a sample of one. The base rate of cold duration, the regression to the mean, the natural course of illness — none of this competes effectively with the vivid personal data point. See also: availability heuristic, where vivid personal experience outweighs statistical abstraction.
Online Reviews
A restaurant with 4 reviews averaging 4.8 stars has a rating that could be wildly off; a restaurant with 1,200 reviews averaging 4.1 stars is giving you much more reliable information, yet the first one will feel more impressive at a glance. We see the star number before we register the review count, and even when we see both, we don't intuitively discount the small-sample rating appropriately. This is why new products and businesses often show extreme ratings: a handful of enthusiastic early adopters (or disappointed first customers) produces a signal that would disappear in a larger sample.
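A small simulation makes the point concrete. Here the "true" review distribution is a made-up one with a long-run mean of 4.1 stars; the weights and trial counts are illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical "true" distribution of individual reviews (long-run mean: 4.1)
STARS = [1, 2, 3, 4, 5]
WEIGHTS = [0.05, 0.05, 0.15, 0.25, 0.50]

def average_rating(n_reviews):
    """Average star rating over n_reviews randomly drawn reviews."""
    reviews = random.choices(STARS, weights=WEIGHTS, k=n_reviews)
    return sum(reviews) / n_reviews

small = [average_rating(4) for _ in range(2000)]
large = [average_rating(1200) for _ in range(2000)]
print(f"4 reviews:    averages range from {min(small):.2f} to {max(small):.2f}")
print(f"1200 reviews: averages range from {min(large):.2f} to {max(large):.2f}")
```

Both sets of restaurants are identical by construction; only the 1,200-review averages stay close to the true 4.1, while the 4-review averages swing across most of the scale.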
A/B Testing
In digital marketing and product development, A/B testing is now standard practice. But insensitivity to sample size has produced an epidemic of "peeking": checking results before the pre-specified sample size is reached and stopping the test early when the result looks good. This dramatically inflates false positive rates. A/B test results based on a few dozen conversions are nearly meaningless — yet organisations routinely make product decisions, pricing choices, and UX changes on exactly this basis. The intuition that a 60/40 split in 30 trials is meaningful is the same intuition that expects small hospitals to match large ones.
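The inflation from peeking can be demonstrated with an A/A test, where both variants are identical and any "significant" result is by definition a false positive. A sketch (plain Python; the 10% conversion rate, sample sizes, peek schedule, and trial counts are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)

def significant(a, b, n):
    """Two-proportion z-test at alpha = 0.05 (normal approximation)."""
    p = (a + b) / (2 * n)              # pooled conversion rate
    if p == 0 or p == 1:
        return False
    se = math.sqrt(2 * p * (1 - p) / n)
    return abs(a - b) / n / se > 1.96

def aa_test_fires(total_n, peeks):
    """Run one A/A test; True if any peek declares significance."""
    checkpoints = {total_n * i // peeks for i in range(1, peeks + 1)}
    a = b = 0
    for n in range(1, total_n + 1):
        a += random.random() < 0.10    # both variants convert at 10%
        b += random.random() < 0.10
        if n in checkpoints and significant(a, b, n):
            return True
    return False

def false_positive_rate(peeks, trials=2000, total_n=1000):
    return sum(aa_test_fires(total_n, peeks) for _ in range(trials)) / trials

print(f"1 look at the end: {false_positive_rate(1):.1%}")
print(f"20 interim peeks:  {false_positive_rate(20):.1%}")
```

A single look at the pre-specified end point holds the false positive rate near the nominal 5%; checking twenty times and stopping at the first "win" multiplies it severalfold, even though nothing about the data changed.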
Sports and Performance
Early-season statistics in sports drive enormous amounts of commentary, prediction, and management decisions. A batter hitting .380 through 25 games, a quarterback with a perfect passer rating through two weeks — these are noise, not signal. Regression to the mean will relentlessly pull extreme early performances toward the population average over a full season. Yet sports media, fans, and often coaches treat small samples as if they revealed stable truth. Players are benched, managers are fired, strategies are overhauled based on what amounts to statistical noise. This is the gambler's fallacy's quieter cousin: instead of expecting randomness to correct itself, we mistake randomness for a real pattern.
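Regression to the mean can be simulated directly. In the sketch below every batter has the same true ability, .260, so the early-season leader is leading purely by luck (all the numbers are illustrative assumptions, not real baseball data):

```python
import random

random.seed(7)

TRUE_AVG = 0.260               # every batter's true hit probability
EARLY_AB, REST_AB = 100, 450   # early-season vs rest-of-season at-bats

def batting_average(at_bats):
    """Simulated batting average over a given number of at-bats."""
    return sum(random.random() < TRUE_AVG for _ in range(at_bats)) / at_bats

# For 100 identical batters, record early and rest-of-season averages
seasons = [(batting_average(EARLY_AB), batting_average(REST_AB))
           for _ in range(100)]
early_leader, leader_rest = max(seasons)   # batter with the hottest start
print(f"Early leader: {early_leader:.3f} over first {EARLY_AB} at-bats")
print(f"Same batter:  {leader_rest:.3f} over the remaining {REST_AB}")
```

The hot start tells you nothing about the rest of the season, because by construction there is nothing to learn: everyone's true average is .260, and the leader's rest-of-season numbers fall back toward it.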
The Research Laboratory
Tversky and Kahneman showed that the bias affected not just lay people but researchers themselves. In a 1971 paper ("Belief in the Law of Small Numbers"), they surveyed academic psychologists and found that even experienced researchers significantly overestimated the replicability of findings from small samples. They routinely designed studies with insufficient statistical power, expecting that a small, well-designed study should produce a reliable result. They were wrong, and they knew enough statistics to know better — but their intuitive confidence outran their statistical knowledge.
The practical consequence is that researchers sample too few subjects, trust early results too much, and stop data collection as soon as results look significant. Each of these behaviours is partially driven by insensitivity to sample size.
Improving Calibration
The good news is that this bias is correctable with explicit statistical reasoning — which is precisely what it requires, because intuition won't do it automatically.
- Always note the sample size first. Before interpreting any result, statistic, or review, note n. A finding based on 12 observations should feel very different from one based on 1,200.
- Calculate confidence intervals. A proportion of 60% from a sample of 10 has a 95% confidence interval roughly spanning 26% to 88% — nearly the entire range. The same proportion from 1,000 observations spans 57% to 63%. Confidence intervals make uncertainty visible in a way that point estimates don't.
- Pre-register sample sizes. In research and business experimentation, decide before collecting data how large the sample must be to detect a meaningful effect. Don't peek. Don't stop early.
- Distrust early-season, early-launch, or early-trial data. Apply a deliberate mental discount to extreme results from small samples. The signal-to-noise ratio is low; wait for more data.
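The intervals quoted above match the exact (Clopper–Pearson) binomial interval, which needs nothing beyond the binomial CDF and a bisection search. A sketch (plain Python; the 60 bisection steps are an arbitrary precision choice):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(successes, n, alpha=0.05):
    """Exact (1 - alpha) binomial confidence interval via bisection."""
    def solve(k, target):
        lo, hi = 0.0, 1.0
        for _ in range(60):            # binom_cdf(k, n, p) decreases in p
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if successes == 0 else solve(successes - 1, 1 - alpha / 2)
    upper = 1.0 if successes == n else solve(successes, alpha / 2)
    return lower, upper

for k, n in ((6, 10), (600, 1000)):
    lo, hi = clopper_pearson(k, n)
    print(f"{k}/{n}: 95% CI = [{lo:.0%}, {hi:.0%}]")
```

This reproduces the roughly 26%–88% interval for 6 of 10 and the 57%–63% interval for 600 of 1,000: the same point estimate, wildly different uncertainty.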
The hospital problem seems abstract until you realise it's about whether you change your diet based on one week of results, fire a manager after a bad quarter, or restructure a product based on a handful of user complaints. Wherever small numbers look like big truth, the law of small numbers is operating — and getting it wrong.
Sources & Further Reading
- Tversky, A., & Kahneman, D. "Judgment Under Uncertainty: Heuristics and Biases." Science 185, no. 4157 (1974): 1124–1131.
- Tversky, A., & Kahneman, D. "Belief in the Law of Small Numbers." Psychological Bulletin 76, no. 2 (1971): 105–110.
- Kahneman, D. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. Chapter 10: "The Law of Small Numbers."
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. "False-Positive Psychology." Psychological Science 22, no. 11 (2011): 1359–1366.
- Wikipedia: Insensitivity to sample size