Mar 29, 2026 · 7 min read

Simpson's Paradox: When the Whole Lies About Its Parts

In 1973, the University of California, Berkeley, was sued for gender bias in graduate admissions. The numbers seemed damning: in the fall of that year, 44% of male applicants were admitted, while only 35% of female applicants were accepted. The disparity was statistically significant. It looked like open-and-shut discrimination. But when statisticians P. J. Bickel, E. A. Hammel, and J. W. O'Connell examined admissions department by department, they found something that turned the headline figure on its head: in most individual departments, women were admitted at the same rate as men, or higher. The university-wide aggregate that screamed bias was an illusion produced by a confounder: where students chose to apply. This is Simpson's Paradox, one of the most counterintuitive and consequential phenomena in statistics.

The Basic Structure

Simpson's Paradox (also called the Yule-Simpson effect) occurs when a statistical association that appears in aggregated data reverses, disappears, or changes direction when the data is disaggregated into subgroups. A trend visible in the whole is contradicted by the trends in each part. The paradox is not a logical error or a data corruption — it is a mathematically legitimate consequence of aggregating groups that differ in size and composition.

The mechanism requires two things: a confounding variable (a grouping variable that affects both the exposure and the outcome), and an imbalance in how groups are represented across categories. When these ingredients are present, the aggregated result can point in the opposite direction from every individual component.
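In symbols, the paradox is just the fact that inequalities between ratios are not preserved when numerators and denominators are added. A minimal sketch of the general statement, with a tiny numerical witness:

\[
\frac{a_1}{b_1} > \frac{c_1}{d_1}
\quad\text{and}\quad
\frac{a_2}{b_2} > \frac{c_2}{d_2}
\quad\not\Rightarrow\quad
\frac{a_1 + a_2}{b_1 + b_2} > \frac{c_1 + c_2}{d_1 + d_2}.
\]

For instance, \(\tfrac{1}{1} > \tfrac{3}{4}\) and \(\tfrac{1}{4} > \tfrac{0}{1}\), yet pooling gives \(\tfrac{2}{5} < \tfrac{3}{5}\): the side that wins both strata loses the aggregate.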

Berkeley: Bias Without Bias

The Berkeley case illustrates how it works. Women were indeed less likely to be admitted overall — but that was because women disproportionately applied to the most competitive departments (English, History, Social Sciences), while men disproportionately applied to less competitive departments (Engineering, Chemistry, Physics) that admitted a much higher fraction of all applicants. Within any single department, women were being admitted at rates comparable to or slightly better than men. The aggregate disparity was entirely explained by the different application patterns, not by differential treatment within departments.

The confounding variable was the choice of department: it simultaneously predicted whether an applicant was male or female (women preferred humanities, men preferred sciences) and predicted admission probability (humanities departments were harder to get into). Because the confounder was invisible in the aggregate numbers, the aggregate told a completely false story.

The researchers published their analysis in Science in 1975. Their conclusion: there was, if anything, a "small but statistically significant bias in favor of women." The discrimination lawsuit was based on a statistical artefact.

Kidney Stones: Which Treatment Is Better?

A 1986 study by Charig, Webb, Payne, and Wickham compared the success rates of two kidney stone treatments: Treatment A (open surgery) and Treatment B (percutaneous nephrolithotomy — a minimally invasive keyhole procedure).

Looking at the overall data:

  • Treatment A: 273 successes out of 350 patients = 78% success rate
  • Treatment B: 289 successes out of 350 patients = 83% success rate

Treatment B looks better. But when the patients are divided by kidney stone size:

  • Small stones: Treatment A: 93% (81/87) vs Treatment B: 87% (234/270)
  • Large stones: Treatment A: 73% (192/263) vs Treatment B: 69% (55/80)

Treatment A is better for both small stones and large stones. Yet Treatment B looks better overall. How?

The confounder is stone size. Surgeons tended to use the invasive Treatment A for the hardest cases — large stones — and reserve the gentler Treatment B for small stones. Large stones have worse outcomes regardless of treatment, so Treatment A's aggregate numbers are dragged down by handling the difficult cases. Treatment B's numbers are inflated by handling the easy ones. Compare within the appropriate category, and the direction reverses.
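The reversal is easy to verify by hand or in a few lines of code. A minimal sketch in Python using the counts above (the data layout is mine; only the numbers come from the study):

```python
# Charig et al. (1986) kidney stone counts: (successes, patients)
# keyed by (treatment, stone size). The rates reverse between views.
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

# Within each stratum, Treatment A wins:
for size in ("small", "large"):
    a_s, a_n = data[("A", size)]
    b_s, b_n = data[("B", size)]
    print(f"{size} stones: A = {a_s / a_n:.0%}, B = {b_s / b_n:.0%}")

# Pool across stone sizes and the direction flips to B:
for t in ("A", "B"):
    s = sum(data[(t, size)][0] for size in ("small", "large"))
    n = sum(data[(t, size)][1] for size in ("small", "large"))
    print(f"overall {t}: {s / n:.0%}")
```

Running it reproduces the figures above: small stones A 93% vs B 87%, large stones A 73% vs B 69%, yet overall A 78% vs B 83%.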

Baseball Batting Averages: David Justice and Derek Jeter

One of the most beloved examples among statisticians involves Derek Jeter and David Justice, whose batting averages in the 1995 and 1996 seasons illustrate Simpson's Paradox in miniature:

  • 1995: David Justice .253, Derek Jeter .250 → Justice wins
  • 1996: David Justice .321, Derek Jeter .314 → Justice wins again

But combined over both years: Derek Jeter hit .310 overall, David Justice .270. Jeter's combined average is higher, even though Justice beat him in both individual years. The reason: Jeter had very few at-bats in 1995 (his .250 came in just 48 at-bats), so his combined average is dominated by his excellent 1996 performance. Justice had far more at-bats in 1995 than in 1996, so his combined average is pulled down by his weaker performance that year, despite his beating Jeter in both seasons individually. The group sizes create the paradox.
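Seen as code, a combined average is just an at-bat-weighted mean of the seasonal averages, so whoever concentrates their at-bats in their better season comes out ahead. A sketch using the commonly cited hits and at-bats behind these averages:

```python
# (hits, at_bats) per season; the commonly cited figures for this example.
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def combined(seasons):
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(ab for _, ab in seasons.values())
    return hits / at_bats  # an at-bat-weighted mean of seasonal averages

for year in ("1995", "1996"):
    jh, jab = jeter[year]
    dh, dab = justice[year]
    print(year, f"Jeter {jh/jab:.3f}  Justice {dh/dab:.3f}")  # Justice wins both

print("combined:", f"Jeter {combined(jeter):.3f}  Justice {combined(justice):.3f}")
# Jeter .310 vs Justice .270: Jeter's weight sits almost entirely on 1996,
# Justice's on his weaker 1995.
```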

COVID-19 and Vaccine Effectiveness

During the COVID-19 pandemic, Simpson's Paradox surfaced in vaccine effectiveness statistics and was quickly weaponised by misinformation. Some aggregate statistics from the UK in late 2021 showed that vaccinated people made up a large share of COVID hospitalisations, which some interpreted as evidence that vaccines were not working or were making things worse.

The explanation was Simpson's Paradox in action. The vaccinated population was heavily skewed toward older, more vulnerable people who had been prioritised for vaccination. Older people are much more likely to be hospitalised regardless of vaccination status. When you control for age — compare vaccinated and unvaccinated people in the same age bracket — vaccination dramatically reduces hospitalisation risk in every group. The aggregate looked alarming; the disaggregated truth was reassuring. The confounding variable was age, and ignoring it produced a dangerous statistical illusion.
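The arithmetic behind that illusion is easy to reproduce. A sketch with deliberately invented round numbers (these are not the actual UK figures) showing how a high-risk group that is almost fully vaccinated can dominate the aggregate even when vaccination cuts risk tenfold in every age band:

```python
# Invented illustrative numbers, NOT real UK data:
# (population, vaccinated share, hospitalisation risk if unvaccinated).
# Assume vaccination cuts risk by 90% in every band.
bands = {
    "under 50": (1_000_000, 0.60, 0.002),
    "over 50":  (1_000_000, 0.95, 0.050),
}
RISK_REDUCTION = 0.90

hosp_vax = hosp_unvax = 0.0
for name, (pop, vax_share, base_risk) in bands.items():
    vaxxed, unvaxxed = pop * vax_share, pop * (1 - vax_share)
    hosp_vax += vaxxed * base_risk * (1 - RISK_REDUCTION)
    hosp_unvax += unvaxxed * base_risk
    print(f"{name}: vaccinated risk {base_risk * (1 - RISK_REDUCTION):.2%}, "
          f"unvaccinated risk {base_risk:.2%}")

# Despite the 10x lower risk in every band, vaccinated people are the
# majority of hospitalisations, because nearly all high-risk over-50s
# are vaccinated.
share = hosp_vax / (hosp_vax + hosp_unvax)
print(f"vaccinated share of all hospitalisations: {share:.0%}")  # ~60%
```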

Why It Matters for Policy

Simpson's Paradox is not merely a mathematical curiosity — it has direct implications for how data should be used in decision-making:

  • Aggregate statistics can actively mislead. When comparing treatment outcomes, educational results, economic performance, or any domain with heterogeneous subgroups, aggregation without stratification risks producing the opposite conclusion from the truth.
  • The choice of which variables to control for is critical. In the Berkeley case, controlling for department revealed the truth. In the kidney stone case, controlling for stone size did. In vaccine data, controlling for age was essential. The right answer depends on knowing — or reasoning carefully about — the causal structure of the data.
  • Disaggregated data is often politically inconvenient. Government statistics, corporate reports, and academic papers are routinely presented in aggregate form that may hide or reverse subgroup realities. Asking "how does this break down?" is one of the most productive habits a data consumer can develop.

The Deeper Question: Which Analysis Is Right?

A subtle and genuinely difficult aspect of Simpson's Paradox is that there is not always a single correct answer about which level of aggregation is valid. Computer scientist Judea Pearl argues that the paradox is resolved by causal reasoning rather than statistical analysis alone: you need a causal model of the situation to know whether to control for the grouping variable or not.

In the Berkeley case, controlling for department is correct because the choice of department was made before the admission decision, and the question is whether the admissions process itself discriminated — so you should compare within the departments where the decisions were made. In other contexts, however, controlling for an intermediate variable can produce misleading results: if the grouping variable is itself caused by the treatment you're studying, controlling for it "blocks" a legitimate causal pathway and distorts the estimate.

This connects Simpson's Paradox to the broader challenge of confounding variables and to the fundamental difficulty of causal inference from observational data. The paradox is a dramatic illustration of why correlation is not causation — and why understanding the structure of a data-generating process is as important as the data itself.

Recognising Simpson's Paradox in the Wild

Practical signals that Simpson's Paradox may be distorting a conclusion:

  • You are comparing aggregate rates or percentages across heterogeneous groups
  • The groups being compared differ substantially in composition (age, size, severity, income)
  • The outcome is strongly influenced by a factor that is unevenly distributed across groups
  • The conclusion feels like it might reverse if you "zoomed in" on subgroups

The corrective instinct is simply to ask: "Does this conclusion hold in every subgroup?" If the overall trend disappears or reverses when you break the data down, a confounding variable is at work, and the stratified analysis is usually (though, as the previous section explained, not always) the one closer to the question you actually want answered.
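That check can be made mechanical. A minimal sketch of such a checker (simpson_check is a hypothetical helper of mine, not a standard library function):

```python
def simpson_check(strata):
    """strata: dict mapping subgroup -> ((succ_A, n_A), (succ_B, n_B)).
    Returns (A_wins_overall, A_wins_per_subgroup, reversal_detected)."""
    a_succ = a_n = b_succ = b_n = 0
    a_wins = []
    for (sa, na), (sb, nb) in strata.values():
        a_wins.append(sa / na > sb / nb)
        a_succ += sa; a_n += na
        b_succ += sb; b_n += nb
    overall = a_succ / a_n > b_succ / b_n
    # Flag when A wins (or loses) every subgroup but the pooled result disagrees.
    reversal = (all(a_wins) and not overall) or (not any(a_wins) and overall)
    return overall, a_wins, reversal

# The kidney stone data from earlier trips the flag:
strata = {"small": ((81, 87), (234, 270)), "large": ((192, 263), (55, 80))}
print(simpson_check(strata))  # (False, [True, True], True)
```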

Sources

  • Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). "Sex bias in graduate admissions: Data from Berkeley." Science, 187(4175), 398–404.
  • Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. A. (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy." British Medical Journal, 292(6524), 879–882.
  • Pearl, J. (2014). "Comment: Understanding Simpson's Paradox." The American Statistician, 68(1), 8–13.
  • Appleton, D. R., French, J. M., & Vanderpump, M. P. J. (1996). "Ignoring a covariate: An example of Simpson's paradox." The American Statistician, 50(4), 340–341.
  • Hernán, M. A., Clayton, D., & Keiding, N. (2011). "The Simpson's paradox unraveled." International Journal of Epidemiology, 40(3), 780–785.
