sieve_bias
Sieve bias occurs when data passes through multiple filtering or selection steps, each of which may introduce its own subtle bias. While any single filter might have a minor effect, the cumulative result of successive filtering can produce a final sample that is profoundly unrepresentative of the original population. The compounding nature of sequential selection makes the total bias much larger and harder to predict than any individual step would suggest.
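The compounding can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the 30% starting share, and the 70% and 90% pass rates are hypothetical, and it assumes that within each group the filters act independently, so pass probabilities multiply.

```python
from math import prod


def share_after_filters(initial_share, retention_group, retention_others):
    """Share of a subgroup remaining after a sequence of filters.

    initial_share: fraction of the population in the subgroup before filtering.
    retention_group / retention_others: per-filter probabilities of passing,
    for the subgroup and for everyone else (hypothetical values).
    Assumes filters act independently within each group.
    """
    kept_group = initial_share * prod(retention_group)
    kept_others = (1 - initial_share) * prod(retention_others)
    return kept_group / (kept_group + kept_others)


# Hypothetical: a subgroup makes up 30% of the population and passes each
# filter 70% of the time, versus 90% for everyone else.
one_filter = share_after_filters(0.30, [0.7], [0.9])
three_filters = share_after_filters(0.30, [0.7] * 3, [0.9] * 3)

print(f"share after 1 filter:  {one_filter:.3f}")   # 0.250
print(f"share after 3 filters: {three_filters:.3f}")  # 0.168
```

Under these assumed rates, one filter moves the subgroup from 30% to 25% of the sample, which is easy to overlook. Three such filters leave it at roughly 17%, even though no single step looked severe.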
A clinical study starts with 10,000 patients, then restricts to those who completed intake forms (excluding the sickest), then to those with follow-up data (excluding dropouts who experienced side effects), then to those with complete lab results (excluding the poorest). The final 2,000 patients are healthier, wealthier, and more compliant than the original population.
A tech company surveys employees about workplace satisfaction, but only workers with a company email account are invited, then only those who open the HR newsletter see the survey link, then only those who feel strongly enough bother to respond. Each filter quietly removes a different type of employee — contractors, disengaged staff, and those with mild opinions — leaving a final sample that bears little resemblance to the actual workforce.
An economics study on the returns to education uses administrative records that first exclude anyone without a social security number, then drop records with incomplete wage data, then remove individuals who changed jobs more than twice. Immigrants, gig workers, and the most economically mobile people disappear through successive cuts, and the estimated wage premium for a college degree reflects only a narrow, stable slice of the labor market.
Binary (yes/no) questions an LLM must answer to identify this aspect:
Has the data been filtered through multiple sequential selection criteria?
Type: binary
Could each filtering step disproportionately remove certain types of observations?
Type: binary
Is the remaining sample systematically different from the original population after all filters are applied?
Type: binary
Has the cumulative effect of all filtering steps on sample composition been assessed?
Type: binary
Each filtering criterion seems reasonable in isolation, and researchers may not track how the sample composition changes across all steps. The combined effect of many small biases is non-obvious and can radically alter who remains in the study without anyone noticing the cumulative distortion.
Document the sample size and composition at each filtering step. Create flow diagrams showing attrition. Compare characteristics of included and excluded participants at each stage. Use multiple imputation or inverse probability weighting to account for systematic dropouts.
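The bookkeeping described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not a prescribed implementation: the helper names (`attrition_table`, `ipw_mean`), the field names, and the eight toy records are invented. The toy outcome depends only on the group, which is the situation where simple group-level inverse probability weighting works. The weights are the inverse of each group's observed retention rate, so a group that has been filtered out entirely cannot be recovered this way.

```python
def subgroup_share(rows, group_key):
    """Fraction of rows flagged as belonging to the subgroup."""
    if not rows:
        return float("nan")
    return sum(1 for r in rows if r[group_key]) / len(rows)


def attrition_table(records, filters, group_key):
    """Apply filters in order, logging sample size and subgroup share per step."""
    current = list(records)
    log = [("start", len(current), subgroup_share(current, group_key))]
    for name, keep in filters:
        current = [r for r in current if keep(r)]
        log.append((name, len(current), subgroup_share(current, group_key)))
    return log, current


def ipw_mean(original, retained, group_key, value_key):
    """Mean of value_key in retained rows, weighted by 1 / P(retained | group)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for r in retained:
        g = r[group_key]
        n_orig = sum(1 for o in original if o[group_key] == g)
        n_kept = sum(1 for k in retained if k[group_key] == g)
        w = n_orig / n_kept  # inverse of the observed retention rate
        weighted_sum += w * r[value_key]
        weight_total += w
    return weighted_sum / weight_total


def rec(low_income, has_followup, has_labs, outcome):
    return {"low_income": low_income, "has_followup": has_followup,
            "has_labs": has_labs, "outcome": outcome}


# Toy data (hypothetical): outcome is 10 for low-income, 20 otherwise.
records = [
    rec(True, True, True, 10.0), rec(True, True, False, 10.0),
    rec(True, False, True, 10.0), rec(True, False, False, 10.0),
    rec(False, True, True, 20.0), rec(False, True, True, 20.0),
    rec(False, True, True, 20.0), rec(False, False, True, 20.0),
]
filters = [
    ("has follow-up", lambda r: r["has_followup"]),
    ("has complete labs", lambda r: r["has_labs"]),
]

log, final = attrition_table(records, filters, "low_income")
for step, n, share in log:
    print(f"{step:<18} n={n}  low-income share={share:.2f}")

naive = sum(r["outcome"] for r in final) / len(final)
adjusted = ipw_mean(records, final, "low_income", "outcome")
print(f"naive mean: {naive:.1f}, weighted mean: {adjusted:.1f}")
```

In this toy run the low-income share falls from 0.50 to 0.40 to 0.25 across the two filters. The naive mean of the surviving records is 17.5, while the weighted mean returns to the full-population value of 15.0. With real data the correction is only as good as the assumption that retention depends on the recorded characteristics alone.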
Sieve bias is common in clinical trials with strict inclusion criteria, data science pipelines with multiple cleaning steps, hiring processes with sequential screening rounds, and systematic reviews with multi-stage study selection.
Survivorship bias: the statistical error of drawing conclusions from a dataset that has been filtered by a survival or success criterion, without accounting for the filtered-out cases. The surviving sample is systematically different from the full population, and conclusions drawn from it are biased.
Non-response bias: a systematic difference between respondents and non-respondents that distorts study results.
Collider bias: a statistical error that occurs when conditioning on a variable that is causally affected by two other variables creates a spurious association between those two variables. In a causal diagram, a collider is a variable where two causal arrows converge, and conditioning on it opens a non-causal path.