Model Selection Bias

Also Known As: Stepwise selection bias; Data-driven model selection inflation

Aspect ID: model_selection_bias

Definition

Model selection bias occurs when the final statistical model is chosen after examining the data and then reported as if it had been specified in advance. Parameter estimates, standard errors, p-values, and fit statistics from the selected model are optimistically biased: effects look larger and uncertainty looks smaller than they really are. Stepwise regression and other automated selection procedures search over many model specifications using the same data used for estimation. The selected model tends to overfit that data, so its apparent performance often fails to replicate on new data.

Examples

A researcher uses stepwise regression to select from 30 candidate predictors. The algorithm retains 8 predictors, yielding R² = 0.62, and this R² is reported as if the 8 predictors had been pre-specified. On new data, the R² is likely to be far lower, perhaps 0.20 or less.

A data scientist building a customer churn model tries 12 different machine learning algorithms and tunes hyperparameters on the full dataset, then reports the best-performing model's 94% accuracy. Because the model was chosen based on its performance on that same data, the reported accuracy is almost certainly inflated and will not hold on truly new customers.

A political scientist tests whether economic anxiety, immigration concern, religious attendance, education level, and 6 other variables predict voting behavior. After dropping non-significant predictors, she reports a clean 4-variable model as if those 4 variables were theoretically motivated from the start. The p-values and effect sizes are biased because the model was sculpted by the data itself.
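The inflation in the first example can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reproduction of any real study: the outcome and all 30 candidate predictors are pure noise, and picking the single best-correlated predictor stands in for a full stepwise procedure. The function names (`r_squared`, `one_replication`, `average_r2`) and the sample sizes are hypothetical choices made for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def one_replication(rng, n_obs=50, n_predictors=30):
    """Select the best of several pure-noise predictors, then re-test it."""

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]

    y = noise()
    candidates = [noise() for _ in range(n_predictors)]
    # Selection step: keep the predictor with the highest in-sample R^2.
    selected_r2 = max(r_squared(x, y) for x in candidates)
    # Fresh sample: the selected predictor is still unrelated to the outcome,
    # so new draws of predictor and outcome are independent noise.
    fresh_r2 = r_squared(noise(), noise())
    return selected_r2, fresh_r2


def average_r2(n_reps=200, seed=0):
    """Average both R^2 values over many simulated studies."""
    rng = random.Random(seed)
    results = [one_replication(rng) for _ in range(n_reps)]
    mean_selected = sum(s for s, _ in results) / n_reps
    mean_fresh = sum(f for _, f in results) / n_reps
    return mean_selected, mean_fresh


mean_selected, mean_fresh = average_r2()
print(f"mean in-sample R^2 of selected predictor: {mean_selected:.3f}")
print(f"mean R^2 of that predictor on fresh data: {mean_fresh:.3f}")
```

Even though no predictor has any real relationship with the outcome, the selected predictor's in-sample R² is on average well above what the same predictor achieves on fresh data. The gap comes entirely from the selection step.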

Verification Steps

Binary (yes/no) questions an LLM must answer to identify this aspect:

1. Was the final model chosen after examining model fit statistics on the training data?
   Type: binary

2. Were multiple model specifications compared and the best-fitting one selected?
   Type: binary

3. Are standard errors and p-values reported as if the model were pre-specified?
   Type: binary

4. Was model performance validated on an independent dataset?
   Type: binary

Description

Why It Works

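Each step of model selection optimizes the fit statistic on the current data. Because the search compares many candidate models, the winner is partly the one whose fit benefited most from chance patterns in this particular sample. Standard inferential procedures assume a pre-specified model, so the reported standard errors and p-values do not account for the search that produced it.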
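The core of this argument can be written as a single inequality. The notation is an assumption introduced here for illustration: $\hat{R}^2_j$ denotes the in-sample fit statistic of candidate model $j$ among $m$ candidates, and the expectation is taken over repeated samples.

```latex
\mathbb{E}\!\left[\max_{1 \le j \le m} \hat{R}^2_j\right]
\;\ge\;
\max_{1 \le j \le m} \mathbb{E}\!\left[\hat{R}^2_j\right]
```

The left side is the expected fit of whichever model wins the search; the right side is the expected fit of the best model fixed in advance. The inequality holds for any collection of random variables, and it is strict whenever the identity of the winning model varies from sample to sample, which is why the reported fit of a selected model is optimistic on average.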

How to Counter

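Pre-register the model specification. Use regularization (LASSO, ridge) rather than stepwise selection. Validate on an independent holdout sample that played no part in model selection. Apply cross-validation, repeating the selection step inside each fold. Report the total number of models examined.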
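Cross-validation only counters the bias if the selection step is repeated inside each training fold. The sketch below compares the two ways of doing it on pure-noise data, where no model can truly beat predicting the mean. It is an illustration under assumptions, not a recommended pipeline: the helper names (`fit_line`, `pick_best`, `cv_mse`, `compare`), the single-predictor selection rule, and the sample sizes are all hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pick_best(X, y, rows):
    """Index of the predictor most strongly correlated with y on `rows`."""
    ys = [y[i] for i in rows]
    mean_y = sum(ys) / len(ys)

    def score(j):
        xs = [X[j][i] for i in rows]
        mean_x = sum(xs) / len(xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        sxx = sum((a - mean_x) ** 2 for a in xs)
        return sxy * sxy / sxx  # proportional to R^2, since var(y) is shared

    return max(range(len(X)), key=score)


def cv_mse(X, y, folds, select_inside):
    """Out-of-fold mean squared error of a one-predictor regression."""
    all_rows = [i for fold in folds for i in fold]
    # Leaky choice: selection sees every row, including future test folds.
    leaky_choice = pick_best(X, y, all_rows)
    total = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        j = pick_best(X, y, train) if select_inside else leaky_choice
        intercept, slope = fit_line([X[j][i] for i in train],
                                    [y[i] for i in train])
        total += sum((y[i] - intercept - slope * X[j][i]) ** 2 for i in test)
    return total / len(all_rows)


def compare(n_reps=60, n_obs=40, n_predictors=20, k=5, seed=0):
    """Average CV error with selection outside vs. inside the folds."""
    rng = random.Random(seed)
    leaky, honest = 0.0, 0.0
    for _ in range(n_reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
             for _ in range(n_predictors)]
        rows = list(range(n_obs))
        rng.shuffle(rows)
        folds = [rows[f::k] for f in range(k)]
        leaky += cv_mse(X, y, folds, select_inside=False)
        honest += cv_mse(X, y, folds, select_inside=True)
    return leaky / n_reps, honest / n_reps


leaky_mse, honest_mse = compare()
print(f"CV error, selection on full data first: {leaky_mse:.3f}")
print(f"CV error, selection inside each fold:   {honest_mse:.3f}")
```

Selecting on the full data and cross-validating afterwards reports a lower error than the data can support, because the held-out rows already influenced which predictor was chosen. Repeating the selection inside each fold removes that leak and gives the more pessimistic, more trustworthy figure.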

Real-World Context

Biomarker studies that apply automated feature selection to high-dimensional -omics data often overfit and fail to replicate. The field of radiomics has been particularly affected.

Related Aspects

P-Hacking (Data Dredging); Freedman's Paradox; Overfitting
