model_selection_bias
Model selection bias occurs when the final statistical model is chosen after examining the data and then reported as if it had been specified in advance. Parameter estimates and fit statistics from the selected model are optimistically biased, and its standard errors and p-values are too small. Stepwise regression and other automated selection procedures search over many model specifications using the same data used for estimation. The selected model tends to over-fit the training data and often fails to replicate on new data.
A researcher uses stepwise regression to select from 30 candidate predictors. The algorithm retains 8 predictors, yielding R² = 0.62. This R² is then reported as if the 8 predictors had been pre-specified. On new data, the R² is likely to be much lower, perhaps 0.20 or less.
A data scientist building a customer churn model tries 12 different machine learning algorithms and tunes hyperparameters on the full dataset, then reports the best-performing model's 94% accuracy. Because the model was chosen based on its performance on that same data, the reported accuracy is almost certainly inflated and will not hold on truly new customers.
A political scientist tests whether economic anxiety, immigration concern, religious attendance, education level, and 6 other variables predict voting behavior. After dropping non-significant predictors, she reports a clean 4-variable model as if those 4 variables were theoretically motivated from the start. The p-values and effect sizes are biased because the model was sculpted by the data itself.
Binary (yes/no) questions an LLM must answer to identify this aspect:
Was the final model chosen after examining model fit statistics on the training data?
Type: binary
Were multiple model specifications compared and the best-fitting one selected?
Type: binary
Are standard errors and p-values reported as if the model were pre-specified?
Type: binary
Was model performance validated on an independent dataset?
Type: binary
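Each step of model selection optimizes the fit statistic on the current data, so the search favors whichever model happens to fit this particular sample best, and part of that fit is due to chance. Standard inferential procedures assume a pre-specified model, so the reported standard errors and p-values do not account for the selection step.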
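The mechanism can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: every candidate predictor is pure noise, the "search" is simply picking the single predictor with the highest R², and the names `pearson_r`, `one_trial`, and `simulate` are hypothetical helpers, not part of any library. It compares the average R² of the selected predictor with the average R² obtained on a fresh draw of data.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def one_trial(rng, n_obs=50, n_candidates=30):
    """Select the best of many noise predictors, then re-test on fresh data.

    Assumption: predictors and outcome are independent standard normals,
    so the true R^2 of every candidate is exactly zero.
    """
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    candidates = [[rng.gauss(0, 1) for _ in range(n_obs)]
                  for _ in range(n_candidates)]
    # The "search": keep the candidate with the highest in-sample R^2.
    selected_r2 = max(pearson_r(x, y) ** 2 for x in candidates)
    # Fresh sample for the selected predictor; it is still unrelated to y.
    y_new = [rng.gauss(0, 1) for _ in range(n_obs)]
    x_new = [rng.gauss(0, 1) for _ in range(n_obs)]
    fresh_r2 = pearson_r(x_new, y_new) ** 2
    return selected_r2, fresh_r2


def simulate(n_trials=200, seed=0):
    """Average the selected and fresh R^2 over many repetitions."""
    rng = random.Random(seed)
    results = [one_trial(rng) for _ in range(n_trials)]
    mean_selected = sum(r[0] for r in results) / n_trials
    mean_fresh = sum(r[1] for r in results) / n_trials
    return mean_selected, mean_fresh


if __name__ == "__main__":
    mean_selected, mean_fresh = simulate()
    print(f"mean R^2 of selected predictor: {mean_selected:.3f}")
    print(f"mean R^2 on fresh data:         {mean_fresh:.3f}")
```

Because nothing in the data is real signal, any gap between the two averages comes from the selection step alone.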
Pre-register the model specification. Use regularization (LASSO, ridge). Validate on an independent holdout sample. Apply cross-validation. Report the total number of models examined.
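As one way to apply cross-validation here, the sketch below repeats the selection step inside every fold and scores only on held-out rows, so the selection itself is part of what gets validated. It is a minimal illustration under assumptions: the "model search" is reduced to choosing one predictor by in-sample R², the data in the demo are pure noise, and `fit_line`, `r_squared`, `select_best`, and `cross_validated_r2` are hypothetical helper names rather than a library API.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """1 - SSE/SST for the given line on the given data."""
    my = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst


def select_best(columns, ys, rows):
    """Index and in-sample R^2 of the best single predictor on `rows`."""
    y = [ys[i] for i in rows]
    best_j, best_r2 = None, float("-inf")
    for j, col in enumerate(columns):
        x = [col[i] for i in rows]
        slope, intercept = fit_line(x, y)
        r2 = r_squared(x, y, slope, intercept)
        if r2 > best_r2:
            best_j, best_r2 = j, r2
    return best_j, best_r2


def cross_validated_r2(columns, ys, k=5, seed=0):
    """Out-of-sample R^2 with selection repeated inside each fold.

    The baseline is the training-fold mean, so the result can be negative.
    """
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    sse = sst = 0.0
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        j, _ = select_best(columns, ys, train)  # selection sees train rows only
        slope, intercept = fit_line([columns[j][i] for i in train],
                                    [ys[i] for i in train])
        train_mean = sum(ys[i] for i in train) / len(train)
        for i in test:
            sse += (ys[i] - (intercept + slope * columns[j][i])) ** 2
            sst += (ys[i] - train_mean) ** 2
    return 1 - sse / sst


if __name__ == "__main__":
    rng = random.Random(42)
    n_obs, n_candidates = 100, 30  # number of models examined: 30
    ys = [rng.gauss(0, 1) for _ in range(n_obs)]
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)]
               for _ in range(n_candidates)]
    _, naive = select_best(columns, ys, list(range(n_obs)))
    print(f"naive in-sample R^2 of selected model: {naive:.3f}")
    print(f"cross-validated R^2 (selection in-fold): "
          f"{cross_validated_r2(columns, ys):.3f}")
```

The key design choice is that `select_best` never sees the held-out fold. Selecting on the full data first and cross-validating only the final fit would leak the selection into the validation.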
Biomarker studies that use automated feature selection from high-dimensional -omics data systematically overfit and fail to replicate. The field of radiomics has been particularly affected.