Overfitting: When a Model Learns the Noise
Imagine hiring a fortune teller who has memorised every horse race result from the past decade. Given any historical race, they can tell you the winner instantly and with perfect confidence. Now ask them to predict tomorrow's race. They have no idea. Their "model" — their knowledge — is nothing but a list of past outcomes. It has captured all the noise, every fluke, every freak result, and built them into a system that explains everything that has already happened and predicts nothing that hasn't. This is overfitting: the construction of a model so precisely tailored to its training data that it ceases to be a model at all and becomes a mere record.
What Overfitting Actually Means
In any predictive or explanatory task, the goal is to find the signal in the noise — the genuine underlying pattern that will continue to hold in new data. All real-world data contains both signal (the real relationship we care about) and noise (random variation, measurement error, one-off flukes). A well-fitted model captures the signal and ignores the noise. An overfitted model captures both, treating every wrinkle in the data as meaningful.
The technical signature of overfitting is a large gap between performance on training data and performance on new, unseen data. An overfitted model scores brilliantly on the data it was built from — sometimes achieving near-perfect accuracy — and then performs poorly on a held-out test set or in real-world deployment. The model has not learned; it has memorised. And memorisation does not generalise.
Overfitting typically occurs when a model has too many parameters relative to the amount of training data. A straight line through ten data points captures the general trend. A ninth-degree polynomial fitted to the same ten points will pass through every single one of them — achieving zero training error — but between those points it will loop and curve in ways that reflect nothing but mathematical artefact. The polynomial is more complex; it is also less useful.
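A minimal sketch of that contrast, using NumPy on made-up data (the linear "signal" and the noise level are invented for illustration): a straight line and a ninth-degree polynomial are fitted to the same ten noisy points, then both are scored on fresh draws from the same process.

```python
# A straight line vs. a ninth-degree polynomial on ten noisy points (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=10)        # linear signal plus noise

line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)    # captures the trend
wiggle = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)  # interpolates all ten points

x_new = rng.uniform(0, 1, size=200)                             # fresh data from the same process
y_new = 2.0 * x_new + rng.normal(scale=0.2, size=200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

print("train MSE:  line", mse(line, x_train, y_train), " poly", mse(wiggle, x_train, y_train))
print("test  MSE:  line", mse(line, x_new, y_new),     " poly", mse(wiggle, x_new, y_new))
# Typically the polynomial's training error is essentially zero while its test error
# is far worse than the line's: perfect memorisation, poor generalisation.
```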
The Curse of Too Many Parameters
The statistician George Box famously observed that "all models are wrong, but some are useful." The art of modelling lies in choosing the right degree of complexity — enough to capture real patterns, not so much that you start explaining noise. This balance is sometimes called the bias-variance trade-off. A model with too little complexity has high bias: it misses real patterns. A model with too much complexity has high variance: it is hypersensitive to the specific data it was trained on, and the fitted model swings wildly with small changes in the training data.
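The trade-off has a standard formal statement for squared-error prediction. Writing the data as y = f(x) + ε, where ε is zero-mean noise with variance σ², the expected prediction error of a fitted model f̂ at a point x decomposes as:

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^{2}\right]
  \;=\; \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
  \;+\; \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^{2}\right]}_{\text{variance}}
  \;+\; \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

Under-complex models inflate the first term, over-complex models inflate the second, and nothing reduces the third.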
In machine learning, overfitting is a central concern. Neural networks with millions of parameters, trained on datasets of thousands of examples, have enormous capacity to overfit. A model that achieves 99% accuracy on its training images and 60% accuracy on new images has not succeeded — it has memorised the training images. The field has developed a range of countermeasures: regularisation (penalising complexity), dropout (randomly disabling connections during training), early stopping (halting training before the model over-adapts), and cross-validation (systematically testing on held-out data during development).
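Early stopping is the easiest of these to show in a few lines. The sketch below is a hand-rolled version using scikit-learn's SGDRegressor on synthetic data; the learning rate, patience, and dataset are illustrative choices (scikit-learn also offers a built-in early_stopping option that does much the same thing).

```python
# Hand-rolled early stopping: train only while the validation score keeps improving.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=50, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.001, random_state=0)
best_score, patience, stale = -np.inf, 5, 0
for epoch in range(200):
    model.partial_fit(X_train, y_train)      # one pass over the training data
    score = model.score(X_val, y_val)        # R^2 on data never used for fitting
    if score > best_score:
        best_score, stale = score, 0
    else:
        stale += 1
    if stale >= patience:                    # validation score has stopped improving,
        break                                # so further training would only fit noise

print(f"stopped after {epoch + 1} epochs, best validation R^2 = {best_score:.2f}")
```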
Cross-validation deserves special attention as a diagnostic tool. By repeatedly splitting data into training and testing partitions, it gives a realistic estimate of how a model will perform on genuinely new data. A model whose cross-validated performance is much worse than its training performance is almost certainly overfitted — a warning sign that should trigger simplification, more data, or both.
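A minimal illustration of that diagnostic, using scikit-learn on a synthetic classification problem: an unconstrained decision tree is scored once on its own training data and once by five-fold cross-validation. The dataset and model are stand-ins chosen only to make the gap easy to see.

```python
# Overfitting check: training accuracy vs. cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)  # effectively unlimited capacity
deep_tree.fit(X, y)

train_acc = deep_tree.score(X, y)                       # evaluated on its own training data
cv_acc = cross_val_score(deep_tree, X, y, cv=5).mean()  # evaluated on held-out folds

print(f"training accuracy: {train_acc:.2f}, cross-validated accuracy: {cv_acc:.2f}")
# A large gap (e.g. 1.00 vs. roughly 0.75-0.85) is the classic signature of overfitting.
```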
Overfitting Beyond Machine Learning
Financial Models
In quantitative finance, overfitting is sometimes called "backtest overfitting" or, more bluntly, "curve-fitting." A trading algorithm that is optimised against historical market data will often appear extraordinarily profitable in backtests — and then fail spectacularly in live trading. The historical data contains thousands of specific patterns that are purely coincidental: the fact that the market rose on every third Tuesday in October for seven years is not a tradable signal, but a sufficiently flexible model will find it and build a strategy around it.
Researchers Marcos López de Prado and David Bailey have documented this extensively, showing that when enough strategies are tested against the same historical dataset, some will inevitably appear highly profitable by chance alone. The phenomenon links directly to p-hacking and data dredging: the more hypotheses you test, the more apparent patterns you will find, and most of them will be noise. The financial industry's graveyard of failed quant funds is, in significant part, a graveyard of overfitted models.
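A toy simulation makes the mechanism concrete. In the sketch below the "strategies" are nothing but random daily returns with no edge at all, yet picking the best in-sample Sharpe ratio out of a thousand candidates still produces something that looks tradable. The numbers are purely illustrative and do not reproduce the authors' methodology.

```python
# One thousand strategies with zero true edge: the best backtest still looks good.
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 252                       # roughly one trading year per strategy

# Daily returns: zero mean, 1% volatility -- pure noise, no signal whatsoever.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)   # annualised Sharpe ratios
print("best in-sample Sharpe:", round(float(sharpe.max()), 2))       # often above 3, purely by luck
print("strategies with Sharpe > 1:", int((sharpe > 1).sum()))
# Picking the best backtest and trading it live is selecting on noise, not on skill.
```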
Medical Diagnostics
Overfitting also threatens clinical prediction models — tools designed to estimate a patient's risk of disease, complication, or death. A diagnostic model developed on a single hospital's patient records may achieve excellent discrimination in that setting, then fail when applied at another hospital with a slightly different patient population, different measurement protocols, or different baseline disease rates. The model has learned features specific to its training environment, not features of the underlying biology.
A notable example is the APACHE scoring system for ICU patients and its successors: successive versions were developed on new data in part because earlier versions had overfitted to their original training populations. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines were developed in part to address this problem, by requiring transparent reporting of how prediction models are developed and validated, with external validation on new populations as the key test before a model is recommended for clinical use.
Scientific Research
In academic research, overfitting appears when researchers add variables to a regression model until it fits their dataset well, without accounting for the loss of degrees of freedom or the resulting inflation of false-positive risk. A model with fifteen predictors and forty observations will typically achieve a high R² — not because the predictors are genuinely important, but because the model has enough parameters to contort itself around the specific data. The result often fails to replicate. This issue connects closely to the publication bias problem: models that fit well get published; the many failed replications do not.
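The effect is easy to reproduce. In the sketch below, fifteen predictors and the outcome are all independent random noise, yet an ordinary least-squares fit on forty observations still reports a respectable-looking in-sample R² (the specific numbers are illustrative).

```python
# Fifteen pure-noise predictors, forty observations, and a deceptively high R^2.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 15))   # fifteen predictors, none related to the outcome
y = rng.normal(size=40)         # the outcome is independent random noise

model = LinearRegression().fit(X, y)
print("in-sample R^2:", round(model.score(X, y), 2))
# Typically lands around 0.3-0.5 even though the true R^2 is exactly zero:
# the model has enough parameters to contort itself around this particular sample.
```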
The Deeper Problem: Explaining the Past
There is a philosophical dimension to overfitting that extends beyond technical statistics. Human cognition is extraordinarily good at finding patterns — so good that it routinely finds patterns that are not there. Apophenia, the tendency to perceive meaningful connections in random data, is the cognitive analogue of overfitting. We construct narratives that explain past events with apparent precision — "the housing market crashed because of X, Y, and Z" — and then mistake the completeness of the narrative for its predictive validity.
Nassim Nicholas Taleb has made this point forcefully in The Black Swan: the best explanation of past events is almost always useless as a predictor of future events, because it has been optimised for the specific sequence that occurred, not for the class of sequences that could occur. A theory that perfectly explains the 2008 financial crisis will rarely predict the next one, because the next one will differ in exactly the ways that the theory's tailored specificity cannot accommodate.
Diagnosing and Preventing Overfitting
Practically speaking, the main tools against overfitting are:
- Hold-out testing: Reserve data that the model never sees during development, and evaluate only on that. A model should never be judged by its performance on its own training data.
- Simpler models: Given two models with similar predictive accuracy, prefer the simpler one. This is Occam's razor applied to statistics. The parsimony principle is not just aesthetic — it is epistemically defensible.
- Regularisation: Penalise model complexity explicitly. Lasso and ridge regression, for example, shrink parameter estimates toward zero, preventing the model from over-adapting to peculiarities in the training data; the sketch after this list shows ridge regularisation judged on a held-out test set.
- More data: With enough data, the noise averages out and the signal dominates. Many overfitting problems dissolve when more training examples are available, because the model can no longer bend itself around the idiosyncrasies of individual examples.
- Pre-registration: In research contexts, committing to a model specification before seeing the data prevents post-hoc fitting. This is the structural solution to the overfitting problem in science.
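A short sketch tying two of these tools together, using scikit-learn on synthetic data: a high-degree polynomial regression is fitted with a near-zero ridge penalty and with a moderate one, and both are judged on a held-out test set rather than on the training data. The degree, penalty strengths, and data are illustrative choices, not recommendations.

```python
# Hold-out testing plus ridge regularisation on a deliberately over-flexible model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=60)    # smooth signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

for alpha in (1e-8, 1.0):   # near-zero penalty vs. a moderate one
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
# Typically the near-unpenalised model scores best on the training split and worst on the
# held-out split; the penalised one trades a little training fit for better generalisation.
```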
Overfitting is ultimately a failure of intellectual humility — the mistake of believing that a perfect fit to past data is evidence of genuine understanding. The best models are wrong in simple, predictable ways. The worst models are right about everything that has already happened, and wrong about everything that matters.
Sources & Further Reading
- Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. Ch. 7 (Model Assessment and Selection).
- Bailey, D. H., & López de Prado, M. "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40, no. 5 (2014): 94–107.
- Collins, G. S., et al. "Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD)." Annals of Internal Medicine 162, no. 1 (2015): 55–63.
- Taleb, N. N. The Black Swan: The Impact of the Highly Improbable. Random House, 2007.
- Domingos, P. "A Few Useful Things to Know About Machine Learning." Communications of the ACM 55, no. 10 (2012): 78–87.
- Wikipedia: Overfitting