Overfitting — When Numbers Lie

Picture this: an analyst builds a stock market prediction model using 50 variables and just 100 days of data. It nails the past and flops on the future.

Also known as: overtraining, curve fitting, memorization

What's Actually Happening

Overfitting occurs when a statistical model or analysis captures noise and random fluctuations in the training data rather than the underlying pattern. An overfitted model performs excellently on the data it was built on but fails to generalize to new, unseen data. This happens when the model is too complex relative to the amount of data available, allowing it to memorize specific data points rather than learning general relationships.
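
Want to watch it happen? Here is a minimal Python sketch, assuming NumPy is installed. Everything in it is made up for illustration: the "true" pattern is a straight line, the noise level is arbitrary, and the names (noisy_line, mse, results) are hypothetical. It fits a simple model (a straight line) and a very flexible one (a degree-9 curve) to the same 10 points, then checks both on fresh data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_line(x):
    # Assumed "real" pattern: y = 2x, plus random noise
    return 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)

# 10 points to build the model on, 200 fresh points to check it
x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_line(x_train)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = noisy_line(x_test)

def mse(model, x, y):
    # Mean squared error: average of (prediction - actual)^2
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    train_err = mse(model, x_train, y_train)
    test_err = mse(model, x_test, y_test)
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: training error {train_err:.3f}, "
          f"new-data error {test_err:.3f}")
```

A degree-9 curve has 10 adjustable numbers, which is enough to pass through all 10 training points, so its training error drops to essentially zero. That is memorization, not learning, and the error on the fresh points shows it.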

Why does this fool people? High accuracy on known data is intuitively convincing, and people confuse descriptive accuracy (fitting past data) with predictive accuracy (forecasting new data). Adding complexity almost never makes the fit to the training data worse, so a more complex model can look superior even when the extra complexity is only fitting noise.
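
To see why extra complexity flatters a model, here is a small Python sketch (assuming NumPy; the data and the names training_r2 and r2_by_size are invented for this illustration). Both the target and the predictors are pure random noise, so there is nothing real to find. The score used is R^2, the share of the ups and downs in the training data that the model "explains": 0 means none, 1 means all.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 50
y = rng.normal(size=n_rows)                  # target: pure noise
noise_cols = rng.normal(size=(n_rows, 40))   # predictors: unrelated noise

def training_r2(n_predictors):
    # Least-squares fit using an intercept plus the first n_predictors columns
    X = np.column_stack([np.ones(n_rows), noise_cols[:, :n_predictors]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    return 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

r2_by_size = {k: training_r2(k) for k in (0, 10, 20, 40)}
for k, r2 in r2_by_size.items():
    print(f"{k:2d} noise predictors -> training R^2 = {r2:.2f}")
```

Each larger model contains the smaller one, so its training fit can only stay the same or improve. The score climbs toward 1 even though every predictor is meaningless.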

Real Talk: You See This Every Day

An analyst builds a stock market prediction model using 50 variables and 100 days of data. With that many adjustable variables and so few data points, the model has enough freedom to fit random wiggles. It 'predicts' past prices almost perfectly, achieving 99% accuracy on historical data. When applied to the next month's data, it performs worse than simply guessing that the market will stay flat.
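
The analyst's story can be simulated. The sketch below (Python with NumPy) is a rough stand-in, not a real trading model: the 50 "indicators" and the daily price changes are random numbers with no connection to each other, 21 days stands in for one trading month, and the model is a plain least-squares regression. All names (one_experiment, share_model_loses, and so on) are hypothetical. It repeats the experiment 200 times and counts how often the model loses to the "market stays flat" guess.

```python
import numpy as np

rng = np.random.default_rng(42)
N_TRAIN, N_TEST, N_VARS = 100, 21, 50   # 100 past days, ~1 month ahead, 50 variables

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def one_experiment():
    # Made-up data: indicators and price changes are independent noise,
    # so there is truly nothing to learn.
    X = rng.normal(size=(N_TRAIN + N_TEST, N_VARS))
    change = rng.normal(0.0, 1.0, size=N_TRAIN + N_TEST)
    X_tr, X_te = add_intercept(X[:N_TRAIN]), add_intercept(X[N_TRAIN:])
    y_tr, y_te = change[:N_TRAIN], change[N_TRAIN:]

    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

    fit_error = np.sum((y_tr - X_tr @ coef) ** 2)
    train_r2 = 1.0 - fit_error / np.sum((y_tr - y_tr.mean()) ** 2)
    model_err = np.mean((y_te - X_te @ coef) ** 2)
    flat_err = np.mean(y_te ** 2)       # "stays flat" = predict zero change
    return train_r2, model_err, flat_err

runs = np.array([one_experiment() for _ in range(200)])
mean_train_r2 = runs[:, 0].mean()
share_model_loses = np.mean(runs[:, 1] > runs[:, 2])

print(f"average R^2 on the 100 past days: {mean_train_r2:.2f}")
print(f"share of runs where the model loses to 'flat': {share_model_loses:.0%}")
```

This simple setup will not reproduce the 99% figure from the story, but it shows the same trap: the model appears to explain a good chunk of the past even though the past was pure noise, and it usually does worse than the flat guess on the following month.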

Overfitting is a central concern in machine learning, financial modeling (backtested trading strategies), weather forecasting, and epidemiological projections.

Your BS Detector

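Always validate models on held-out data the model has never seen. Use cross-validation (testing on a different held-out slice of the data each round), apply regularization techniques (penalties that discourage needless complexity), and prefer simpler models when predictive performance is comparable (Occam's razor).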
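
Here is what those three defenses can look like in practice. This is a minimal Python sketch with NumPy, under loud assumptions: the data is invented (only the first of 40 variables matters), ridge regression is used as one example of regularization, the penalty value of 20 is picked by hand, and the helper names (fit_ridge, cv_error, cv_scores) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_rows, n_vars = 60, 40
X = rng.normal(size=(n_rows, n_vars))
# Assumed truth: only the first variable matters; the other 39 are noise
y = 1.5 * X[:, 0] + rng.normal(0.0, 1.0, size=n_rows)

def fit_ridge(X, y, penalty):
    # Ridge regression: least squares plus a penalty on large coefficients.
    # penalty = 0 is ordinary least squares. No intercept term is fitted,
    # since the made-up data is centered on zero.
    return np.linalg.solve(X.T @ X + penalty * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, columns, penalty, k=5):
    # k-fold cross-validation: every row is held out exactly once
    rows = np.arange(len(y))
    errors = []
    for held_out in np.array_split(rows, k):
        train = np.setdiff1d(rows, held_out)
        coef = fit_ridge(X[np.ix_(train, columns)], y[train], penalty)
        pred = X[np.ix_(held_out, columns)] @ coef
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

all_cols = np.arange(n_vars)
cv_scores = {
    "all_vars_no_penalty": cv_error(X, y, all_cols, penalty=0.0),
    "all_vars_ridge": cv_error(X, y, all_cols, penalty=20.0),
    "one_var_no_penalty": cv_error(X, y, np.array([0]), penalty=0.0),
}
for name, score in cv_scores.items():
    print(f"{name}: held-out error {score:.2f}")
```

Lower held-out error is better. The unpenalized 40-variable model scores worse than both the regularized version and the one-variable model, because the only number that counts here is error on data the model did not see while fitting.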

✓ Who collected this data, and why?
✓ Is the sample big enough and fair?
✓ Could there be another explanation?

The Challenge

Next time someone throws a statistic at you — in class, online, in the news — don't just accept it. Ask: what's missing from this picture?

Part of the TellDear Teen Book — criticalthinking.guide