Chapter 9. When Data and Reality Don’t Match

Spencer Burns

It is common knowledge that beating the stock market is hard. But on the face of things, it seems to be purely an analysis problem, not a data problem. Modeling is difficult; building a time series should be simple. For every day (or minute, or millisecond), for each unique stock symbol, there is a listed price at which you can buy shares, and you can sell those same shares later, comparing the two prices to calculate your profit.

Every assumption in the preceding statement is usually true, yet each of them fails often enough to ruin a model. A series of stock data may look structured and clean, but that neatness hides the idiosyncratic path of how a given stock got to where it is today. It is “good” data covering up for messy reality.
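
To make the naive view concrete, here is a minimal sketch of the profit calculation it implies. The symbol "XYZ", the prices, and the function name naive_profit are all hypothetical, invented for illustration. The code bakes in the assumptions described above: exactly one listed price per symbol per day, and a symbol that identifies the same shares on both dates.

```python
from datetime import date

# Hypothetical toy price table: (symbol, day) -> listed price.
# Assumes exactly one price per symbol per day.
prices = {
    ("XYZ", date(2020, 1, 2)): 10.0,
    ("XYZ", date(2020, 1, 3)): 12.5,
}


def naive_profit(prices, symbol, buy_day, sell_day, shares=1):
    """Profit from buying on buy_day and selling on sell_day.

    Assumes the symbol refers to the same shares on both days and
    that a listed price exists for each day.
    """
    buy_price = prices[(symbol, buy_day)]
    sell_price = prices[(symbol, sell_day)]
    return (sell_price - buy_price) * shares


profit = naive_profit(prices, "XYZ", date(2020, 1, 2), date(2020, 1, 3), shares=4)
```

Even in this toy version, a lookup for a day or symbol that is not in the table raises a KeyError, which hints at how easily the assumptions can break.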

Stock data does not come from independent observations of markets; the data is an integral part of the market. There is a tight feedback loop where data about the state of the market affects the market (e.g., rising prices may cause people to push prices up further). Furthermore, what happens in the market can change the nature of the market. Examples include companies with falling prices leaving the market, companies with rising prices buying other companies, and market crashes triggering changes in trading rules.

What this means in practice is that the stock market of today is not exactly the same as the market of yesterday; every day is a new experiment. Data needs to be constantly ...
