6.5. Summary (and Sermonette)

These dairy product and calendar examples are obviously contrived, but they are not far removed from many ill-conceived quantitative investment and trading ideas. It is just as easy to fool yourself with ideas that sound plausible but are no more valid.

Just because something appears plausible doesn't mean that it is. The wide availability of machine-readable data, and of the tools to analyze it, means that there are a lot more regressions going on than Legendre could ever have imagined back in 1805. If you run 100 regressions on data with no real relationships and test each at the 95 percent level, about five of them will look significant just by chance. Look at 100,000 models at 95 percent significance, and about 5,000 are false positives. Data mining, good or bad, is next to impossible to do without a computer.
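To make the arithmetic concrete, here is a minimal simulation sketch in Python. It is an illustration, not anything from the original text: the function names, the sample sizes, the random seed, and the use of the Fisher z approximation for the significance test are all assumptions chosen to keep the example self-contained. Every "model" relates two series of pure noise, so any that passes the test is a false positive.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def false_positive_rate(n_models=2000, n_obs=50, z_crit=1.96, seed=1805):
    """Share of pure-noise models that pass a two-sided 95 percent test.

    Assumption: significance is judged with the Fisher z approximation,
    atanh(r) * sqrt(n_obs - 3), compared against the normal critical
    value of 1.96.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_models):
        # x and y are independent noise: there is nothing real to find.
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [rng.gauss(0, 1) for _ in range(n_obs)]
        z = math.atanh(pearson_r(xs, ys)) * math.sqrt(n_obs - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits / n_models


rate = false_positive_rate()
print(f"Share of pure-noise models that look significant: {rate:.3f}")
```

The printed share should land in the neighborhood of 0.05. Scaling n_models up changes the count of spurious "discoveries," not the proportion.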

When doing this kind of analysis, be very careful what you ask for, because you will get it. Holding back part of your data is the first line of defense against data mining: build the model on one portion of the data and reserve the rest for testing it. This holdback sample can be a period of time or a cross section of data. The cross-sectional holdback works where there is enough data to do this, as in the analysis of individual stocks. You can use stocks with symbols starting with A through L for model building and save M through Z for verification purposes.
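Both kinds of holdback can be sketched in a few lines of Python. The function names, the ticker list, and the dated observations below are hypothetical and exist only for illustration. The cutoff date is an arbitrary assumption, and dates are assumed to be ISO-formatted strings so that they sort correctly as plain text.

```python
def split_by_symbol(symbols, last_build_letter="L"):
    """Cross-sectional holdback: A through L build the model, the rest verify.

    Symbols that do not start with a letter in the build range (including
    empty strings or symbols starting with a digit) fall into the holdback.
    """
    build, holdback = [], []
    for symbol in symbols:
        first = symbol.strip().upper()[:1]
        if "A" <= first <= last_build_letter:
            build.append(symbol)
        else:
            holdback.append(symbol)
    return build, holdback


def split_by_time(observations, cutoff):
    """Time holdback: rows dated before the cutoff build the model.

    Each observation is a (date, value) pair with an ISO date string.
    """
    build = [obs for obs in observations if obs[0] < cutoff]
    holdback = [obs for obs in observations if obs[0] >= cutoff]
    return build, holdback


# Hypothetical inputs, for illustration only.
tickers = ["AAPL", "GE", "IBM", "MSFT", "XOM"]
build_set, verify_set = split_by_symbol(tickers)
print(build_set)   # ['AAPL', 'GE', 'IBM']
print(verify_set)  # ['MSFT', 'XOM']

prices = [("2007-06-29", 101.2), ("2007-12-31", 98.7), ("2008-06-30", 95.1)]
early, late = split_by_time(prices, cutoff="2008-01-01")
print(len(early), len(late))  # 2 1
```

The point of keeping the split mechanical and fixed in advance is that the verification set is never consulted while the model is being built.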

It is possible to mine these holdback ...

