We have already encountered some pitfalls of data analysis. In one example a problem arose because it had been overlooked that carriers and controls were not properly matched (see Section 2.3). Another example concerned the fact that it is no longer possible to calculate reliable P-values after one has looked at the data (see Section 2.4.4). This does not mean that one should not look at the data – on the contrary – but merely that one should not fool oneself and others with spurious P-values. The present chapter offers examples of three further common pitfalls: Simpson’s paradox, problems caused by unrecognized missingness, and conceptual pitfalls of regression. For the sake of clarity, they shall be illustrated with relatively small examples. But with larger data sets all these pitfalls become more difficult to recognize, and I believe that with the advent of data mining, that is, with the unsupervised grinding through massive data sets, they have become particularly pernicious. Good traps are camouflaged, and data analytic pitfalls are no exception – they often hide behind the smokescreen provided by a complex application.
The problem with Simpson’s paradox is that few textbooks and hardly any statistics courses draw attention to it.
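To make the paradox concrete before the examples that follow, here is a minimal sketch with hypothetical counts (the numbers are assumptions chosen for illustration, not data from this chapter): treatment A has a higher success rate than treatment B within every stratum, yet B appears better once the strata are pooled, because the treatments were applied to the strata in very unequal proportions.

```python
# Hypothetical (successes, trials) per treatment, split by stratum.
# These counts are invented purely to exhibit the reversal.
data = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success proportion for a (successes, trials) pair."""
    return successes / trials

# Within each stratum, A beats B:
for stratum in ("small", "large"):
    a = rate(*data["A"][stratum])
    b = rate(*data["B"][stratum])
    print(f"{stratum}: A={a:.3f}  B={b:.3f}")

# Pooling the strata reverses the comparison:
a_total = rate(*[sum(pair) for pair in zip(*data["A"].values())])
b_total = rate(*[sum(pair) for pair in zip(*data["B"].values())])
print(f"overall: A={a_total:.3f}  B={b_total:.3f}")
```

Running this shows A ahead in both strata (0.931 vs. 0.867 and 0.730 vs. 0.688) but behind overall (0.780 vs. 0.826): the aggregate comparison is confounded by the stratum variable, which is the essence of the paradox.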
Missing values are often considered a mere nuisance because a data matrix with holes in it cannot be handled by the standard formulas of linear algebra. Therefore, some (hopefully sensible) values are ...