Chapter 11. Dealing with Messy Data

As mentioned in the last chapter, analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.

What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are:

  • Many missing values ...

Get R: Data Analysis and Visualization now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.