3.1 OVERVIEW

Preparing the data is one of the most time-consuming parts of any data analysis or data mining project. This chapter outlines the concepts and steps necessary to prepare a data set prior to any data analysis or data mining exercise. How the data is collected and prepared is critical to the confidence with which decisions can be made. The data needs to be pulled together into a table, which may involve integrating data from multiple sources. Once the data is in a tabular format, it should be fully characterized. The data should also be cleaned by resolving any ambiguities and errors and by removing redundant and problematic data. Certain columns of data can be removed if it is obvious that they would not be useful in any analysis. For a number of reasons, new columns of data may need to be calculated. Finally, the table should be divided, where appropriate, into subsets that either simplify the analysis or allow specific questions to be answered more easily.
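The sequence of steps described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the two small data sources, the column names, and every function name (integrate, characterize, clean, drop_columns, add_columns, subset) are assumptions made for the example and do not come from the text. In particular, treating rows with a repeated id as redundant is an assumption of this sketch, not a general rule.

```python
from statistics import mean

# Hypothetical source 1: customer records (all values are illustrative).
customers = [
    {"id": 1, "name": "Ann", "age": "34", "region": "north"},
    {"id": 2, "name": "Bob", "age": "", "region": "North "},
    {"id": 2, "name": "Bob", "age": "", "region": "North "},  # redundant row
    {"id": 3, "name": "Cy", "age": "51", "region": "south"},
]
# Hypothetical source 2: total purchases keyed by customer id.
purchases = {1: 120.0, 2: 80.0, 3: 310.0}


def integrate(rows, totals):
    """Pull both sources together into one table (a list of dict rows)."""
    return [{**row, "total": totals.get(row["id"])} for row in rows]


def characterize(table):
    """Summarize each column: count of missing and of distinct values."""
    return {
        col: {
            "missing": sum(1 for r in table if r[col] in ("", None)),
            "distinct": len({r[col] for r in table}),
        }
        for col in table[0].keys()
    }


def clean(table):
    """Normalize spellings, convert types, and drop rows with a repeated id."""
    seen, cleaned = set(), []
    for row in table:
        if row["id"] in seen:  # assumption: a repeated id marks a redundant row
            continue
        seen.add(row["id"])
        cleaned.append({
            **row,
            "region": row["region"].strip().lower(),
            "age": int(row["age"]) if row["age"] else None,
        })
    return cleaned


def drop_columns(table, names):
    """Remove columns judged not useful for the analysis."""
    return [{k: v for k, v in row.items() if k not in names} for row in table]


def add_columns(table):
    """Calculate a new column from an existing one."""
    average = mean(r["total"] for r in table)
    return [{**r, "above_average": r["total"] > average} for r in table]


def subset(table, region):
    """Divide the table into a subset that answers a specific question."""
    return [r for r in table if r["region"] == region]


table = integrate(customers, purchases)
summary = characterize(table)          # characterize before cleaning
table = clean(table)
table = drop_columns(table, {"name"})
table = add_columns(table)
north = subset(table, "north")
```

Characterizing the table before cleaning it is a deliberate choice in this sketch: the summary exposes the missing ages and the inconsistent region spellings that the cleaning step then has to resolve.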

Details concerning the steps taken to prepare the data for analysis should be recorded. This not only provides documentation of the activities performed so far, but also provides a methodology to apply to a similar data set in the future. In addition, the steps will be important when validating the results since these records will show any assumptions made about the data.
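One possible way to keep such a record is to log each preparation step as it is applied. The sketch below is an assumption-laden illustration, not a prescribed method: the PreparationLog class, its apply and to_json methods, and the two example steps are all hypothetical names invented here. Each entry stores what was done, any assumption made about the data, and the row counts before and after the step.

```python
import json


class PreparationLog:
    """Hypothetical helper: applies preparation steps and records each one."""

    def __init__(self):
        self.entries = []

    def apply(self, table, step, description, assumption=None):
        """Run one step on the table and record what it did."""
        result = step(table)
        self.entries.append({
            "step": step.__name__,
            "description": description,
            "assumption": assumption,
            "rows_before": len(table),
            "rows_after": len(result),
        })
        return result

    def to_json(self):
        """Serialize the record so it can be stored with the results."""
        return json.dumps(self.entries, indent=2)


def drop_missing_age(table):
    """Example step: remove rows that have no recorded age."""
    return [row for row in table if row.get("age") is not None]


def add_age_group(table):
    """Example step: derive a new column from the age column."""
    return [
        {**row, "age_group": "under_40" if row["age"] < 40 else "40_plus"}
        for row in table
    ]


log = PreparationLog()
data = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
data = log.apply(data, drop_missing_age, "Removed rows with no recorded age",
                 assumption="Rows without an age can be excluded safely")
data = log.apply(data, add_age_group, "Derived an age_group column from age")
print(log.to_json())
```

Because the steps are ordinary functions, the same sequence could be reapplied to a similar data set later, and the stored assumptions are available when the results are validated.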

The following chapter outlines the process of preparing data for analysis. It includes information on the sources of data along with methods for characterizing, ...
