Chapter 2 DATA PREPROCESSING

  2.1 Why Do We Need to Preprocess the Data?
  2.2 Data Cleaning
  2.3 Handling Missing Data
  2.4 Identifying Misclassifications
  2.5 Graphical Methods for Identifying Outliers
  2.6 Measures of Center and Spread
  2.7 Data Transformation
  2.8 Min-Max Normalization
  2.9 Z-Score Standardization
  2.10 Decimal Scaling
  2.11 Transformations to Achieve Normality
  2.12 Numerical Methods for Identifying Outliers
  2.13 Flag Variables
  2.14 Transforming Categorical Variables into Numerical Variables
  2.15 Binning Numerical Variables
  2.16 Reclassifying Categorical Variables
  2.17 Adding an Index Field
  2.18 Removing Variables that are Not Useful
  2.19 Variables that Should Probably Not Be Removed
  2.20 Removal of Duplicate Records
  2.21 A Word About ID Fields
  The R Zone
  References
  Exercises
  Hands-On Analysis

Chapter 1 introduced data mining and the CRISP-DM standard process for data mining model development. In Phase 1 of that process, business understanding or research understanding, businesses and researchers first state the project objectives, then translate those objectives into a data mining problem definition, and finally prepare a preliminary strategy for achieving them.

In this chapter, we examine the next two phases of the CRISP-DM standard process: data understanding and data preparation. We will show how to evaluate the quality of the data, clean the raw data, deal with missing data, and perform transformations on ...
