Data cleaning

Data cleaning, also known as data cleansing or data scrubbing, is a process consisting of the following steps:

  1. Identifying inaccurate, incomplete, irrelevant, or corrupted data to remove it from further processing
  2. Parsing data, extracting information of interest, or validating whether a string of data is in an acceptable format
  3. Transforming data into a common encoding format, for example, UTF-8 or int32, time scale, or a normalized range
  4. Transforming data into a common data schema; for instance, if we collect temperature measurements from different types of sensors, we might want them to have the same structure

Get Machine Learning in Java - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.