Cleansing data

Data from the real world is not always as perfect as we would like it to be. In some cases, the errors in the data are so critical that the only option is to report them or even abort the process.

There is, however, a different kind of issue with data: minor problems that can be fixed somehow, as in the following examples:

  • You have a field that contains years. Among the values, you see 2912. This can be considered a typo; assume that the proper value is 2012.
  • You have a string that represents the name of a country, and the names are supposed to belong to a predefined list of valid countries. However, you see the values USA, U.S.A., and United States. On your list, you have only USA as valid, but it is ...
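
The two examples above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pentaho Data Integration implements cleansing: the lookup tables `YEAR_FIXES` and `COUNTRY_ALIASES`, the helper functions, and the sample rows are all hypothetical, and the sketch assumes that the variant spellings are meant to be mapped to the single valid name, USA.

```python
# Hypothetical correction table for known year typos (assumption: 2912 means 2012).
YEAR_FIXES = {2912: 2012}

# Hypothetical lookup table: variant spellings mapped to the one valid name.
COUNTRY_ALIASES = {
    "USA": "USA",
    "U.S.A.": "USA",
    "United States": "USA",
}


def clean_year(year):
    """Return the corrected year if it is a known typo, else the year unchanged."""
    return YEAR_FIXES.get(year, year)


def clean_country(name):
    """Map a known variant to its valid name; leave unknown values untouched."""
    stripped = name.strip()
    return COUNTRY_ALIASES.get(stripped, stripped)


# Made-up sample rows, for illustration only.
rows = [
    {"year": 2912, "country": "U.S.A."},
    {"year": 2011, "country": "United States"},
]

cleaned = [
    {"year": clean_year(r["year"]), "country": clean_country(r["country"])}
    for r in rows
]
print(cleaned)
```

Values that are not in the lookup tables pass through unchanged, so a stricter check (report or abort) can still be applied to them afterwards.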
