Dealing with non-exact matches

When values are entered manually, the data commonly shows problems such as the following:

  • Data with typographical errors
  • Mix of upper and lowercase letters

In languages other than English, it's also common to find the following:

  • Missing accent marks
  • Words with special characters, such as tradução or niño, typed as traducao or ninio

As an example, suppose that we have a field containing the names of states in the USA. Among the values, we could have Hawaii, Hawai, and Howaii. Although the wrong values are simple typos, none of the steps mentioned earlier would correct them to the proper value, Hawaii. An alternative technique can fix the issue: fuzzy string searching, a technique ...
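
To make the idea concrete, here is a minimal Python sketch of fuzzy matching applied to the Hawaii example. It is not taken from the book and does not show how PDI does it. It assumes Levenshtein (edit) distance as the similarity measure, which is only one of several fuzzy metrics. The names levenshtein, closest_match, VALID_STATES, and the max_distance threshold of 2 are all hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_match(value, candidates, max_distance=2):
    """Return the candidate nearest to value, or None if none is close enough.

    The comparison ignores case; max_distance is an assumed threshold.
    """
    best = min(candidates, key=lambda c: levenshtein(value.lower(), c.lower()))
    if levenshtein(value.lower(), best.lower()) <= max_distance:
        return best
    return None


# Hypothetical reference list of valid values (shortened for the example).
VALID_STATES = ["Hawaii", "Idaho", "Iowa", "Ohio"]

if __name__ == "__main__":
    for raw in ["Hawaii", "Hawai", "Howaii"]:
        print(raw, "->", closest_match(raw, VALID_STATES))
```

With this reference list, both Hawai and Howaii are one edit away from Hawaii and resolve to it, while a value that is far from every candidate returns None instead of being forced onto a wrong match. The threshold is a trade-off: a larger max_distance catches more typos but raises the risk of matching unrelated values.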
