When values were entered manually, it's common to find the following:
- Data with typographical errors
- Mix of upper and lowercase letters
In languages other than English, it's also common to have the following:
- Missing accent marks
- Words with special characters, such as tradução or niño, typed as traducao or ninio
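Mixed case and missing accent marks can often be handled by normalizing the values before comparing them. The following is a minimal Python sketch of that idea, not a prescribed method: the function name normalize is hypothetical, and it assumes that dropping accent marks altogether is acceptable for the comparison.

```python
import unicodedata

def normalize(text):
    """Return a case-insensitive, accent-free version of text for comparisons."""
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, tildes, cedillas), keep the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching
    return stripped.casefold()

print(normalize("tradução"))  # traducao
print(normalize("Niño"))      # nino
```

Note that niño becomes nino, not ninio, so a value typed as ninio would still not match after this kind of normalization.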
As an example, suppose that we have a field containing the names of states in the USA. Among the values, we could have Hawaii, Hawai, and Howaii. Although the last two are simple typos, none of the steps mentioned earlier would correct them to the proper value, Hawaii. An alternative technique addresses this issue: fuzzy string searching, a technique ...
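To make the idea concrete, here is a small sketch in Python that uses the standard library's difflib module to map a misspelled value to the closest entry in a list of valid names. The list VALID_STATES, the function correct_state, and the 0.8 similarity cutoff are assumptions chosen for illustration; difflib is only one of several ways to measure string similarity.

```python
from difflib import get_close_matches

# Hypothetical reference list; a real one would hold every valid state name
VALID_STATES = ["Hawaii", "Idaho", "Iowa", "Ohio", "Utah"]

def correct_state(value, cutoff=0.8):
    """Return the most similar valid state name, or None if nothing is close enough."""
    # n=1 asks for the single best candidate; cutoff is the minimum similarity (0 to 1)
    matches = get_close_matches(value, VALID_STATES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["Hawaii", "Hawai", "Howaii"]:
    print(raw, "->", correct_state(raw))
```

With this cutoff, all three values resolve to Hawaii. A lower cutoff tolerates more typos but raises the risk of matching the wrong name, so the threshold needs tuning against real data.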