10.5. String Scrubbing

The first pass at scrubbing string data is usually to get it into the proper case. In particular, older mainframe data is in EBCDIC and it is upper-case only.

People’s names are one of the hardest pieces of data to scrub because the rules are so irregular. This is one case where the best solution is to get a specialized third-party tool and use it. I strongly recommend getting a copy of The Math, Myth & Magic of Name Search and Matching from SSA (www.searchsoftware.com) for a detailed discussion of the problems.

However, there are some data elements that we can validate with a little more work. We already mentioned the SIMILAR TO predicate for regular expressions. You can find regular expressions at http://regexlib.com/ ...

Get Joe Celko's Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.