Chapter 30

Handling Bad Data with the Fuzzy Lookup

More often than not, when you are working in the real world, data is not going to be perfect like it is in the AdventureWorks2012 database. Real-world situations call for cleansing dirty data or data that has abnormalities like misspellings or truncation.

Imagine you are attempting to retrieve a foreign key from a dimension table, but, strangely, you find rows without a match. Upon investigation, you find bad data is being supplied to you. One technique might be to divert these rows without matches to a table to be dealt with later; another might be to just add the bad data regardless of misspellings and other mishaps that occur during data entry.

The Fuzzy Lookup Transform, discussed in this lesson, and the Fuzzy Grouping Transform, discussed in the next lesson, gives other alternatives to dealing with dirty data while reducing your number of unmatched rows. The Fuzzy Lookup Transform matches input records with data that has already been cleansed in a reference table. It returns the match and can also indicate the quality of the match. This way you know the likelihood of the match being correct.

NOTE A best practice tip is to use the Fuzzy Lookup Transform only after trying a regular lookup on the field first. The Fuzzy Lookup Transform is a very expensive operation that builds specialized indexes of the input stream and the reference data for comparison purposes. Therefore, it is recommended to first use a regular Lookup Transform ...

Get Knight's Microsoft SQL Server 2012 Integration Services 24-Hour Trainer now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.