Chapter 26. Handling Bad Data with the Fuzzy Lookup Transform

More often than not when you are working in the real world, data is not going to be perfect like it is in the AdventureWorks database. Real-world situations call for cleansing dirty data or data that has abnormalities like misspellings and truncation.

Imagine you are attempting to retrieve a foreign key from a dimension table, but strangely you find rows without a match. Upon investigation you find bad data is being supplied to you. One technique might be to divert these rows without matches to a table to be dealt with later, or another might be just to add the bad data regardless of misspellings and other mishaps that occur during data entry.

The Fuzzy Lookup Transform, discussed in this lesson, and Fuzzy Grouping Transform, discussed in the next lesson, give another alternative to dealing with dirty data while reducing your number of unmatched rows. The Fuzzy Lookup Transform matches input records with data that has already been cleansed in a reference table. It returns the match and can also indicate the quality of the match. This way you know the likelihood of the match being correct.

A best practice tip is to use the Fuzzy Lookup Transform only after trying a regular lookup on the field. The Fuzzy Lookup Transform is a very expensive operation that builds specialized indexes of the input stream and the reference data for comparison intentions. Therefore, it is recommended to first use a regular Lookup Transform and ...

Get Knight's 24-Hour Trainer: Microsoft® SQL Server® 2008 Integration Services now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.