Chapter 27. Removing Duplicates with the Fuzzy Grouping Transform

In the previous lesson you saw how to prevent bad data from being loaded in your dimension tables with the Fuzzy Lookup Transform, but what if the bad data is already in your table or if you are just beginning to build your data warehouse?

In these circumstances you can use the Fuzzy Grouping Transform to examine the contents of suspect fields and provide groupings of similar words. The matching information provided by this transform can be used to clean up the table and eliminate redundancy.

The Fuzzy Grouping Transform uses the same logic as the Fuzzy Lookup Transform and therefore requires many of the same things. It must have a connection to an OLE DB Connection Manager to generate temporary tables that the transform uses in its algorithm. At development time the Connection Manager tab is where you make this setting.

Also just as was the case with the Fuzzy Lookup Transform, this transform expects an input stream with a string, either a DT_WSTR or DT_STR data type. The Columns tab of the Fuzzy Grouping Transform Editor (which you open by double-clicking the transform), shown in Figure 27-1, is where you select the string field that you want to be analyzed and grouped into logical matches. Notice in the top part of the Columns tab that each column can also be selected to Pass Through, which means the data is not analyzed, but is accessible in the output stream. If you move down to the bottom part of the Columns tab, ...

Get Knight's 24-Hour Trainer: Microsoft® SQL Server® 2008 Integration Services now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.