Chapter 31

Removing Duplicates with the Fuzzy Grouping Transform

In the previous lesson, you saw how to use the Fuzzy Lookup Transform to prevent bad data from being loaded in your dimension tables, but what if the bad data is already in your table or if you are just beginning to build your data warehouse?

In these circumstances, you can use the Fuzzy Grouping Transform to examine the contents of suspect fields and provide groupings of similar words. You can use the matching information provided by this transform to clean up the table and eliminate redundancy.

The Fuzzy Grouping Transform uses the same logic as the Fuzzy Lookup Transform, and therefore requires many of the same things. It must have a connection to an OLE DB Connection Manager to generate temporary tables that the transform uses in its algorithm. At development time, the Connection Manager tab is where you make this setting.

Also, just as was the case with the Fuzzy Lookup Transform, this transform expects an input stream with a string, either DT_WSTR or DT_STR data type. The Columns tab of the Fuzzy Grouping Transformation Editor (which you open by double-clicking the transform), shown in Figure 31-1, is where you select the string field that you want to be analyzed and grouped into logical matches. Notice in the top part of the Columns tab that you can also check Pass Through on each column, which means the data is not analyzed, but is accessible in the output stream. If you move down to the bottom part of the ...

Get Knight's Microsoft SQL Server 2012 Integration Services 24-Hour Trainer now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.