6

Data-Cleansing Design

At some point in your career, you’ve probably been told that a given file should always be clean, and there’s no reason to spend time working on procedures to protect yourself from a problem. Inevitably, the impossible happens, and that previously perfect file has a problem, causing you to wake up at two o’clock in the morning to correct the production problem. If this has ever happened to you, or if you’re trying to not fall victim to this scenario, then this chapter is for you.

In the classic Problem-Design-Solution pattern, this chapter teaches you how to protect your package from breakages from future data problems, and provides new patterns for exploring every realm of data cleansing prior to loading or updating the data. Here’s what you can expect in this chapter:

  • “Problem” — The “Problem” section answers the question, “How clean is my data?” The discussion shows how to determine the answer to this question by employing the built-in functionality of the Data Profiling Task.
  • “Design” — After the file or data set is in the pipeline, you may need to protect yourself from unexpected conditions prior to loading the data. In the “Design” section of this chapter, you learn about the methods for using the Script transform to either cleanse or direct bad data elsewhere. You also see how to use the Fuzzy Grouping and Lookup transforms to match the data on non-exact means. Finally, you find out how to cleanse a single file that has two different types of data ...

Get Microsoft® SQL Server® 2008 Integration Services: Problem-Design-Solution now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.