For many people, the core value of ETL is hidden behind the T, which denotes the Transform capabilities. Many people are still not very comfortable with a single T for all the work that takes place in this phase. Ralph Kimball has referred to ETL as ECCD, short for Extract, Cleanse, Conform, and Deliver, and Matt Casters used the recursive acronym Kettle as a name for the ETL tool where the double T stands for Transportation and Transformation. In any case, this chapter addresses the four of the 34 subsystems that cover cleansing and conforming data, or more specifically:
Subsystem 4: Data-Cleansing
Subsystem 5: Error Event Schema
Subsystem 6: Audit Dimension Assembler
Subsystem 7: Deduplication
The examples in this chapter are all based on the Sakila customer and address tables, but here we have messed up the data a bit in order to make the cleansing steps actually do something. We created duplicate records, misspelled some names, and added a few extra rows with our own information. The script needed to make these modifications,
sakilamods.sql, can be downloaded from the book's companion site at
www.wiley.com/go/kettlesolutions, but you can, of course, make your own modifications as well. The requirements for having the example transformations perform the designated task are described for each example to make it easy to follow along.
A large part of this chapter is devoted to data validation because before you can cleanse data, it must be clear which ...