Chapter 11. Back Room ETL and Data Quality

With the target dimensional models designed, it's time to turn our attention to the back room. Designing and developing behind the scenes extract, transform, and load (ETL) systems consumes the lion's share of effort during a DW/BI project.

This chapter begins with guidance for planning the ETL system, including an overview of the 34 subsystems required to extract, clean, conform, and finally deliver the data to the data warehouse's front room, along with the requisite system management infrastructure. These subsystems must be considered whether you're using an ETL tool or not.

The second section of this chapter dives into data quality concerns. We begin by describing the business benefits of clean data, and then describe a comprehensive architecture for capturing and monitoring quality defects. Specific techniques for cleaning substandard data are discussed.

From there, we focus on building the dimension and fact tables, from surrogate key pipelines to dealing with late-arriving facts and dimensions. Finally, we describe the implications of moving to more real-time ETL processing.

Planning the ETL System

The first two articles in this section were written in late 2004 as The Data Warehouse ETL Toolkit was being released; subsequent articles in this section address the data structures in the back room, the pros and cons of using a commercial ETL tool, as well as techniques for change data capture and integrating data from external parties.

Get The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.