You are previewing The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data.
O'Reilly logo
The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data

Book Description

  • Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies

  • Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing-data staging, or the extract, transform, load (ETL) process

  • Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse

  • Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality

Table of Contents

  1. Copyright
  2. Credits
  3. Acknowledgments
  4. About the Authors
  5. Introduction
  6. I. Requirements, Realities, and Architecture
    1. 1. Surrounding the Requirements
      1. 1.1. Requirements
        1. 1.1.1. Business Needs
        2. 1.1.2. Compliance Requirements
        3. 1.1.3. Data Profiling
        4. 1.1.4. Security Requirements
        5. 1.1.5. Data Integration
        6. 1.1.6. Data Latency
        7. 1.1.7. Archiving and Lineage
        8. 1.1.8. End User Delivery Interfaces
        9. 1.1.9. Available Skills
        10. 1.1.10. Legacy Licenses
      2. 1.2. Architecture
        1. 1.2.1. ETL Tool versus Hand Coding (Buy a Tool Suite or Roll Your Own?)
        2. 1.2.2. The Back Room – Preparing the Data
        3. 1.2.3. The Front Room – Data Access
      3. 1.3. The Mission of the Data Warehouse
        1. 1.3.1. What the Data Warehouse Is
        2. 1.3.2. What the Data Warehouse Is Not
        3. 1.3.3. Industry Terms Not Used Consistently
        4. 1.3.4. Resolving Architectural Conflict: The Hybrid Bus Approach
        5. 1.3.5. How the Data Warehouse Is Changing
      4. 1.4. The Mission of the ETL Team
    2. 2. ETL Data Structures
      1. 2.1. To Stage or Not to Stage
      2. 2.2. Designing the Staging Area
      3. 2.3. Data Structures in the ETL System
        1. 2.3.1. Flat Files
        2. 2.3.2. XML Data Sets
        3. 2.3.3. Relational Tables
        4. 2.3.4. Independent DBMS Working Tables
        5. 2.3.5. Third Normal Form Entity/Relation Models
        6. 2.3.6. Nonrelational Data Sources
        7. 2.3.7. Dimensional Data Models: The Handoff from the Back Room to the Front Room
        8. 2.3.8. Fact Tables
        9. 2.3.9. Dimension Tables
        10. 2.3.10. Atomic and Aggregate Fact Tables
        11. 2.3.11. Surrogate Key Mapping Tables
      4. 2.4. Planning and Design Standards
        1. 2.4.1. Impact Analysis
        2. 2.4.2. Metadata Capture
        3. 2.4.3. Naming Conventions
        4. 2.4.4. Auditing Data Transformation Steps
      5. 2.5. Summary
  7. II. Data Flow
    1. 3. Extracting
      1. 3.1. Part 1: The Logical Data Map
        1. 3.1.1. Designing Logical Before Physical
      2. 3.2. Inside the Logical Data Map
        1. 3.2.1. Components of the Logical Data Map
        2. 3.2.2. Using Tools for the Logical Data Map
      3. 3.3. Building the Logical Data Map
        1. 3.3.1. Data Discovery Phase
        2. 3.3.2. Data Content Analysis
        3. 3.3.3. Collecting Business Rules in the ETL Process
      4. 3.4. Integrating Heterogeneous Data Sources
        1. 3.4.1. Part 2: The Challenge of Extracting from Disparate Platforms
        2. 3.4.2. Connecting to Diverse Sources through ODBC
      5. 3.5. Mainframe Sources
        1. 3.5.1. Working with COBOL Copybooks
        2. 3.5.2. EBCDIC Character Set
        3. 3.5.3. Converting EBCDIC to ASCII
        4. 3.5.4. Transferring Data between Platforms
        5. 3.5.5. Handling Mainframe Numeric Data
        6. 3.5.6. Using PICtures
        7. 3.5.7. Unpacking Packed Decimals
        8. 3.5.8. Working with Redefined Fields
        9. 3.5.9. Multiple OCCURS
        10. 3.5.10. Managing Multiple Mainframe Record Type Files
        11. 3.5.11. Handling Mainframe Variable Record Lengths
      6. 3.6. Flat Files
        1. 3.6.1. Processing Fixed Length Flat Files
        2. 3.6.2. Processing Delimited Flat Files
      7. 3.7. XML Sources
        1. 3.7.1. Character Sets
        2. 3.7.2. XML Meta Data
      8. 3.8. Web Log Sources
        1. 3.8.1. W3C Common and Extended Formats
        2. 3.8.2. Name Value Pairs in Web Logs
      9. 3.9. ERP System Sources
      10. 3.10. Part 3: Extracting Changed Data
        1. 3.10.1. Detecting Changes
        2. 3.10.2. Extraction Tips
        3. 3.10.3. Detecting Deleted or Overwritten Fact Records at the Source
      11. 3.11. Summary
    2. 4. Cleaning and Conforming
      1. 4.1. Defining Data Quality
      2. 4.2. Assumptions
      3. 4.3. Part 1: Design Objectives
        1. 4.3.1. Understand Your Key Constituencies
        2. 4.3.2. Competing Factors
        3. 4.3.3. Balancing Conflicting Priorities
        4. 4.3.4. Formulate a Policy
      4. 4.4. Part 2: Cleaning Deliverables
        1. 4.4.1. Data Profiling Deliverable
        2. 4.4.2. Cleaning Deliverable #1: Error Event Table
        3. 4.4.3. Cleaning Deliverable #2: Audit Dimension
        4. 4.4.4. Audit Dimension Fine Points
      5. 4.5. Part 3: Screens and Their Measurements
        1. 4.5.1. Anomaly Detection Phase
        2. 4.5.2. Types of Enforcement
        3. 4.5.3. Column Property Enforcement
        4. 4.5.4. Structure Enforcement
        5. 4.5.5. Data and Value Rule Enforcement
        6. 4.5.6. Measurements Driving Screen Design
        7. 4.5.7. Overall Process Flow
        8. 4.5.8. The Show Must Go On—Usually
        9. 4.5.9. Screens
        10. 4.5.10. Known Table Row Counts
        11. 4.5.11. Column Nullity
        12. 4.5.12. Column Numeric and Date Ranges
        13. 4.5.13. Column Length Restriction
        14. 4.5.14. Column Explicit Valid Values
        15. 4.5.15. Column Explicit Invalid Values
        16. 4.5.16. Checking Table Row Count Reasonability
        17. 4.5.17. Checking Column Distribution Reasonability
        18. 4.5.18. General Data and Value Rule Reasonability
      6. 4.6. Part 4: Conforming Deliverables
        1. 4.6.1. Conformed Dimensions
        2. 4.6.2. Designing the Conformed Dimensions
        3. 4.6.3. Taking the Pledge
        4. 4.6.4. Permssible Variation of Conformed Dimensions
        5. 4.6.5. Conformed Facts
        6. 4.6.6. The Fact Table Provider
        7. 4.6.7. The Dimension Manager: Publishing Conformed Dimensions to Affected Fact Tables
        8. 4.6.8. Detailed Delivery Steps for Conformed Dimensions
        9. 4.6.9. Implementing the Conforming Modules
        10. 4.6.10. Matching Drives Deduplication
        11. 4.6.11. Surviving: Final Step of Conforming
        12. 4.6.12. Delivering
      7. 4.7. Summary
    3. 5. Delivering Dimension Tables
      1. 5.1. The Basic Structure of a Dimension
      2. 5.2. The Grain of a Dimension
      3. 5.3. The Basic Load Plan for a Dimension
      4. 5.4. Flat Dimensions and Snowflaked Dimensions
      5. 5.5. Date and Time Dimensions
      6. 5.6. Big Dimensions
      7. 5.7. Small Dimensions
      8. 5.8. One Dimension or Two
      9. 5.9. Dimensional Roles
      10. 5.10. Dimensions as Subdimensions of Another Dimension
      11. 5.11. Degenerate Dimensions
      12. 5.12. Slowly Changing Dimensions
      13. 5.13. Type 1 Slowly Changing Dimension (Overwrite)
      14. 5.14. Type 2 Slowly Changing Dimension (Partitioning History)
      15. 5.15. Precise Time Stamping of a Type 2 Slowly Changing Dimension
      16. 5.16. Type 3 Slowly Changing Dimension (Alternate Realities)
      17. 5.17. Hybrid Slowly Changing Dimensions
      18. 5.18. Late-Arriving Dimension Records and Correcting Bad Data
      19. 5.19. Multivalued Dimensions and Bridge Tables
      20. 5.20. Ragged Hierarchies and Bridge Tables
      21. 5.21. Technical Note: POPULATING HIERARCHY BRIDGE TABLES
      22. 5.22. Using Positional Attributes in a Dimension to Represent Text Facts
      23. 5.23. Summary
    4. 6. Delivering Fact Tables
      1. 6.1. The Basic Structure of a Fact Table
      2. 6.2. Guaranteeing Referential Integrity
      3. 6.3. Surrogate Key Pipeline
        1. 6.3.1. Using the Dimension Instead of a Lookup Table
      4. 6.4. Fundamental Grains
      5. 6.5. Transaction Grain Fact Tables
        1. 6.5.1. Periodic Snapshot Fact Tables
        2. 6.5.2. Accumulating Snapshot Fact Tables
      6. 6.6. Preparing for Loading Fact Tables
        1. 6.6.1. Managing Indexes
        2. 6.6.2. Managing Partitions
        3. 6.6.3. Outwitting the Rollback Log
        4. 6.6.4. Loading the Data
        5. 6.6.5. Incremental Loading
        6. 6.6.6. Inserting Facts
        7. 6.6.7. Updating and Correcting Facts
        8. 6.6.8. Negating Facts
        9. 6.6.9. Updating Facts
        10. 6.6.10. Deleting Facts
        11. 6.6.11. Physically Deleting Facts
        12. 6.6.12. Logically Deleting Facts
      7. 6.7. Factless Fact Tables
      8. 6.8. Augmenting a Type 1 Fact Table with Type 2 History
      9. 6.9. Graceful Modifications
      10. 6.10. Multiple Units of Measure in a Fact Table
      11. 6.11. Collecting Revenue in Multiple Currencies
      12. 6.12. Late Arriving Facts
      13. 6.13. Aggregations
        1. 6.13.1. Design Requirement #1
        2. 6.13.2. Design Requirement #2
        3. 6.13.3. Design Requirement #3
        4. 6.13.4. Design Requirement #4
        5. 6.13.5. Administering Aggregations, Including Materialized Views
      14. 6.14. Delivering Dimensional Data to OLAP Cubes
        1. 6.14.1. Cube Data Sources
        2. 6.14.2. Processing Dimensions
        3. 6.14.3. Changes in Dimension Data
        4. 6.14.4. Processing Facts
        5. 6.14.5. Integrating OLAP Processing into the ETL System
        6. 6.14.6. OLAP Wrap-up
      15. 6.15. Summary
  8. III. Implementation and Operations
    1. 7. Development
      1. 7.1. Current Marketplace ETL Tool Suite Offerings
      2. 7.2. Current Scripting Languages
      3. 7.3. Time Is of the Essence
        1. 7.3.1. Push Me or Pull Me
        2. 7.3.2. Ensuring Transfers with Sentinels
        3. 7.3.3. Sorting Data during Preload
        4. 7.3.4. Sorting on Mainframe Systems
        5. 7.3.5. Sorting on Unix and Windows Systems
        6. 7.3.6. Trimming the Fat (Filtering)
        7. 7.3.7. Extracting a Subset of the Source File Records on Mainframe Systems
        8. 7.3.8. Extracting a Subset of the Source File Fields
        9. 7.3.9. Extracting a Subset of the Source File Records on Unix and Windows Systems
        10. 7.3.10. Extracting a Subset of the Source File Fields
        11. 7.3.11. Creating Aggregated Extracts on Mainframe Systems
        12. 7.3.12. Creating Aggregated Extracts on UNIX and Windows Systems
      4. 7.4. Using Database Bulk Loader Utilities to Speed Inserts
        1. 7.4.1. Preparing for Bulk Load
      5. 7.5. Managing Database Features to Improve Performance
        1. 7.5.1. The Order of Things
        2. 7.5.2. The Effect of Aggregates and Group Bys on Performance
        3. 7.5.3. Performance Impact of Using Scalar Functions
        4. 7.5.4. Avoiding Triggers
        5. 7.5.5. Overcoming ODBC the Bottleneck
        6. 7.5.6. Benefiting from Parallel Processing
      6. 7.6. Troubleshooting Performance Problems
      7. 7.7. Increasing ETL Throughput
        1. 7.7.1. Reducing Input/Output Contention
        2. 7.7.2. Eliminating Database Reads/Writes
        3. 7.7.3. Filtering as Soon as Possible
        4. 7.7.4. Partitioning and Parallelizing
        5. 7.7.5. Updating Aggregates Incrementally
        6. 7.7.6. Taking Only What You Need
        7. 7.7.7. Bulk Loading/Eliminating Logging
        8. 7.7.8. Dropping Databases Constraints and Indexes
        9. 7.7.9. Eliminating Network Traffic
        10. 7.7.10. Letting the ETL Engine Do the Work
      8. 7.8. Summary
    2. 8. Operations
      1. 8.1. Scheduling and Support
        1. 8.1.1. Reliability, Availability, Manageability Analysis for ETL
        2. 8.1.2. Etl Scheduling 101
        3. 8.1.3. Scheduling Tools
        4. 8.1.4. Load Dependencies
        5. 8.1.5. Metadata
      2. 8.2. Migrating to Production
        1. 8.2.1. Operational Support for the Data Warehouse
        2. 8.2.2. Bundling Version Releases
        3. 8.2.3. Supporting the ETL System in Production
      3. 8.3. Achieving Optimal ETL Performance
        1. 8.3.1. Estimating Load Time
        2. 8.3.2. Vulnerabilities of Long-Running ETL processes
        3. 8.3.3. Minimizing the Risk of Load Failures
      4. 8.4. Purging Historic Data
      5. 8.5. Monitoring the ETL System
        1. 8.5.1. Measuring ETL Specific Performance Indicators
        2. 8.5.2. Measuring Infrastructure Performance Indicators
        3. 8.5.3. Measuring Data Warehouse Usage to Help Manage ETL Processes
      6. 8.6. Tuning ETL Processes
        1. 8.6.1. Explaining Database Overhead
      7. 8.7. ETL System Security
        1. 8.7.1. Securing the Development Environment
        2. 8.7.2. Securing the Production Environment
      8. 8.8. Short-Term Archiving and Recovery
      9. 8.9. Long-Term Archiving and Recovery
        1. 8.9.1. Media, Formats, Software, and Hardware
        2. 8.9.2. Obsolete Formats and Archaic Formats
        3. 8.9.3. Hard Copy, Standards, and Museums
        4. 8.9.4. Refreshing, Migrating, Emulating, and Encapsulating
      10. 8.10. Summary
    3. 9. Metadata
      1. 9.1. Defining Metadata
        1. 9.1.1. Metadata—What Is It?
        2. 9.1.2. Source System Metadata
        3. 9.1.3. Data-Staging Metadata
        4. 9.1.4. DBMS Metadata
        5. 9.1.5. Front Room Metadata
      2. 9.2. Business Metadata
        1. 9.2.1. Business Definitions
        2. 9.2.2. Source System Information
        3. 9.2.3. Data Warehouse Data Dictionary
        4. 9.2.4. Logical Data Maps
      3. 9.3. Technical Metadata
        1. 9.3.1. System Inventory
        2. 9.3.2. Data Models
        3. 9.3.3. Data Definitions
        4. 9.3.4. Business Rules
      4. 9.4. ETL-Generated Metadata
        1. 9.4.1. ETL Job Metadata
        2. 9.4.2. Transformation Metadata
        3. 9.4.3. Batch Metadata
        4. 9.4.4. Data Quality Error Event Metadata
        5. 9.4.5. Process Execution Metadata
      5. 9.5. Metadata Standards and Practices
        1. 9.5.1. Establishing Rudimentary Standards
        2. 9.5.2. Naming Conventions
      6. 9.6. Impact Analysis
      7. 9.7. Summary
    4. 10. Responsibilities
      1. 10.1. Planning and Leadership
        1. 10.1.1. Having Dedicated Leadership
        2. 10.1.2. Planning Large, Building Small
        3. 10.1.3. Hiring Qualified Developers
        4. 10.1.4. Building Teams with Database Expertise
        5. 10.1.5. Don't Try to Save the World
        6. 10.1.6. Enforcing Standardization
        7. 10.1.7. Monitoring, Auditing, and Publishing Statistics
        8. 10.1.8. Maintaining Documentation
        9. 10.1.9. Providing and Utilizing Metadata
        10. 10.1.10. Keeping It Simple
        11. 10.1.11. Optimizing Throughput
      2. 10.2. Managing the Project
        1. 10.2.1. Responsibility of the ETL Team
        2. 10.2.2. Defining the Project
        3. 10.2.3. Planning the Project
        4. 10.2.4. Determining the Tool Set
        5. 10.2.5. Staffing Your Project
        6. 10.2.6. Project Plan Guidelines
        7. 10.2.7. Managing Scope
      3. 10.3. Summary
  9. IV. Real Time Streaming ETL Systems
    1. 11. Real-Time ETL Systems
      1. 11.1. Why Real-Time ETL?
      2. 11.2. Defining Real-Time ETL
      3. 11.3. Challenges and Opportunities of Real-Time Data Warehousing
      4. 11.4. Real-Time Data Warehousing Review
        1. 11.4.1. Generation 1—The Operational Data Store
        2. 11.4.2. Generation 2—The Real-Time Partition
        3. 11.4.3. Recent CRM Trends
        4. 11.4.4. The Strategic Role of the Dimension Manager
      5. 11.5. Categorizing the Requirement
        1. 11.5.1. Data Freshness and Historical Needs
        2. 11.5.2. Reporting Only or Integration, Too?
        3. 11.5.3. Just the Facts or Dimension Changes, Too?
        4. 11.5.4. Alerts, Continuous Polling, or Nonevents?
        5. 11.5.5. Data Integration or Application Integration?
        6. 11.5.6. Point-to-Point versus Hub-and-Spoke
        7. 11.5.7. Customer Data Cleanup Considerations
      6. 11.6. Real-Time ETL Approaches
        1. 11.6.1. Microbatch ETL
        2. 11.6.2. Enterprise Application Integration
        3. 11.6.3. Capture, Transform, and Flow
        4. 11.6.4. Enterprise Information Integration
        5. 11.6.5. The Real-Time Dimension Manager
        6. 11.6.6. Microbatch Processing
        7. 11.6.7. Choosing an Approach—A Decision Guide
      7. 11.7. Summary
    2. 12. Conclusions
      1. 12.1. Deepening the Definition of ETL
      2. 12.2. The Future of Data Warehousing and ETL in Particular
        1. 12.2.1. Ongoing Evolution of ETL Systems