Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Book description

A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

  • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)

  • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace

  • Explains how to get Kettle solutions up and running, then follows the Kimball Group's 34 ETL subsystems model to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle

  • Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"

Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.

Table of contents

  1. Copyright
  2. About the Authors
  3. Credits
  4. Acknowledgments
  5. Introduction
    1. The Origins of Kettle
      1. The Design of Kettle
      2. Kettle Gets Some Traction
      3. Kettle Goes Open Source
    2. About This Book
      1. Who Should Read This Book
      2. What You Will Need to Use This Book
      3. What You Will Learn from This Book
    3. How This Book Is Organized
      1. Part I: Getting Started
      2. Part II: ETL
      3. Part III: Management and Deployment
      4. Part IV: Performance and Scalability
      5. Part V: Advanced Topics
      6. Appendixes
    4. Prerequisites
      1. Java
      2. MySQL
      3. SQL Power Architect
      4. Eclipse
    5. On the Website
    6. Further Resources
  6. I. Getting Started
    1. 1. ETL Primer
      1. 1.1. OLTP versus Data Warehousing
      2. 1.2. What Is ETL?
        1. 1.2.1. The Evolution of ETL Solutions
        2. 1.2.2. ETL Building Blocks
      3. 1.3. ETL, ELT, and EII
        1. 1.3.1. ELT
        2. 1.3.2. EII: Virtual Data Integration
      4. 1.4. Data Integration Challenges
        1. 1.4.1. Methodology: Agile BI
        2. 1.4.2. ETL Design
        3. 1.4.3. Data Acquisition
          1. 1.4.3.1. Beware of Spreadsheets
          2. 1.4.3.2. Design for Failure
          3. 1.4.3.3. Change Data Capture
        4. 1.4.4. Data Quality
          1. 1.4.4.1. Data Profiling
          2. 1.4.4.2. Data Validation
      5. 1.5. ETL Tool Requirements
        1. 1.5.1. Connectivity
        2. 1.5.2. Platform Independence
        3. 1.5.3. Scalability
        4. 1.5.4. Design Flexibility
        5. 1.5.5. Reuse
        6. 1.5.6. Extensibility
        7. 1.5.7. Data Transformations
        8. 1.5.8. Testing and Debugging
        9. 1.5.9. Lineage and Impact Analysis
        10. 1.5.10. Logging and Auditing
      6. 1.6. Summary
    2. 2. Kettle Concepts
      1. 2.1. Design Principles
      2. 2.2. The Building Blocks of Kettle Design
        1. 2.2.1. Transformations
          1. 2.2.1.1. Steps
          2. 2.2.1.2. Transformation Hops
          3. 2.2.1.3. Parallelism
          4. 2.2.1.4. Rows of Data
          5. 2.2.1.5. Data Conversion
            1. 2.2.1.5.1. Date to String Conversion
            2. 2.2.1.5.2. Numeric to String Conversion
            3. 2.2.1.5.3. Other Conversions
        2. 2.2.2. Jobs
          1. 2.2.2.1. Job Entries
          2. 2.2.2.2. Job Hops
          3. 2.2.2.3. Multiple Paths and Backtracking
          4. 2.2.2.4. Parallel Execution
          5. 2.2.2.5. Job Entry Results
        3. 2.2.3. Transformation or Job Metadata
        4. 2.2.4. Database Connections
          1. 2.2.4.1. Special Options
          2. 2.2.4.2. The Power of the Relational Database
          3. 2.2.4.3. Connections and Transactions
          4. 2.2.4.4. Database Clustering
        5. 2.2.5. Tools and Utilities
        6. 2.2.6. Repositories
        7. 2.2.7. Virtual File Systems
      3. 2.3. Parameters and Variables
        1. 2.3.1. Defining Variables
        2. 2.3.2. Named Parameters
        3. 2.3.3. Using Variables
      4. 2.4. Visual Programming
        1. 2.4.1. Getting Started
        2. 2.4.2. Creating New Steps
        3. 2.4.3. Putting It All Together
      5. 2.5. Summary
    3. 3. Installation and Configuration
      1. 3.1. Kettle Software Overview
        1. 3.1.1. Integrated Development Environment: Spoon
        2. 3.1.2. Command-Line Launchers: Kitchen and Pan
        3. 3.1.3. Job Server: Carte
        4. 3.1.4. Encr.bat and encr.sh
      2. 3.2. Installation
        1. 3.2.1. Java Environment
          1. 3.2.1.1. Installing Java Manually
          2. 3.2.1.2. Using Your Linux Package Management System
        2. 3.2.2. Installing Kettle
          1. 3.2.2.1. Versions and Releases
          2. 3.2.2.2. Archive Names and Formats
          3. 3.2.2.3. Downloading and Uncompressing
          4. 3.2.2.4. Running Kettle Programs
            1. 3.2.2.4.1. Windows
            2. 3.2.2.4.2. UNIX-like systems
          5. 3.2.2.5. Creating a Shortcut Icon or Launcher for Spoon
            1. 3.2.2.5.1. Adding a Windows Shortcut
            2. 3.2.2.5.2. Creating a Launcher for the GNOME Desktop
      3. 3.3. Configuration
        1. 3.3.1. Configuration Files and the .kettle Directory
          1. 3.3.1.1. .spoonrc
          2. 3.3.1.2. jdbc.properties
          3. 3.3.1.3. kettle.properties
          4. 3.3.1.4. kettle.pwd
          5. 3.3.1.5. repositories.xml
          6. 3.3.1.6. shared.xml
        2. 3.3.2. The Kettle Shell Scripts
          1. 3.3.2.1. General Structure of the Startup Scripts
          2. 3.3.2.2. Adding an Entry to the Classpath
          3. 3.3.2.3. Changing the Maximum Heap Size
        3. 3.3.3. Managing JDBC Drivers
      4. 3.4. Summary
    4. 4. An Example ETL Solution—Sakila
      1. 4.1. Sakila
        1. 4.1.1. The Sakila Sample Database
          1. 4.1.1.1. DVD Rental Business Process
          2. 4.1.1.2. Sakila Database Schema Diagram
          3. 4.1.1.3. Sakila Database Subject Areas
          4. 4.1.1.4. General Design Considerations
          5. 4.1.1.5. Installing the Sakila Sample Database
        2. 4.1.2. The Rental Star Schema
          1. 4.1.2.1. Rental Star Schema Diagram
          2. 4.1.2.2. Rental Fact Table
          3. 4.1.2.3. Dimension Tables
          4. 4.1.2.4. Keys and Change Data Capture
          5. 4.1.2.5. Installing the Rental Star Schema
      2. 4.2. Prerequisites and Some Basic Spoon Skills
        1. 4.2.1. Setting Up the ETL Solution
          1. 4.2.1.1. Creating Database Accounts
        2. 4.2.2. Working with Spoon
          1. 4.2.2.1. Opening Transformation and Job Files
          2. 4.2.2.2. Opening the Step's Configuration Dialog
          3. 4.2.2.3. Examining Streams
          4. 4.2.2.4. Running Jobs and Transformations
      3. 4.3. The Sample ETL Solution
        1. 4.3.1. Static, Generated Dimensions
          1. 4.3.1.1. Loading the dim_date Dimension Table
            1. 4.3.1.1.1. Generate 10 years
            2. 4.3.1.1.2. Day Sequence
            3. 4.3.1.1.3. Calculate Dimension Attributes
            4. 4.3.1.1.4. Load dim_date
            5. 4.3.1.1.5. Running the Transformation
          2. 4.3.1.2. Loading the dim_time Dimension Table
            1. 4.3.1.2.1. Generating Hours, Minutes, and Seconds
            2. 4.3.1.2.2. Cartesian Product
            3. 4.3.1.2.3. Calculate Time and Table out dim_time
            4. 4.3.1.2.4. Running the Transformation
        2. 4.3.2. Recurring Load
          1. 4.3.2.1. The load_rentals Job
            1. 4.3.2.1.1. Start
            2. 4.3.2.1.2. Various Transformation Job Entries
            3. 4.3.2.1.3. Different Types of Hops and Flow of Execution
            4. 4.3.2.1.4. Serial Execution
            5. 4.3.2.1.5. Mail Job Entries
            6. 4.3.2.1.6. Running the Job
          2. 4.3.2.2. The load_dim_staff Transformation
          3. 4.3.2.3. Database Connections
            1. 4.3.2.3.1. Changed Data Capture and Extraction
            2. 4.3.2.3.2. Converting and Recoding the Active Flag
            3. 4.3.2.3.3. Loading the dim_staff Type 2 Slowly Changing Dimension Table
          4. 4.3.2.4. The load_dim_customer Transformation
            1. 4.3.2.4.1. The Fetch Customer Address Subtransformation
          5. 4.3.2.5. The load_dim_store Transformation
          6. 4.3.2.6. The fetch_address Subtransformation
            1. 4.3.2.6.1. Address Data Lookup Cascade
            2. 4.3.2.6.2. More Denormalization: Concatenating Address Lines
            3. 4.3.2.6.3. Subtransformation Interface
          7. 4.3.2.7. The load_dim_actor Transformation
            1. 4.3.2.7.1. The Insert / Update Step Type
          8. 4.3.2.8. The load_dim_film Transformation
            1. 4.3.2.8.1. Splitting the special_features List
            2. 4.3.2.8.2. Flattening Individual Special Features to Yes/No Flags
            3. 4.3.2.8.3. Creating Flags for the Film Categories
            4. 4.3.2.8.4. Loading the dim_film Table
            5. 4.3.2.8.5. Loading the dim_film_actor_bridge Table
          9. 4.3.2.9. The load_fact_rental Transformation
            1. 4.3.2.9.1. Changed Data Capture
            2. 4.3.2.9.2. Obtaining Date and Time Smart Keys
            3. 4.3.2.9.3. Calculating Metrics
            4. 4.3.2.9.4. Looking Up Keys of Regular Dimension Tables
            5. 4.3.2.9.5. Looking Up Keys of Type 2 Slowly Changing Dimension Tables
            6. 4.3.2.9.6. Loading the Fact Table
      4. 4.4. Summary
  7. II. ETL
    1. 5. ETL Subsystems
      1. 5.1. Introduction to the 34 Subsystems
        1. 5.1.1. Extraction
          1. 5.1.1.1. Subsystems 1–3: Data Profiling, Change Data Capture, and Extraction
        2. 5.1.2. Cleaning and Conforming Data
          1. 5.1.2.1. Subsystem 4: Data Cleaning and Quality Screen Handler System
          2. 5.1.2.2. Subsystem 5: Error Event Handler
          3. 5.1.2.3. Subsystem 6: Audit Dimension Assembler
          4. 5.1.2.4. Subsystem 7: Deduplication System
          5. 5.1.2.5. Subsystem 8: Data Conformer
        3. 5.1.3. Data Delivery
          1. 5.1.3.1. Subsystem 9: Slowly Changing Dimension Processor
          2. 5.1.3.2. Subsystem 10: Surrogate Key Creation System
          3. 5.1.3.3. Subsystem 11: Hierarchy Dimension Builder
          4. 5.1.3.4. Subsystem 12: Special Dimension Builder
          5. 5.1.3.5. Subsystem 13: Fact Table Loader
          6. 5.1.3.6. Subsystem 14: Surrogate Key Pipeline
          7. 5.1.3.7. Subsystem 15: Multi-Valued Dimension Bridge Table Builder
          8. 5.1.3.8. Subsystem 16: Late-Arriving Data Handler
          9. 5.1.3.9. Subsystem 17: Dimension Manager System
          10. 5.1.3.10. Subsystem 18: Fact Table Provider System
          11. 5.1.3.11. Subsystem 19: Aggregate Builder
          12. 5.1.3.12. Subsystem 20: Multidimensional (OLAP) Cube Builder
          13. 5.1.3.13. Subsystem 21: Data Integration Manager
        4. 5.1.4. Managing the ETL Environment
      2. 5.2. Summary
    2. 6. Data Extraction
      1. 6.1. Kettle Data Extraction Overview
        1. 6.1.1. File-Based Extraction
          1. 6.1.1.1. Working with Text Files
          2. 6.1.1.2. Working with XML files
          3. 6.1.1.3. Special File Types
        2. 6.1.2. Database-Based Extraction
        3. 6.1.3. Web-Based Extraction
          1. 6.1.3.1. Text-Based Web Extraction
          2. 6.1.3.2. HTTP Client
          3. 6.1.3.3. Using SOAP
        4. 6.1.4. Stream-Based and Real-Time Extraction
      2. 6.2. Working with ERP and CRM Systems
        1. 6.2.1. ERP Challenges
        2. 6.2.2. Kettle ERP Plugins
        3. 6.2.3. Working with SAP Data
        4. 6.2.4. ERP and CDC Issues
      3. 6.3. Data Profiling
        1. 6.3.1. Using eobjects.org DataCleaner
          1. 6.3.1.1. Adding Profile Tasks
          2. 6.3.1.2. Adding Database Connections
          3. 6.3.1.3. Doing an Initial Profile
          4. 6.3.1.4. Working with Regular Expressions
          5. 6.3.1.5. Profiling and Exploring Results
          6. 6.3.1.6. Validating and Comparing Data
          7. 6.3.1.7. Using a Dictionary for Column Dependency Checks
          8. 6.3.1.8. Alternative Solutions
          9. 6.3.1.9. Text Profiling with Kettle
      4. 6.4. CDC: Change Data Capture
        1. 6.4.1. Source Data–Based CDC
        2. 6.4.2. Trigger-Based CDC
        3. 6.4.3. Snapshot-Based CDC
        4. 6.4.4. Log-Based CDC
        5. 6.4.5. Which CDC Alternative Should You Choose?
      5. 6.5. Delivering Data
      6. 6.6. Summary
    3. 7. Cleansing and Conforming
      1. 7.1. Data Cleansing
        1. 7.1.1. Data-Cleansing Steps
        2. 7.1.2. Using Reference Tables
          1. 7.1.2.1. Conforming Data Using Lookup Tables
          2. 7.1.2.2. Conforming Data Using Reference Tables
        3. 7.1.3. Data Validation
          1. 7.1.3.1. Applying Validation Rules
          2. 7.1.3.2. Validating Dependency Constraints
      2. 7.2. Error Handling
        1. 7.2.1. Handling Process Errors
          1. 7.2.1.1. Transformation Errors
        2. 7.2.2. Handling Data (Validation) Errors
      3. 7.3. Auditing Data and Process Quality
      4. 7.4. Deduplicating Data
        1. 7.4.1. Handling Exact Duplicates
        2. 7.4.2. The Problem of Non-Exact Duplicates
        3. 7.4.3. Building Deduplication Transforms
          1. 7.4.3.1. Step 1: Fuzzy Match
          2. 7.4.3.2. Step 2: Select Suspects
          3. 7.4.3.3. Step 3: Lookup Validation Value
          4. 7.4.3.4. Step 4: Filter Duplicates
      5. 7.5. Scripting
        1. 7.5.1. Formula
        2. 7.5.2. JavaScript
        3. 7.5.3. User-Defined Java Expressions
        4. 7.5.4. Regular Expressions
      6. 7.6. Summary
    4. 8. Handling Dimension Tables
      1. 8.1. Managing Keys
        1. 8.1.1. Managing Business Keys
          1. 8.1.1.1. Keys in the Source System
          2. 8.1.1.2. Keys in the Data Warehouse
          3. 8.1.1.3. Business Keys
          4. 8.1.1.4. Storing Business Keys
          5. 8.1.1.5. Looking Up Keys with Kettle
        2. 8.1.2. Generating Surrogate Keys
          1. 8.1.2.1. The "Add sequence" Step
            1. 8.1.2.1.1. Using Internal Counters for the "Add sequence" Step
            2. 8.1.2.1.2. Generating Surrogate Keys Based on a Counter
            3. 8.1.2.1.3. Dynamically Configuring the Sequence Offset
            4. 8.1.2.1.4. Surrogate Keys Based on a Database Sequence
          2. 8.1.2.2. Working with auto_increment or IDENTITY Columns
          3. 8.1.2.3. Keys for Slowly Changing Dimensions
      2. 8.2. Loading Dimension Tables
        1. 8.2.1. Snowflaked Dimension Tables
          1. 8.2.1.1. Top-Down Level-Wise Loading
          2. 8.2.1.2. Sakila Snowflake Example
          3. 8.2.1.3. Sample Transformation
          4. 8.2.1.4. Database Lookup Configuration
            1. 8.2.1.4.1. Connection, Lookup Schema, and Lookup Table
            2. 8.2.1.4.2. Lookup Key
            3. 8.2.1.4.3. Lookup Value
            4. 8.2.1.4.4. Cache Configuration
            5. 8.2.1.4.5. Lookup Failure
          5. 8.2.1.5. Sample Job
        2. 8.2.2. Star Schema Dimension Tables
          1. 8.2.2.1. Denormalization
          2. 8.2.2.2. Denormalizing to 1NF with the "Database lookup" Step
          3. 8.2.2.3. Change Data Capture
      3. 8.3. Slowly Changing Dimensions
        1. 8.3.1. Types of Slowly Changing Dimensions
        2. 8.3.2. Type 1 Slowly Changing Dimensions
          1. 8.3.2.1. The Insert / Update Step
            1. 8.3.2.1.1. Connection, Target Schema and Table, and Commit Size
            2. 8.3.2.1.2. Target Key
            3. 8.3.2.1.3. Update Fields
        3. 8.3.3. Type 2 Slowly Changing Dimensions
          1. 8.3.3.1. The "Dimension lookup / update" Step
            1. 8.3.3.1.1. Specifying the Mode of Operation
            2. 8.3.3.1.2. General Configuration
            3. 8.3.3.1.3. Keys Tab Page
            4. 8.3.3.1.4. Surrogate Key
            5. 8.3.3.1.5. History Maintenance
            6. 8.3.3.1.6. Lookup / Update Fields
        4. 8.3.4. Other Types of Slowly Changing Dimensions
          1. 8.3.4.1. Type 3 Slowly Changing Dimensions
          2. 8.3.4.2. Hybrid Slowly Changing Dimensions
      4. 8.4. More Dimensions
        1. 8.4.1. Generated Dimensions
          1. 8.4.1.1. Date and Time Dimensions
          2. 8.4.1.2. Generated Mini-Dimensions
        2. 8.4.2. Junk Dimensions
        3. 8.4.3. Recursive Hierarchies
      5. 8.5. Summary
    5. 9. Loading Fact Tables
      1. 9.1. Loading in Bulk
        1. 9.1.1. STDIN and FIFO
        2. 9.1.2. Kettle Bulk Loaders
          1. 9.1.2.1. MySQL Bulk Loading
          2. 9.1.2.2. LucidDB Bulk Loader
          3. 9.1.2.3. Oracle Bulk Loader
          4. 9.1.2.4. PostgreSQL Bulk Loader
          5. 9.1.2.5. Table Output Step
        3. 9.1.3. General Bulk Load Considerations
      2. 9.2. Dimension Lookups
        1. 9.2.1. Maintaining Referential Integrity
        2. 9.2.2. The Surrogate Key Pipeline
          1. 9.2.2.1. Using In-Memory Lookups
          2. 9.2.2.2. Stream Lookups
        3. 9.2.3. Late-Arriving Data
          1. 9.2.3.1. Late-Arriving Facts
          2. 9.2.3.2. Late-Arriving Dimensions
      3. 9.3. Fact Table Handling
        1. 9.3.1. Periodic and Accumulating Snapshots
        2. 9.3.2. Introducing State-Oriented Fact Tables
        3. 9.3.3. Loading Periodic Snapshots
        4. 9.3.4. Loading Accumulating Snapshots
        5. 9.3.5. Loading State-Oriented Fact Tables
        6. 9.3.6. Loading Aggregate Tables
      4. 9.4. Summary
    6. 10. Working with OLAP Data
      1. 10.1. OLAP Benefits and Challenges
        1. 10.1.1. OLAP Storage Types
        2. 10.1.2. Positioning OLAP
        3. 10.1.3. Kettle OLAP Options
      2. 10.2. Working with Mondrian
      3. 10.3. Working with XML/A Servers
      4. 10.4. Working with Palo
        1. 10.4.1. Setting Up the Palo Connection
        2. 10.4.2. Palo Architecture
        3. 10.4.3. Reading Palo Data
        4. 10.4.4. Writing Palo Data
      5. 10.5. Summary
  8. III. Management and Deployment
    1. 11. ETL Development Lifecycle
      1. 11.1. Solution Design
        1. 11.1.1. Best and Bad Practices
          1. 11.1.1.1. Data Mapping
          2. 11.1.1.2. Naming and Commentary Conventions
          3. 11.1.1.3. Common Pitfalls
        2. 11.1.2. ETL Flow Design
        3. 11.1.3. Reusability and Maintainability
      2. 11.2. Agile Development
      3. 11.3. Testing and Debugging
        1. 11.3.1. Test Activities
        2. 11.3.2. ETL Testing
          1. 11.3.2.1. Test Data Requirements
          2. 11.3.2.2. Testing for Completeness
          3. 11.3.2.3. Testing Data Transformations
          4. 11.3.2.4. Test Automation and Continuous Integration
          5. 11.3.2.5. Upgrade Tests
        3. 11.3.3. Debugging
      4. 11.4. Documenting the Solution
        1. 11.4.1. Why Isn't There Any Documentation?
          1. 11.4.1.1. Myth 1: My Software Is Self-Explanatory
          2. 11.4.1.2. Myth 2: Documentation Is Always Outdated
          3. 11.4.1.3. Myth 3: Who Reads Documentation Anyway?
        2. 11.4.2. Kettle Documentation Features
        3. 11.4.3. Generating Documentation
      5. 11.5. Summary
    2. 12. Scheduling and Monitoring
      1. 12.1. Scheduling
        1. 12.1.1. Operating System–Level Scheduling
          1. 12.1.1.1. Executing Kettle Jobs and Transformations from the Command Line
            1. 12.1.1.1.1. Command-Line Parameters
            2. 12.1.1.1.2. Running Jobs with Kitchen
            3. 12.1.1.1.3. Running Transformations with Pan
            4. 12.1.1.1.4. Using Custom Command-Line Parameters
          2. 12.1.1.2. UNIX-Based Systems: cron
          3. 12.1.1.3. Windows: The at utility and the Task Scheduler
        2. 12.1.2. Using Pentaho's Built-in Scheduler
          1. 12.1.2.1. Creating an Action Sequence to Run Kettle Jobs and Transformations
          2. 12.1.2.2. Kettle Transformations in Action Sequences
          3. 12.1.2.3. Creating and Maintaining Schedules with the Administration Console
          4. 12.1.2.4. Attaching an Action Sequence to a Schedule
      2. 12.2. Monitoring
        1. 12.2.1. Logging
          1. 12.2.1.1. Inspecting the Log
          2. 12.2.1.2. Logging Levels
          3. 12.2.1.3. Writing Custom Messages to the Log
        2. 12.2.2. E-mail Notifications
          1. 12.2.2.1. Configuring the Mail Job Entry
            1. 12.2.2.1.1. Addresses Tab
            2. 12.2.2.1.2. Server Tab
            3. 12.2.2.1.3. EMail Message Tab
            4. 12.2.2.1.4. Attached Files Tab
      3. 12.3. Summary
    3. 13. Versioning and Migration
      1. 13.1. Version Control Systems
        1. 13.1.1. File-Based Version Control Systems
          1. 13.1.1.1. Organization
          2. 13.1.1.2. Leading File-Based VCSs
        2. 13.1.2. Content Management Systems
      2. 13.2. Kettle Metadata
        1. 13.2.1. Kettle XML Metadata
          1. 13.2.1.1. Transformation XML
          2. 13.2.1.2. Job XML
          3. 13.2.1.3. Global Replace
        2. 13.2.2. Kettle Repository Metadata
          1. 13.2.2.1. The Kettle Database Repository Type
          2. 13.2.2.2. The Kettle File Repository Type
          3. 13.2.2.3. The Kettle Enterprise Repository Type
      3. 13.3. Managing Repositories
        1. 13.3.1. Exporting and Importing Repositories
        2. 13.3.2. Upgrading Your Repository
      4. 13.4. Version Migration System
        1. 13.4.1. Managing XML Files
        2. 13.4.2. Managing Repositories
        3. 13.4.3. Parameterizing Your Solution
      5. 13.5. Summary
    4. 14. Lineage and Auditing
      1. 14.1. Batch-Level Lineage Extraction
      2. 14.2. Lineage
        1. 14.2.1. Lineage Information
        2. 14.2.2. Impact Analysis Information
      3. 14.3. Logging and Operational Metadata
        1. 14.3.1. Logging Basics
        2. 14.3.2. Logging Architecture
          1. 14.3.2.1. Setting a Maximum Buffer Size
          2. 14.3.2.2. Setting a Maximum Log Line Age
          3. 14.3.2.3. Log Channels
          4. 14.3.2.4. Log Text Capturing in a Job
        3. 14.3.3. Logging Tables
          1. 14.3.3.1. Transformation Logging Tables
            1. 14.3.3.1.1. The Transformation Log Table
            2. 14.3.3.1.2. The Step Log Table
            3. 14.3.3.1.3. The Performance Log Table
            4. 14.3.3.1.4. The Logging Channels Log Table
          2. 14.3.3.2. Job Logging Tables
            1. 14.3.3.2.1. The Job Log Table
            2. 14.3.3.2.2. The Job Entry Log Table
      4. 14.4. Summary
  9. IV. Performance and Scalability
    1. 15. Performance Tuning
      1. 15.1. Transformation Performance: Finding the Weakest Link
        1. 15.1.1. Finding Bottlenecks by Simplifying
        2. 15.1.2. Finding Bottlenecks by Measuring
        3. 15.1.3. Copying Rows of Data
      2. 15.2. Improving Transformation Performance
        1. 15.2.1. Improving Performance in Reading Text Files
          1. 15.2.1.1. Using Lazy Conversion for Reading Text Files
          2. 15.2.1.2. Single-File Parallel Reading
          3. 15.2.1.3. Multi-File Parallel Reading
          4. 15.2.1.4. Configuring the NIO Block Size
          5. 15.2.1.5. Changing Disks and Reading Text Files
        2. 15.2.2. Improving Performance in Writing Text Files
          1. 15.2.2.1. Using Lazy Conversion for Writing Text Files
          2. 15.2.2.2. Parallel Files Writing
          3. 15.2.2.3. Changing Disks and Writing Text Files
        3. 15.2.3. Improving Database Performance
          1. 15.2.3.1. Avoiding Dynamic SQL
          2. 15.2.3.2. Handling Roundtrips
            1. 15.2.3.2.1. Reducing Network Latency
            2. 15.2.3.2.2. Network Speed
          3. 15.2.3.3. Handling Relational Databases
            1. 15.2.3.3.1. Commit Size
            2. 15.2.3.3.2. Indexes
            3. 15.2.3.3.3. Table Partitioning
            4. 15.2.3.3.4. Constraints
            5. 15.2.3.3.5. Triggers
        4. 15.2.4. Sorting Data
          1. 15.2.4.1. Sorting on the Database
          2. 15.2.4.2. Sorting in Parallel
        5. 15.2.5. Reducing CPU Usage
          1. 15.2.5.1. Optimizing the Use of JavaScript
          2. 15.2.5.2. Launching Multiple Copies of a Step
          3. 15.2.5.3. Selecting and Removing Values
          4. 15.2.5.4. Managing Thread Priorities
          5. 15.2.5.5. Adding Static Data to Rows of Data
          6. 15.2.5.6. Limiting the Number of Step Copies
          7. 15.2.5.7. Avoiding Excessive Logging
      3. 15.3. Improving Job Performance
        1. 15.3.1. Loops in Jobs
        2. 15.3.2. Database Connection Pools
      4. 15.4. Summary
    2. 16. Parallelization, Clustering, and Partitioning
      1. 16.1. Multi-Threading
        1. 16.1.1. Row Distribution
        2. 16.1.2. Row Merging
        3. 16.1.3. Row Redistribution
        4. 16.1.4. Data Pipelining
        5. 16.1.5. Consequences of Multi-Threading
          1. 16.1.5.1. Database Connections
          2. 16.1.5.2. Order of Execution
            1. 16.1.5.2.1. The Execute SQL Step
            2. 16.1.5.2.2. The Blocking Step
        6. 16.1.6. Parallel Execution in a Job
      2. 16.2. Using Carte as a Slave Server
        1. 16.2.1. The Configuration File
        2. 16.2.2. Defining Slave Servers
        3. 16.2.3. Remote Execution
        4. 16.2.4. Monitoring Slave Servers
        5. 16.2.5. Carte Security
        6. 16.2.6. Services
      3. 16.3. Clustering Transformations
        1. 16.3.1. Defining a Cluster Schema
        2. 16.3.2. Designing Clustered Transformations
        3. 16.3.3. Execution and Monitoring
        4. 16.3.4. Metadata Transformations
          1. 16.3.4.1. Rules
          2. 16.3.4.2. Data Pipelining
      4. 16.4. Partitioning
        1. 16.4.1. Defining a Partitioning Schema
        2. 16.4.2. Objectives of Partitioning
        3. 16.4.3. Implementing Partitioning
        4. 16.4.4. Internal Variables
        5. 16.4.5. Database Partitions
        6. 16.4.6. Partitioning in a Clustered Transformation
      5. 16.5. Summary
    3. 17. Dynamic Clustering in the Cloud
      1. 17.1. Dynamic Clustering
        1. 17.1.1. Setting Up a Dynamic Cluster
        2. 17.1.2. Using the Dynamic Cluster
      2. 17.2. Cloud Computing
      3. 17.3. EC2
        1. 17.3.1. Getting Started with EC2
        2. 17.3.2. Costs
        3. 17.3.3. Customizing an AMI
        4. 17.3.4. Packaging a New AMI
        5. 17.3.5. Terminating an AMI
        6. 17.3.6. Running a Master
        7. 17.3.7. Running the Slaves
        8. 17.3.8. Using the EC2 Cluster
        9. 17.3.9. Monitoring
        10. 17.3.10. The Lightweight Principle and Persistence Options
      4. 17.4. Summary
    4. 18. Real-Time Data Integration
      1. 18.1. Introduction to Real-Time ETL
        1. 18.1.1. Real-Time Challenges
        2. 18.1.2. Requirements
      2. 18.2. Transformation Streaming
        1. 18.2.1. A Practical Example of Transformation Streaming
        2. 18.2.2. Debugging
        3. 18.2.3. Third-Party Software and Real-Time Integration
        4. 18.2.4. Java Message Service
          1. 18.2.4.1. Creating a JMS Connection and Session
          2. 18.2.4.2. Consuming Messages
          3. 18.2.4.3. Producing Messages
          4. 18.2.4.4. Closing Shop
      3. 18.3. Summary
  10. V. Advanced Topics
    1. 19. Data Vault Management
      1. 19.1. Introduction to Data Vault Modeling
      2. 19.2. Do You Need a Data Vault?
      3. 19.3. Data Vault Building Blocks
        1. 19.3.1. Hubs
        2. 19.3.2. Links
        3. 19.3.3. Satellites
        4. 19.3.4. Data Vault Characteristics
        5. 19.3.5. Building a Data Vault
      4. 19.4. Transforming Sakila to the Data Vault Model
        1. 19.4.1. Sakila Hubs
        2. 19.4.2. Sakila Links
        3. 19.4.3. Sakila Satellites
      5. 19.5. Loading the Data Vault: A Sample ETL Solution
        1. 19.5.1. Installing the Sakila Data Vault
        2. 19.5.2. Setting Up the ETL Solution
        3. 19.5.3. Creating a Database Account
        4. 19.5.4. The Sample ETL Data Vault Solution
          1. 19.5.4.1. Sample Hub: hub_actor
          2. 19.5.4.2. Sample Link: link_customer_store
          3. 19.5.4.3. Sample Satellite: sat_actor
        5. 19.5.5. Loading the Data Vault Tables
      6. 19.6. Updating a Data Mart from a Data Vault
        1. 19.6.1. The Sample ETL Solution
        2. 19.6.2. The dim_actor Transformation
        3. 19.6.3. The dim_customer Transformation
        4. 19.6.4. The dim_film Transformation
        5. 19.6.5. The dim_film_actor_bridge Transformation
        6. 19.6.6. The fact_rental Transformation
        7. 19.6.7. Loading the Star Schema Tables
      7. 19.7. Summary
    2. 20. Handling Complex Data Formats
      1. 20.1. Non-Relational and Non-Tabular Data Formats
      2. 20.2. Non-Relational Tabular Formats
        1. 20.2.1. Handling Multi-Valued Attributes
          1. 20.2.1.1. Using the Split Field to Rows Step
        2. 20.2.2. Handling Repeating Groups
          1. 20.2.2.1. Using the Row Normaliser Step
      3. 20.3. Semi- and Unstructured Data
        1. 20.3.1. Kettle Regular Expression Example
          1. 20.3.1.1. Configuring the Regex Evaluation Step
            1. 20.3.1.1.1. Settings Tab Page
            2. 20.3.1.1.2. Fields Grid
            3. 20.3.1.1.3. Content Tab Page
          2. 20.3.1.2. Verifying the Match
      4. 20.4. Key/Value Pairs
        1. 20.4.1. Kettle Key/Value Pairs Example
          1. 20.4.1.1. Text File Input
          2. 20.4.1.2. Regex Evaluation
          3. 20.4.1.3. Grouping Lines into Records
          4. 20.4.1.4. Denormaliser: Turning Rows into Columns
      5. 20.5. Summary
    3. 21. Web Services
      1. 21.1. Web Pages and Web Services
        1. 21.1.1. Kettle Web Features
          1. 21.1.1.1. General HTTP Steps
          2. 21.1.1.2. Simple Object Access Protocol
          3. 21.1.1.3. Really Simple Syndication
          4. 21.1.1.4. Apache Virtual File System Integration
      2. 21.2. Data Formats
        1. 21.2.1. XML
          1. 21.2.1.1. Kettle Steps for Working with XML
          2. 21.2.1.2. Kettle Job Entries for XML
        2. 21.2.2. HTML
        3. 21.2.3. JavaScript Object Notation
          1. 21.2.3.1. Syntax
          2. 21.2.3.2. JSON, Kettle, and ETL/DI
      3. 21.3. XML Examples
        1. 21.3.1. Example XML Document
          1. 21.3.1.1. XML Document Structure
          2. 21.3.1.2. Mapping to the Sakila Sample Database
        2. 21.3.2. Extracting Data from XML
          1. 21.3.2.1. Overall Design: The import_xml_into_db Transformation
          2. 21.3.2.2. Using the XSD Validator Step
            1. 21.3.2.2.1. XML Source
            2. 21.3.2.2.2. Output Fields
            3. 21.3.2.2.3. XML Schema Definition
            4. 21.3.2.2.4. Error Handling
            5. 21.3.2.2.5. Checking the Validation Result
          3. 21.3.2.3. Using the "Get Data from XML" Step
            1. 21.3.2.3.1. The File Tab
            2. 21.3.2.3.2. The Content Tab
            3. 21.3.2.3.3. The Fields Tab
            4. 21.3.2.3.4. Using Tokens
        3. 21.3.3. Generating XML Documents
          1. 21.3.3.1. Overall Design: The export_xml_from_db Transformation
          2. 21.3.3.2. Generating XML with the Add XML Step
            1. 21.3.3.2.1. The Content Tab
            2. 21.3.3.2.2. The Fields Tab
          3. 21.3.3.3. Using the XML Join Step
            1. 21.3.3.3.1. Target
            2. 21.3.3.3.2. Source
            3. 21.3.3.3.3. Join Condition
            4. 21.3.3.3.4. Result Stream
            5. 21.3.3.3.5. Complex Join Conditions
      4. 21.4. SOAP Examples
        1. 21.4.1. Using the "Web services lookup" Step
          1. 21.4.1.1. Configuring the "Web services lookup" Step
            1. 21.4.1.1.1. The Web Services Tab
            2. 21.4.1.1.2. The "in" Tab
            3. 21.4.1.1.3. The Result Tab
        2. 21.4.2. Accessing SOAP Services Directly
      5. 21.5. JSON Example
        1. 21.5.1. The Freebase Project
          1. 21.5.1.1. Freebase Versus Wikipedia
          2. 21.5.1.2. Freebase Web Services
          3. 21.5.1.3. The Freebase Read Service
          4. 21.5.1.4. The Metaweb Query Language
        2. 21.5.2. Extracting Freebase Data with Kettle
          1. 21.5.2.1. Generate Rows
          2. 21.5.2.2. Issuing a Freebase Read Request
          3. 21.5.2.3. Processing the Freebase Result Envelope
          4. 21.5.2.4. Filtering Out the Original Row
          5. 21.5.2.5. Storing to File
      6. 21.6. RSS
        1. 21.6.1. RSS Structure
          1. 21.6.1.1. Channel
          2. 21.6.1.2. Item
        2. 21.6.2. RSS Support in Kettle
          1. 21.6.2.1. RSS Input
          2. 21.6.2.2. RSS Output
            1. 21.6.2.2.1. Source Data
            2. 21.6.2.2.2. Standard Feed
            3. 21.6.2.2.3. Custom Feed
      7. 21.7. Summary
    4. 22. Kettle Integration
      1. 22.1. The Kettle API
        1. 22.1.1. The LGPL License
        2. 22.1.2. The Kettle Java API
          1. 22.1.2.1. Source Code
          2. 22.1.2.2. Building Kettle
          3. 22.1.2.3. Building javadoc
          4. 22.1.2.4. Libraries and the Class Path
      2. 22.2. Executing Existing Transformations and Jobs
        1. 22.2.1. Executing a Transformation
        2. 22.2.2. Executing a Job
      3. 22.3. Embedding Kettle
        1. 22.3.1. Pentaho Reporting
        2. 22.3.2. Putting Data into a Transformation
          1. 22.3.2.1. Passing Data with a Result Object
          2. 22.3.2.2. Passing Data in a Streaming Fashion
          3. 22.3.2.3. Setting Parameters and Variables
        3. 22.3.3. Dynamic Transformations
        4. 22.3.4. Dynamic Template
        5. 22.3.5. Dynamic Jobs
        6. 22.3.6. Executing Dynamic ETL in Kettle
        7. 22.3.7. Result
        8. 22.3.8. Replacing Metadata
          1. 22.3.8.1. Direct Changes with the API
          2. 22.3.8.2. Using a Shared Objects File
      4. 22.4. OEM Versions and Forks
        1. 22.4.1. Creating an OEM Version of PDI
        2. 22.4.2. Forking Kettle
      5. 22.5. Summary
    5. 23. Extending Kettle
      1. 23.1. Plugin Architecture Overview
        1. 23.1.1. Plugin Types
        2. 23.1.2. Architecture
        3. 23.1.3. Prerequisites
          1. 23.1.3.1. Kettle API Documentation
          2. 23.1.3.2. Libraries
          3. 23.1.3.3. Integrated Development Environment
          4. 23.1.3.4. Eclipse Project Setup
          5. 23.1.3.5. Examples
      2. 23.2. Transformation Step Plugins
        1. 23.2.1. StepMetaInterface
          1. 23.2.1.1. Value Metadata
          2. 23.2.1.2. Row Metadata
        2. 23.2.2. StepDataInterface
        3. 23.2.3. StepDialogInterface
          1. 23.2.3.1. Eclipse SWT
          2. 23.2.3.2. Form Layout
          3. 23.2.3.3. Kettle UI Elements
          4. 23.2.3.4. Hello World Example Dialog
        4. 23.2.4. StepInterface
          1. 23.2.4.1. Reading Rows from Specific Steps
          2. 23.2.4.2. Writing Rows to Specific Steps
          3. 23.2.4.3. Writing Rows to Error Handling
          4. 23.2.4.4. Identifying a Step Copy
          5. 23.2.4.5. Result Feedback
          6. 23.2.4.6. Variable Substitution
          7. 23.2.4.7. Apache VFS
          8. 23.2.4.8. Step Plugin Deployment
      3. 23.3. The User-Defined Java Class Step
        1. 23.3.1. Passing Metadata
        2. 23.3.2. Accessing Input and Fields
        3. 23.3.3. Snippets
        4. 23.3.4. Example
      4. 23.4. Job Entry Plugins
        1. 23.4.1. JobEntryInterface
        2. 23.4.2. JobEntryDialogInterface
      5. 23.5. Partitioning Method Plugins
        1. 23.5.1. Partitioner
      6. 23.6. Repository Type Plugins
      7. 23.7. Database Type Plugins
      8. 23.8. Summary
  11. A. The Kettle Ecosystem
    1. A.1. Kettle Development and Versions
    2. A.2. The Pentaho Community Wiki
    3. A.3. Using the Forums
    4. A.4. Jira
    5. A.5. ##pentaho
  12. B. Kettle Enterprise Edition Features
  13. C. Built-in Variables and Properties Reference
    1. C.1. Internal Variables
    2. C.2. Kettle Variables
    3. C.3. Variables for Configuring VFS
    4. C.4. Noteworthy JRE Variables

Product information

  • Title: Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration
  • Author(s): Matt Casters, Roland Bouman, Jos van Dongen
  • Release date: September 2010
  • Publisher(s): Wiley
  • ISBN: 9780470635179