Book description
A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
- Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
- Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
- Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
- Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.
Table of contents
- Copyright
- About the Authors
- Credits
- Acknowledgments
- Introduction
- I. Getting Started
- 1. ETL Primer
- 2. Kettle Concepts
- 2.1. Design Principles
- 2.2. The Building Blocks of Kettle Design
- 2.3. Parameters and Variables
- 2.4. Visual Programming
- 2.5. Summary
- 3. Installation and Configuration
- 3.1. Kettle Software Overview
- 3.2. Installation
- 3.3. Configuration
- 3.4. Summary
- 4. An Example ETL Solution—Sakila
- 4.1. Sakila
- 4.2. Prerequisites and Some Basic Spoon Skills
- 4.3. The Sample ETL Solution
- 4.3.1. Static, Generated Dimensions
- 4.3.2. Recurring Load
- 4.3.2.1. The load_rentals Job
- 4.3.2.2. The load_dim_staff Transformation
- 4.3.2.3. Database Connections
- 4.3.2.4. The load_dim_customer Transformation
- 4.3.2.5. The load_dim_store Transformation
- 4.3.2.6. The fetch_address Subtransformation
- 4.3.2.7. The load_dim_actor Transformation
- 4.3.2.8. The load_dim_film Transformation
- 4.3.2.9. The load_fact_rental Transformation
- 4.4. Summary
- II. ETL
- 5. ETL Subsystems
- 5.1. Introduction to the 34 Subsystems
- 5.1.1. Extraction
- 5.1.2. Cleaning and Conforming Data
- 5.1.3. Data Delivery
- 5.1.3.1. Subsystem 9: Slowly Changing Dimension Processor
- 5.1.3.2. Subsystem 10: Surrogate Key Creation System
- 5.1.3.3. Subsystem 11: Hierarchy Dimension Builder
- 5.1.3.4. Subsystem 12: Special Dimension Builder
- 5.1.3.5. Subsystem 13: Fact Table Loader
- 5.1.3.6. Subsystem 14: Surrogate Key Pipeline
- 5.1.3.7. Subsystem 15: Multi-Valued Dimension Bridge Table Builder
- 5.1.3.8. Subsystem 16: Late-Arriving Data Handler
- 5.1.3.9. Subsystem 17: Dimension Manager System
- 5.1.3.10. Subsystem 18: Fact Table Provider System
- 5.1.3.11. Subsystem 19: Aggregate Builder
- 5.1.3.12. Subsystem 20: Multidimensional (OLAP) Cube Builder
- 5.1.3.13. Subsystem 21: Data Integration Manager
- 5.1.4. Managing the ETL Environment
- 5.2. Summary
- 6. Data Extraction
- 6.1. Kettle Data Extraction Overview
- 6.2. Working with ERP and CRM Systems
- 6.3. Data Profiling
- 6.3.1. Using eobjects.org DataCleaner
- 6.3.1.1. Adding Profile Tasks
- 6.3.1.2. Adding Database Connections
- 6.3.1.3. Doing an Initial Profile
- 6.3.1.4. Working with Regular Expressions
- 6.3.1.5. Profiling and Exploring Results
- 6.3.1.6. Validating and Comparing Data
- 6.3.1.7. Using a Dictionary for Column Dependency Checks
- 6.3.1.8. Alternative Solutions
- 6.3.1.9. Text Profiling with Kettle
- 6.4. CDC: Change Data Capture
- 6.5. Delivering Data
- 6.6. Summary
- 7. Cleansing and Conforming
- 8. Handling Dimension Tables
- 8.1. Managing Keys
- 8.2. Loading Dimension Tables
- 8.3. Slowly Changing Dimensions
- 8.4. More Dimensions
- 8.5. Summary
- 9. Loading Fact Tables
- 10. Working with OLAP Data
- III. Management and Deployment
- 11. ETL Development Lifecycle
- 12. Scheduling and Monitoring
- 13. Versioning and Migration
- 14. Lineage and Auditing
- IV. Performance and Scalability
- 15. Performance Tuning
- 15.1. Transformation Performance: Finding the Weakest Link
- 15.2. Improving Transformation Performance
- 15.3. Improving Job Performance
- 15.4. Summary
- 16. Parallelization, Clustering, and Partitioning
- 17. Dynamic Clustering in the Cloud
- 18. Real-Time Data Integration
- V. Advanced Topics
- 19. Data Vault Management
- 19.1. Introduction to Data Vault Modeling
- 19.2. Do You Need a Data Vault?
- 19.3. Data Vault Building Blocks
- 19.4. Transforming Sakila to the Data Vault Model
- 19.5. Loading the Data Vault: A Sample ETL Solution
- 19.6. Updating a Data Mart from a Data Vault
- 19.7. Summary
- 20. Handling Complex Data Formats
- 21. Web Services
- 21.1. Web Pages and Web Services
- 21.2. Data Formats
- 21.3. XML Examples
- 21.3.1. Example XML Document
- 21.3.2. Extracting Data from XML
- 21.3.3. Generating XML Documents
- 21.4. SOAP Examples
- 21.5. JSON Example
- 21.6. RSS
- 21.7. Summary
- 22. Kettle Integration
- 22.1. The Kettle API
- 22.2. Executing Existing Transformations and Jobs
- 22.3. Embedding Kettle
- 22.4. OEM Versions and Forks
- 22.5. Summary
- 23. Extending Kettle
- 23.1. Plugin Architecture Overview
- 23.2. Transformation Step Plugins
- 23.3. The User-Defined Java Class Step
- 23.4. Job Entry Plugins
- 23.5. Partitioning Method Plugins
- 23.6. Repository Type Plugins
- 23.7. Database Type Plugins
- 23.8. Summary
- A. The Kettle Ecosystem
- B. Kettle Enterprise Edition Features
- C. Built-in Variables and Properties Reference
Product information
- Title: Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration
- Author(s):
- Release date: September 2010
- Publisher(s): Wiley
- ISBN: 9780470635179