You are previewing Pentaho 3.2 Data Integration Beginner's Guide.
O'Reilly logo
Pentaho 3.2 Data Integration Beginner's Guide

Book Description

Explore, transform, validate, and integrate your data with ease

  • Get started with Pentaho Data Integration from scratch.

  • Enrich your data transformation operations by embedding Java and JavaScript code in PDI transformations.

  • Create a simple but complete Datamart Project that will cover all key features of PDI.

  • Part of Packt's Beginner's Guide series: Each task gradually develops from the previous task with an organized sequence of instructions accompanied with relevant explanation and a set of challenging tasks that the reader will be able to accomplish.

  • In Detail

    Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

    This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

    Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

    Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

    A practical, easy-to-read guide that gives you full understanding of the Pentaho Data Integration tool and shows you how to use it to your advantage to manipulate data

    Table of Contents

    1. Pentaho 3.2 Data Integration Beginner's Guide
      1. Pentaho 3.2 Data Integration
      2. Credits
      3. Foreword
      4. The Kettle Project
      5. About the Author
      6. About the Reviewers
      7. Preface
        1. How to read this book
        2. What this book covers
        3. What you need for this book
        4. Who this book is for
        5. Conventions
        6. Reader feedback
        7. Customer support
          1. Errata
          2. Piracy
          3. Questions
      8. 1. Getting Started with Pentaho Data Integration
        1. Pentaho Data Integration and Pentaho BI Suite
          1. Exploring the Pentaho Demo
        2. Pentaho Data Integration
          1. Using PDI in real world scenarios
            1. Loading datawarehouses or datamarts
            2. Integrating data
            3. Data cleansing
            4. Migrating information
            5. Exporting data
            6. Integrating PDI using Pentaho BI
          2. Pop quiz—PDI data sources
        3. Installing PDI
        4. Time for action—installing PDI
          1. What just happened?
          2. Pop quiz—PDI prerequisites
        5. Launching the PDI graphical designer: Spoon
        6. Time for action—starting and customizing Spoon
          1. What just happened?
          2. Spoon
            1. Setting preferences in the Options window
            2. Storing transformations and jobs in a repository
          3. Creating your first transformation
        7. Time for action—creating a hello world transformation
          1. What just happened?
            1. Directing the Kettle engine with transformations
            2. Exploring the Spoon interface
              1. Viewing the transformation structure
            3. Running and previewing the transformation
        8. Time for action—running and previewing the hello_world transformation
          1. What just happened?
              1. Previewing the results in the Execution Results window
          2. Pop quiz—PDI basics
        9. Installing MySQL
        10. Time for action—installing MySQL on Windows
          1. What just happened?
        11. Time for action—installing MySQL on Ubuntu
          1. What just happened?
        12. Summary
      9. 2. Getting Started with Transformations
        1. Reading data from files
        2. Time for action—reading results of football matches from files
          1. What just happened?
          2. Input files
            1. Input steps
          3. Reading several files at once
        3. Time for action—reading all your files at a time using a single Text file input step
          1. What just happened?
        4. Time for action reading all your files at a time using a single Text file input step and regular expressions
          1. What just happened?
            1. Regular expressions
          2. Troubleshooting reading files
          3. Grids
          4. Have a go hero—explore your own files
        5. Sending data to files
        6. Time for action—sending the results of matches to a plain file
          1. What just happened?
          2. Output files
            1. Output steps
          3. Some data definitions
            1. Rowset
            2. Streams
          4. The Select values step
          5. Have a go hero—extending your transformations by writing output files
        7. Getting system information
        8. Time for action—updating a file with news about examinations
          1. What just happened?
          2. Getting information by using Get System Info step
          3. Data types
            1. Date fields
            2. Numeric fields
          4. Running transformations from a terminal window
        9. Time for action—running the examination transformation from a terminal window
          1. What just happened?
          2. Have a go hero—using different date formats
          3. Go for a hero formatting 99.55
          4. Pop quiz—formatting data
        10. XML files
        11. Time for action—getting data from an XML file with information about countries
          1. What just happened?
          2. What is XML
            1. PDI transformation files
          3. Getting data from XML files
            1. XPath
            2. Configuring the Get data from XML step
          4. Kettle variables
            1. How and when you can use variables
          5. Have a go hero—exploring XML files
          6. Have a go hero—enhancing the output countries file
          7. Have a go hero—documenting your work
        12. Summary
      10. 3. Basic Data Manipulation
        1. Basic calculations
        2. Time for action—reviewing examinations by using the Calculator step
          1. What just happened?
          2. Adding or modifying fields by using different PDI steps
            1. The Calculator step
            2. The Formula step
        3. Time for action—reviewing examinations by using the Formula step
          1. What just happened?
          2. Have a go hero—listing students and their examinations results
          3. Pop quiz—concatenating strings
        4. Calculations on groups of rows
        5. Time for action—calculating World Cup statistics by grouping data
          1. What just happened?
          2. Group by step
          3. Have a go hero—calculating statistics for the examinations
          4. Have a go hero—listing the languages spoken by country
        6. Filtering
        7. Time for action—counting frequent words by filtering
          1. What just happened?
          2. Filtering rows using the Filter rows step
          3. Have a go hero—playing with filters
          4. Have a go hero—counting words and discarding those that are commonly used
        8. Looking up data
        9. Time for action—finding out which language people speak
          1. What just happened?
          2. The Stream lookup step
          3. Have a go hero—counting words more precisely
        10. Summary
      11. 4. Controlling the Flow of Data
        1. Splitting streams
        2. Time for action—browsing new PDI features by copying a dataset
          1. What just happened?
          2. Copying rows
          3. Have a go hero—recalculating statistics
          4. Distributing rows
        3. Time for action—assigning tasks by distributing
          1. What just happened?
          2. Pop quiz—data movement (copying and distributing)
        4. Splitting the stream based on conditions
        5. Time for action - assigning tasks by filtering priorities with the Filter rows step
          1. What just happened?
          2. PDI steps for splitting the stream based on conditions
        6. Time for action—assigning tasks by filtering priorities with the Switch/ Case step
          1. What just happened?
          2. Have a go hero—listing languages and countries
          3. Pop quiz—splitting a stream
        7. Merging streams
        8. Time for action—gathering progress and merging all together
          1. What just happened?
          2. PDI options for merging streams
        9. Time for action—giving priority to Bouchard by using Append Stream
          1. What just happened?
          2. Have a go hero—sorting and merging all tasks
          3. Have a go hero—trying to find missing countries
        10. Summary
      12. 5. Transforming Your Data with JavaScript Code and the JavaScript Step
        1. Doing simple tasks with the JavaScript step
        2. Time for action—calculating scores with JavaScript
          1. What just happened?
          2. Using the JavaScript language in PDI
          3. Inserting JavaScript code using the Modified Java Script Value step
            1. Adding fields
            2. Modifying fields
            3. Turning on the compatibility switch
          4. Have a go hero—adding and modifying fields to the contest data
          5. Testing your code
        3. Time for action—testing the calculation of averages
          1. What just happened?
            1. Testing the script using the Test script button
          2. Have a go hero—testing the new calculation of the average
        4. Enriching the code
        5. Time for action calculating flexible scores by using variables
          1. What just happened?
          2. Using named parameters
          3. Using the special Start, Main, and End scripts
          4. Using transformation predefined constants
          5. Pop quiz—finding the 7 errors
          6. Have a go hero—keeping the top 10 performances
          7. Have a go hero—calculating scores with Java code
        6. Reading and parsing unstructured files
        7. Time for action—changing a list of house descriptions with JavaScript
          1. What just happened?
          2. Looking at previous rows
          3. Have a go hero—enhancing the houses file
          4. Have a go hero—fill gaps in the contest file
        8. Avoiding coding by using purpose-built steps
          1. Have a go hero—creating alternative solutions
        9. Summary
      13. 6. Transforming the Row Set
        1. Converting rows to columns
        2. Time for action—enhancing a films file by converting rows to columns
          1. What just happened?
          2. Converting row data to column data by using the Row denormalizer step
          3. Have a go hero—houses revisited
          4. Aggregating data with a Row denormalizer step
        3. Time for action—calculating total scores by performances by country
          1. What just happened?
          2. Using Row denormalizer for aggregating data
          3. Have a go hero—calculating scores by skill by continent
        4. Normalizing data
        5. Time for action—enhancing the matches file by normalizing the dataset
          1. What just happened?
          2. Modifying the dataset with a Row Normalizer step
          3. Summarizing the PDI steps that operate on sets of rows
          4. Have a go hero—verifying the benefits of normalization
          5. Have a go hero—normalizing the Films file
          6. Have a go hero—calculating scores by judge
        6. Generating a custom time dimension dataset by using Kettle variables
        7. Time for action—creating the time dimension dataset
          1. What just happened?
          2. Getting variables
        8. Time for action—getting variables for setting the default starting date
          1. What just happened?
            1. Using the Get Variables step
          2. Have a go hero—enhancing the time dimension
          3. Pop quiz—using Kettle variables inside transformations
        9. Summary
      14. 7. Validating Data and Handling Errors
        1. Capturing errors
        2. Time for action—capturing errors while calculating the age of a film
          1. What just happened?
          2. Using PDI error handling functionality
          3. Aborting a transformation
        3. Time for action—aborting when there are too many errors
          1. What just happened?
            1. Aborting a transformation using the Abort step
          2. Fixing captured errors
        4. Time for action—treating errors that may appear
          1. What just happened?
            1. Treating rows coming to the error stream
          2. Pop quiz—PDI error handling
          3. Have a go hero—capturing errors while seeing who wins
        5. Avoiding unexpected errors by validating data
        6. Time for action validating genres with a Regex Evaluation step
          1. What just happened?
          2. Validating data
        7. Time for action—checking films file with the Data Validator
          1. What just happened?
            1. Defining simple validation rules using the Data Validator
          2. Have a go hero—validating the football matches file
          3. Cleansing data
          4. Have a go hero—cleansing films data
        8. Summary
      15. 8. Working with Databases
        1. Introducing the Steel Wheels sample database
          1. Connecting to the Steel Wheels database
        2. Time for action—creating a connection with the Steel Wheels database
          1. What just happened?
            1. Connecting with Relational Database Management Systems
          2. Pop quiz—defining database connections
          3. Have a go hero—connecting to your own databases
          4. Exploring the Steel Wheels database
        3. Time for action—exploring the sample database
          1. What just happened?
            1. A brief word about SQL
            2. Exploring any configured database with the PDI Database explorer
          2. Have a go hero—exploring the sample data in depth
          3. Have a go hero—exploring your own databases
        4. Querying a database
        5. Time for action—getting data about shipped orders
          1. What just happened?
          2. Getting data from the database with the Table input step
          3. Using the SELECT statement for generating a new dataset
            1. Making flexible queries by using parameters
        6. Time for action—getting orders in a range of dates by using parameters
          1. What just happened?
              1. Adding parameters to your queries
            1. Making flexible queries by using Kettle variables
        7. Time for action—getting orders in a range of dates by using variables
          1. What just happened?
              1. Using Kettle variables in your queries
          2. Pop quiz—database datatypes versus PDI datatypes
          3. Have a go hero—querying the sample data
        8. Sending data to a database
        9. Time for action—loading a table with a list of manufacturers
          1. What just happened?
          2. Inserting new data into a database table with the Table output step
          3. Inserting or updating data by using other PDI steps
        10. Time for action—inserting new products or updating existent ones
          1. What just happened?
        11. Time for action—testing the update of existing products
          1. What just happened?
            1. Inserting or updating data with the Insert/Update step
          2. Have a go hero—populating a films database
          3. Have a go hero—creating the time dimension
          4. Have a go hero—populating the products table
          5. Pop quiz—Insert/Update step versus Table Output/Update steps
          6. Pop quiz—filtering the first 10 rows
        12. Eliminating data from a database
        13. Time for action—deleting data about discontinued items
          1. What just happened?
          2. Deleting records of a database table with the Delete step
          3. Have a go hero—deleting old orders
        14. Summary
      16. 9. Performing Advanced Operations with Databases
        1. Preparing the environment
        2. Time for action—populating the Jigsaw database
          1. What just happened?
          2. Exploring the Jigsaw database model
        3. Looking up data in a database
          1. Doing simple lookups
        4. Time for action—using a Database lookup step to create a list of products to buy
          1. What just happened?
            1. Looking up values in a database with the Database lookup step
          2. Have a go hero—preparing the delivery of the products
          3. Have a go hero—refining the transformation
          4. Doing complex lookups
        5. Time for action using a Database join step to create a list of suggested products to buy
          1. What just happened?
            1. Joining data from the database to the stream data by using a Database join step
          2. Have a go hero—rebuilding the list of customers
        6. Introducing dimensional modeling
        7. Loading dimensions with data
        8. Time for action loading a region dimension with a Combination lookup/update step
          1. What just happened?
        9. Time for action—testing the transformation that loads the region dimension
          1. What just happened?
          2. Describing data with dimensions
            1. Loading Type I SCD with a Combination lookup/update step
          3. Have a go hero—adding regions to the Region Dimension
          4. Have a go hero—loading the manufacturers dimension
          5. Have a go hero—loading a mini-dimension
          6. Keeping a history of changes
        10. Time for action—keeping a history of product changes with the Dimension lookup/update step
          1. What just happened?
        11. Time for action—testing the transformation that keeps a history of product changes
          1. What just happened?
            1. Keeping an entire history of data with a Type II slowly changing dimension
          2. Loading Type II SCDs with the Dimension lookup/update step
          3. Have a go hero—keeping a history just for the theme of a product
          4. Have a go hero—loading a Type II SCD dimension
          5. Pop quiz—loading slowly changing dimensions
          6. Pop quiz—loading type III slowly changing dimensions
        12. Summary
      17. 10. Creating Basic Task Flows
        1. Introducing PDI jobs
        2. Time for action—creating a simple hello world job
          1. What just happened?
          2. Executing processes with PDI jobs
            1. Using Spoon to design and run jobs
          3. Using the transformation job entry
          4. Pop quiz—defining PDI jobs
          5. Have a go hero—loading the dimension tables
        3. Receiving arguments and parameters in a job
        4. Time for action—customizing the hello world file with arguments and parameters
          1. What just happened?
          2. Using named parameters in jobs
          3. Have a go hero—backing up your work
        5. Running jobs from a terminal window
        6. Time for action—executing the hello world job from a terminal window
          1. What just happened?
          2. Have a go hero—experiencing Kitchen
        7. Using named parameters and command-line arguments in transformations
        8. Time for action—calling the hello world transformation with fixed arguments and parameters
          1. What just happened?
          2. Have a go hero—saying hello again and again
          3. Have a go hero—loading the time dimension from a job
        9. Deciding between the use of a command-line argument and a named parameter
          1. Have a go hero—analysing the use of arguments and named parameters
        10. Running job entries under conditions
        11. Time for action—sending a sales report and warning the administrator if something is wrong
          1. What just happened?
          2. Changing the flow of execution on the basis of conditions
          3. Have a go hero—refining the sales report
          4. Creating and using a file results list
          5. Have a go hero—sharing your work
        12. Summary
      18. 11. Creating Advanced Transformations and Jobs
        1. Enhancing your processes with the use of variables
        2. Time for action—updating a file with news about examinations by setting a variable with the name of the file
          1. What just happened?
          2. Setting variables inside a transformation
          3. Have a go hero—enhancing the examination tutorial even more
          4. Have a go hero—enhancing the jigsaw database update process
          5. Have a go hero—executing the proper jigsaw database update process
        3. Enhancing the design of your processes
        4. Time for action—generating files with top scores
          1. What just happened?
          2. Pop quiz—using the Add Sequence step
          3. Reusing part of your transformations
        5. Time for action—calculating the top scores with a subtransformation
          1. What just happened?
            1. Creating and using subtransformations
          2. Have a go hero—refining the subtransformation
          3. Have a go hero—counting words more precisely (second version)
          4. Creating a job as a process flow
        6. Time for action—splitting the generation of top scores by copying and getting rows
          1. What just happened?
            1. Transferring data between transformations by using the copy /get rows mechanism
          2. Have a go hero—modifying the flow
          3. Nesting jobs
        7. Time for action—generating the files with top scores by nesting jobs
          1. What just happened?
            1. Running a job inside another job with a job entry
            2. Understanding the scope of variables
          2. Pop quiz—deciding the scope of variables
        8. Iterating jobs and transformations
        9. Time for action—generating custom files by executing a transformation for every input row
          1. What just happened?
          2. Executing for each row
          3. Have a go hero—processing several files at once
          4. Have a go hero—building lists of products to buy
          5. Have a go hero—e-mail students to let them know how they did
        10. Summary
      19. 12. Developing and Implementing a Simple Datamart
        1. Exploring the sales datamart
          1. Deciding the level of granularity
        2. Loading the dimensions
        3. Time for action—loading dimensions for the sales datamart
          1. What just happened?
        4. Extending the sales datamart model
          1. Have a go hero—loading the dimensions for the puzzles star model
        5. Loading a fact table with aggregated data
        6. Time for action—loading the sales fact table by looking up dimensions
          1. What just happened?
          2. Getting the information from the source with SQL queries
          3. Translating the business keys into surrogate keys
            1. Obtaining the surrogate key for a Type I SCD
            2. Obtaining the surrogate key for a Type II SCD
            3. Obtaining the surrogate key for the Junk dimension
            4. Obtaining the surrogate key for the Time dimension
          4. Pop quiz—modifying a star model and loading the star with PDI
          5. Have a go hero—loading a puzzles fact table
        7. Getting facts and dimensions together
        8. Time for action—loading the fact table using a range of dates obtained from the command line
          1. What just happened?
        9. Time for action—loading the sales star
          1. What just happened?
          2. Have a go hero—enhancing the loading process of the sales fact table
          3. Have a go hero—loading the puzzles sales star
          4. Have a go hero—loading the facts once a month
        10. Getting rid of administrative tasks
        11. Time for action—automating the loading of the sales datamart
          1. What just happened?
          2. Have a go hero—Creating a back up of your work automatically
          3. Have a go hero—enhancing the automate process by sending an e-mail if an error occurs
        12. Summary
      20. 13. Taking it Further
        1. PDI best practices
        2. Getting the most out of PDI
          1. Extending Kettle with plugins
          2. Have a go hero—listing the top 10 students by using the Head plugin step
          3. Overcoming real world risks with some remote execution
          4. Scaling out to overcome bigger risks
          5. Pop quiz—remote execution and clustering
        3. Integrating PDI and the Pentaho BI suite
          1. PDI as a process action
          2. PDI as a datasource
          3. More about the Pentaho suite
        4. PDI Enterprise Edition and Kettle Developer Support
        5. Summary
      21. A. Working with Repositories
        1. Creating a repository
        2. Time for action—creating a PDI repository
          1. What just happened?
          2. Creating repositories to store your transformations and jobs
        3. Working with the repository storage system
        4. Time for action—logging into a repository
          1. What just happened?
          2. Logging into a repository by using credentials
            1. Defining repository user accounts
          3. Creating transformations and jobs in repository folders
          4. Creating database connections, partitions, servers, and clusters
          5. Backing up and restoring a repository
        5. Examining and modifying the contents of a repository with the Repository explorer
        6. Migrating from a file-based system to a repository-based system and vice-versa
        7. Summary
      22. B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
        1. Running transformations and jobs stored in files
        2. Running transformations and jobs from a repository
          1. Specifying command line options
        3. Checking the exit code
        4. Providing options when running Pan and Kitchen
          1. Log details
          2. Named parameters
          3. Arguments
          4. Variables
      23. C. Quick Reference: Steps and Job Entries
        1. Transformation steps
        2. Job entries
      24. D. Spoon Shortcuts
        1. General shortcuts
        2. Designing transformations and jobs
        3. Grids
        4. Repositories
      25. E. Introducing PDI 4 Features
        1. Agile BI
        2. Visual improvements for designing transformations and jobs
          1. Experiencing the mouse-over assistance
        3. Time for action—creating a hop with the mouse-over assistance
          1. What just happened?
            1. Using the mouse-over assistance toolbar
          2. Experiencing the sniff-testing feature
          3. Experiencing the job drill-down feature
          4. Experiencing even more visual changes
        4. Enterprise features
        5. Summary
      26. F. Pop quiz—Answers
        1. Chapter 1
          1. PDI data sources
          2. PDI prerequisites PDI basics
        2. Chapter 2
          1. formatting data
        3. Chapter 3
          1. concatenating strings
        4. Chapter 4
          1. data movement (copying and distributing)
          2. splitting a stream
        5. Chapter 5
          1. finding the seven errors
        6. Chapter 6
          1. using Kettle variables inside transformations
        7. Chapter 7
          1. PDI error handling
        8. Chapter 8
          1. defining database connections
          2. database datatypes versus PDI datatypes
          3. Insert/Update step versus Table Output/Update steps
          4. filtering the first 10 rows
        9. Chapter 9
          1. loading slowly changing dimensions
          2. loading type III slowly changing dimensions
        10. Chapter 10
          1. defining PDI jobs
        11. Chapter 11
          1. using the Add sequence step
          2. deciding the scope of variables
        12. Chapter 12
          1. modifying a star model and loading the star with PDI
        13. Chapter 13
          1. remote execution and clustering