You are previewing Pentaho Data Integration Beginner's Guide.
O'Reilly logo
Pentaho Data Integration Beginner's Guide

Book Description

Extract, Transform, and Load (ETL) is the essence of data integration and this book shows you how to achieve it quickly and efficiently using Pentaho Data. A hands-on guide that you’ll find an indispensable time-saver.

  • Manipulate your data by exploring, transforming, validating, and integrating it

  • Learn to migrate data between applications

  • Explore several features of Pentaho Data Integration 5.0

  • Connect to any database engine, explore the databases, and perform all kind of operations on databases

  • In Detail

    Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work.

    Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing.

    "Pentaho Data Integration Beginner's Guide, Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration.

    "Pentaho Data Integration Beginner's Guide, Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart.

    With "Pentaho Data Integration Beginner's Guide, Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

    Table of Contents

    1. Pentaho Data Integration Beginner's Guide
      1. Table of Contents
      2. Pentaho Data Integration Beginner's Guide
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. How to read this book
        2. What this book covers
        3. What you need for this book
        4. Who this book is for
        5. Conventions
        6. Time for action – heading
          1. What just happened?
          2. Pop quiz – heading
          3. Have a go hero – heading
        7. Reader feedback
        8. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Getting Started with Pentaho Data Integration
        1. Pentaho Data Integration and Pentaho BI Suite
        2. Exploring the Pentaho Demo
          1. Pentaho Data Integration
          2. Using PDI in real-world scenarios
            1. Loading data warehouses or datamarts
            2. Integrating data
            3. Data cleansing
            4. Migrating information
            5. Exporting data
            6. Integrating PDI along with other Pentaho tools
          3. Pop quiz – PDI data sources
        3. Installing PDI
        4. Time for action – installing PDI
          1. What just happened?
          2. Pop quiz – PDI prerequisites
        5. Launching the PDI graphical designer – Spoon
        6. Time for action – starting and customizing Spoon
          1. What just happened?
          2. Spoon
            1. Setting preferences in the Options window
            2. Storing transformations and jobs in a repository
          3. Creating your first transformation
        7. Time for action – creating a hello world transformation
          1. What just happened?
            1. Directing Kettle engine with transformations
            2. Exploring the Spoon interface
            3. Designing a transformation
            4. Running and previewing the transformation
          2. Pop quiz – PDI basics
        8. Installing MySQL
        9. Time for action – installing MySQL on Windows
          1. What just happened?
        10. Time for action – installing MySQL on Ubuntu
          1. What just happened?
          2. Have a go hero – installing a visual software for administering and querying MySQL
        11. Summary
      9. 2. Getting Started with Transformations
        1. Designing and previewing transformations
        2. Time for action – creating a simple transformation and getting familiar with the design process
          1. What just happened?
          2. Getting familiar with editing features
            1. Using the mouseover assistance toolbar
            2. Working with grids
          3. Understanding the Kettle rowset
          4. Looking at the results in the Execution Results pane
            1. The Logging tab
            2. The Step Metrics tab
          5. Have a go hero – calculating the achieved percentage of work
          6. Have a go hero - calculating the achieved percentage of work (second version)
        3. Running transformations in an interactive fashion
        4. Time for action – generating a range of dates and inspecting the data as it is being created
          1. What just happened?
          2. Adding or modifying fields by using different PDI steps
          3. The Select values step
            1. Getting fields
            2. Date fields
          4. Pop quiz – generating data with PDI
          5. Have a go hero – experiencing different PDI steps
          6. Have a go hero – generating a rowset with dates
        5. Handling errors
        6. Time for action – avoiding errors while converting the estimated time from string to integer
          1. What just happened?
          2. The error handling functionality
        7. Time for action – configuring the error handling to see the description of the errors
          1. What just happened?
          2. Personalizing the error handling
          3. Have a go hero – trying out different ways of handling errors
        8. Summary
      10. 3. Manipulating Real-world Data
        1. Reading data from files
        2. Time for action – reading results of football matches from files
          1. What just happened?
          2. Input files
            1. Input steps
          3. Reading several files at once
        3. Time for action – reading all your files at a time using a single text file input step
          1. What just happened?
        4. Time for action – reading all your files at a time using a single text file input step and regular expressions
          1. What just happened?
            1. Regular expressions
          2. Troubleshooting reading files
          3. Have a go hero – exploring your own files
          4. Pop quiz – providing a list of text files using regular expressions
          5. Have a go hero – measuring the performance of input steps
        5. Sending data to files
        6. Time for action – sending the results of matches to a plain file
          1. What just happened?
          2. Output files
            1. Output steps
          3. Have a go hero – extending your transformations by writing output files
          4. Have a go hero – generate your custom matches.txt file
        7. Getting system information
        8. Time for action – reading and writing matches files with flexibility
          1. What just happened?
          2. The Get System Info step
          3. Running transformations from a terminal window
        9. Time for action – running the matches transformation from a terminal window
          1. What just happened?
          2. Have a go hero – finding out system information
        10. XML files
        11. Time for action – getting data from an XML file with information about countries
          1. What just happened?
          2. What is XML?
            1. PDI transformation files
          3. Getting data from XML files
            1. XPath
            2. Configuring the Get data from the XML step
          4. Kettle variables
            1. How and when you can use variables
          5. Have a go hero – exploring XML files
        12. Summary
      11. 4. Filtering, Searching, and Performing Other Useful Operations with Data
        1. Sorting data
        2. Time for action – sorting information about matches with the Sort rows step
          1. What just happened?
          2. Have a go hero – listing the last match played by each team
        3. Calculations on groups of rows
        4. Time for action – calculating football match statistics by grouping data
          1. What just happened?
          2. Group by Step
          3. Numeric fields
          4. Have a go hero – formatting 99.55
          5. Pop quiz — formatting output fields
          6. Have a go hero – listing the languages spoken by a country
        5. Filtering
        6. Time for action – counting frequent words by filtering
          1. What just happened?
        7. Time for action – refining the counting task by filtering even more
          1. What just happened?
          2. Filtering rows using the Filter rows step
          3. Have a go hero – playing with filters
        8. Looking up data
        9. Time for action – finding out which language people speak
          1. What just happened?
          2. The Stream lookup step
            1. Have a go hero – selecting the most popular of the official languages
            2. Have a go hero – counting words more precisely
          3. Data cleaning
          4. Time for action – fixing words before counting them
            1. What just happened?
            2. Cleansing data with PDI
            3. Have a go hero – counting words by cleaning them first
          5. Summary
      12. 5. Controlling the Flow of Data
        1. Splitting streams
        2. Time for action – browsing new features of PDI by copying a dataset
          1. What just happened?
          2. Copying rows
          3. Have a go hero – recalculating statistics
          4. Distributing rows
        3. Time for action – assigning tasks by distributing
          1. What just happened?
          2. Pop quiz – understanding the difference between copying and distributing
        4. Splitting the stream based on conditions
        5. Time for action – assigning tasks by filtering priorities with the Filter rows step
          1. What just happened?
          2. PDI steps for splitting the stream based on conditions
        6. Time for action – assigning tasks by filtering priorities with the Switch/Case step
          1. What just happened?
          2. Have a go hero – listing languages and countries
          3. Pop quiz – deciding between a Number range step and a Switch/Case step
        7. Merging streams
        8. Time for action – gathering progress and merging it all together
          1. What just happened?
          2. PDI options for merging streams
        9. Time for action – giving priority to Bouchard by using the Append Stream
          1. What just happened?
          2. Have a go hero – sorting and merging all tasks
        10. Treating invalid data by splitting and merging streams
        11. Time for action – treating errors in the estimated time to avoid discarding rows
          1. What just happened?
          2. Treating rows with invalid data
          3. Have a go hero – trying to find missing countries
        12. Summary
      13. 6. Transforming Your Data by Coding
        1. Doing simple tasks with the JavaScript step
        2. Time for action – counting frequent words by coding in JavaScript
          1. What just happened?
          2. Using the JavaScript language in PDI
          3. Inserting JavaScript code using the Modified JavaScript Value Step
            1. Adding fields
            2. Modifying fields
          4. Using transformation predefined constants
          5. Testing the script using the Test script button
        3. Reading and parsing unstructured files with JavaScript
        4. Time for action – changing a list of house descriptions with JavaScript
          1. What just happened?
          2. Looping over the dataset rows
          3. Have a go hero – enhancing the houses file
        5. Doing simple tasks with the Java Class step
        6. Time for action – counting frequent words by coding in Java
          1. What just happened?
          2. Using the Java language in PDI
          3. Inserting Java code using the User Defined Java Class step
            1. Adding fields
            2. Modifying fields
            3. Sending rows to the next step
            4. Data types equivalence
          4. Testing the Java Class using the Test class button
          5. Have a go hero – parameterizing the Java Class
        7. Transforming the dataset with Java
        8. Time for action – splitting the field to rows using Java
          1. What just happened?
        9. Avoiding coding by using purpose built steps
          1. Pop quiz – choosing a scripting language for coding inside a transformation
        10. Summary
      14. 7. Transforming the Rowset
        1. Converting rows to columns
        2. Time for action – enhancing the films file by converting rows to columns
          1. What just happened?
          2. Converting row data to column data by using the Row Denormaliser step
          3. Have a go hero – houses revisited
        3. Aggregating data with a Row Denormaliser step
        4. Time for action – aggregating football matches data with the Row Denormaliser step
          1. What just happened?
          2. Using Row Denormaliser for aggregating data
          3. Have a go hero – calculating statistics by team
        5. Normalizing data
        6. Time for action – enhancing the matches file by normalizing the dataset
          1. What just happened?
          2. Modifying the dataset with a Row Normaliser step
          3. Summarizing the PDI steps that operate on sets of rows
          4. Have a go hero – verifying the benefits of normalizing
          5. Have a go hero – normalizing the Films file
        7. Generating a custom time dimension dataset by using Kettle variables
        8. Time for action – creating the time dimension dataset
          1. What just happened?
          2. Getting variables
        9. Time for action – parameterizing the start and end date of the time dimension dataset
          1. What just happened?
            1. Using the Get Variables step
          2. Have a go hero – enhancing the time dimension
        10. Summary
      15. 8. Working with Databases
        1. Introducing the Steel Wheels sample database
          1. Connecting to the Steel Wheels database
        2. Time for action – creating a connection to the Steel Wheels database
          1. What just happened?
            1. Connecting with Relational Database Management Systems
          2. Pop quiz – connecting to a database in several transformations
          3. Have a go hero – connecting to your own databases
          4. Exploring the Steel Wheels database
        3. Time for action – exploring the sample database
          1. What just happened?
            1. A brief word about SQL
            2. Exploring any configured database with the database explorer
          2. Have a go hero – exploring the sample data in depth
          3. Have a go hero – exploring your own databases
        4. Querying a database
        5. Time for action – getting data about shipped orders
          1. What just happened?
          2. Getting data from the database with the Table input step
          3. Using the SELECT statement for generating a new dataset
            1. Making flexible queries using parameters
        6. Time for action – getting orders in a range of dates using parameters
          1. What just happened?
            1. Adding parameters to your queries
            2. Making flexible queries by using Kettle variables
        7. Time for action – getting orders in a range of dates by using Kettle variables
          1. What just happened?
            1. Using Kettle variables in your queries
          2. Pop quiz – interpreting data types coming from a database
          3. Have a go hero – querying the sample data
        8. Sending data to a database
        9. Time for action – loading a table with a list of manufacturers
          1. What just happened?
          2. Inserting new data into a database table with the Table output step
          3. Inserting or updating data by using other PDI steps
        10. Time for action – inserting new products or updating existing ones
          1. What just happened?
        11. Time for action – testing the update of existing products
          1. What just happened?
          2. Inserting or updating with the Insert/Update step
          3. Have a go hero – populating a films database
          4. Have a go hero – populating the products table
          5. Pop quiz – replacing an Insert/Update step with a Table Output step followed by an Update step
        12. Eliminating data from a database
        13. Time for action – deleting data about discontinued items
          1. What just happened?
          2. Deleting records of a database table with the Delete step
          3. Have a go hero – deleting old orders
          4. Have a go hero – creating the time dimension
        14. Summary
      16. 9. Performing Advanced Operations with Databases
        1. Preparing the environment
        2. Time for action – populating the Jigsaw database
          1. What just happened?
          2. Exploring the Jigsaw database model
        3. Looking up data in a database
          1. Doing simple lookups
        4. Time for action – using a Database lookup step to create a list of products to buy
          1. What just happened?
            1. Looking up values in a database with the Database lookup step
          2. Have a go hero – preparing the delivery of the products
          3. Have a go hero – refining the transformation
          4. Performing complex lookups
        5. Time for action – using a Database join step to create a list of suggested products to buy
          1. What just happened?
            1. Joining data from the database to the stream data by using a Database join step
          2. Have a go hero – rebuilding the list of customers
        6. Introducing dimensional modeling
        7. Loading dimensions with data
        8. Time for action – loading a region dimension with a Combination lookup/update step
          1. What just happened?
        9. Time for action – testing the transformation that loads the region dimension
          1. What just happened?
          2. Describing data with dimensions
            1. Loading Type I SCD with a Combination lookup/update step
          3. Have a go hero – adding regions to the Region dimension
          4. Have a go hero – loading the manufacturers dimension
          5. Storing a history of changes
        10. Time for action – keeping a history of changes in products by using the Dimension lookup/update step
          1. What just happened?
        11. Time for action – testing the transformation that keeps history of product changes
          1. What just happened?
            1. Keeping an entire history of data with a Type II slowly changing dimension
            2. Loading Type II SCDs with the Dimension lookup/update step
          2. Have a go hero – storing a history just for the theme of a product
          3. Have a go hero – loading the Regions dimension as a Type II SCD
          4. Pop quiz – implementing a Type III SCD in PDI
          5. Have a go hero – loading a mini dimension
        12. Summary
      17. 10. Creating Basic Task Flows
        1. Introducing PDI jobs
        2. Time for action – creating a folder with a Kettle job
          1. What just happened?
          2. Executing processes with PDI jobs
            1. Using Spoon to design and run jobs
          3. Pop quiz – defining PDI jobs
        3. Designing and running jobs
        4. Time for action – creating a simple job and getting familiar with the design process
          1. What just happened?
          2. Changing the flow of execution on the basis of conditions
          3. Looking at the results in the Execution results window
            1. The Logging tab
            2. The Job metrics tab
        5. Running transformations from jobs
        6. Time for action – generating a range of dates and inspecting how things are running
          1. What just happened?
          2. Using the Transformation job entry
          3. Have a go hero – loading the dimension tables
        7. Receiving arguments and parameters in a job
        8. Time for action – generating a hello world file by using arguments and parameters
          1. What just happened?
          2. Using named parameters in jobs
          3. Have a go hero – backing up your work
        9. Running jobs from a terminal window
        10. Time for action – executing the hello world job from a terminal window
          1. What just happened?
          2. Have a go hero – experiencing Kitchen
        11. Using named parameters and command-line arguments in transformations
        12. Time for action – calling the hello world transformation with fixed arguments and parameters
          1. What just happened?
          2. Have a go hero – saying hello again and again
          3. Have a go hero – loading the time dimension from a job
        13. Deciding between the use of a command-line argument and a named parameter
          1. Have a go hero – analyzing the use of arguments and named parameters
        14. Summary
      18. 11. Creating Advanced Transformations and Jobs
        1. Re-using part of your transformations
        2. Time for action – calculating statistics with the use of a subtransformations
          1. What just happened?
          2. Creating and using subtransformations
          3. Have a go hero – calculating statistics for all subjects
          4. Have a go hero – counting words more precisely (second version)
          5. Creating a job as a process flow
        3. Time for action – generating top average scores by copying and getting rows
          1. What just happened?
            1. Transferring data between transformations by using the copy/get rows mechanism
          2. Have a go hero – modifying the flow
        4. Iterating jobs and transformations
        5. Time for action – generating custom files by executing a transformation for every input row
          1. What just happened?
          2. Executing for each row
          3. Have a go hero – building lists of products to buy
        6. Enhancing your processes with the use of variables
        7. Time for action – generating custom messages by setting a variable with the name of the examination file
          1. What just happened?
          2. Setting variables inside a transformation
          3. Running a job inside another job with a Job job entry
            1. Understanding the scope of variables
          4. Have a go hero – processing several files at once
          5. Have a go hero – enhancing the jigsaw database update process
          6. Have a go hero – executing the proper jigsaw database update process
          7. Pop quiz – deciding the scope of variables
        8. Summary
      19. 12. Developing and Implementing a Simple Datamart
        1. Exploring the sales datamart
          1. Deciding the level of granularity
        2. Loading the dimensions
        3. Time for action – loading the dimensions for the sales datamart
          1. What just happened?
        4. Extending the sales datamart model
          1. Have a go hero – loading the dimensions for the puzzle star model
        5. Loading a fact table with aggregated data
        6. Time for action – loading the sales fact table by looking up dimensions
          1. What just happened?
          2. Getting the information from the source with SQL queries
          3. Translating the business keys into surrogate keys
            1. Obtaining the surrogate key for Type I SCD
            2. Obtaining the surrogate key for Type II SCD
            3. Obtaining the surrogate key for the Junk dimension
            4. Obtaining the surrogate key for the Time dimension
          4. Pop quiz – creating a product type dimension
          5. Have a go hero – loading a puzzles fact table
        7. Getting facts and dimensions together
        8. Time for action – loading the fact table using a range of dates obtained from the command line
          1. What just happened?
        9. Time for action – loading the SALES star
          1. What just happened?
          2. Have a go hero – enhancing the loading process of the sales fact table
          3. Have a go hero – loading the puzzle sales star
          4. Have a go hero – loading the facts once a month
        10. Automating the administrative tasks
        11. Time for action – automating the loading of the sales datamart
          1. What just happened?
          2. Have a go hero – creating a backup of your work automatically
          3. Have a go hero – enhancing the automation process by sending an email if an error occurs
        12. Summary
      20. A. Working with Repositories
        1. Creating a database repository
        2. Time for action – creating a PDI repository
          1. What just happened?
          2. Creating a database repository to store your transformations and jobs
        3. Working with the repository storage system
        4. Time for action – logging into a database repository
          1. What just happened?
          2. Logging into a database repository using credentials
          3. Creating transformations and jobs in repository folders
          4. Creating database connections, users, servers, partitions, and clusters
          5. Designing jobs and transformations
          6. Backing up and restoring a repository
        5. Examining and modifying the contents of a repository with the Repository Explorer
        6. Migrating from file-based system to repository-based system and vice versa
        7. Summary
      21. B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line
        1. Running transformations and jobs stored in files
        2. Running transformations and jobs from a repository
          1. Specifying command-line options
        3. Kettle variables and the Kettle home directory
        4. Checking the exit code
        5. Providing options when running Pan and Kitchen
        6. Summary
      22. C. Quick Reference – Steps and Job Entries
        1. Transformation steps
        2. Job entries
        3. Summary
      23. D. Spoon Shortcuts
        1. General shortcuts
        2. Designing transformations and jobs
        3. Grids
        4. Repositories
        5. Database wizards
        6. Summary
      24. E. Introducing PDI 5 Features
        1. Welcome page
        2. Usability
        3. Solutions to commonly occurring situations
        4. Backend
        5. Summary
      25. F. Best Practices
        1. Summary
      26. G. Pop Quiz Answers
        1. Chapter 1, Getting Started with Pentaho Data Integration
          1. Pop quiz – PDI data sources
          2. Pop quiz – PDI prerequisites
          3. Pop quiz – PDI basics
        2. Chapter 2, Getting Started with Transformations
          1. Pop quiz – generating data with PDI
        3. Chapter 3, Manipulating Real-world Data
          1. Pop quiz – providing a list of text files using regular expressions
        4. Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data
          1. Pop quiz – formatting output fields
        5. Chapter 5, Controlling the Flow of Data
          1. Pop quiz – deciding between a Number range step and a Switch/Case step
          2. Pop quiz – understanding the difference between copying and distributing
        6. Chapter 6, Transforming Your Data by Coding
          1. Pop quiz – choosing a scripting language for coding inside a transformation
        7. Chapter 8, Working with Databases
          1. Pop quiz – connecting to a database in several transformations
          2. Pop quiz – interpreting data types coming from a database
          3. Pop quiz – interpreting data types coming from a database
        8. Chapter 9, Performing Advanced Operations with Databases
          1. Pop quiz – implementing a Type III SCD in PDI
        9. Chapter 10, Creating Basic Task Flows
          1. Pop quiz – defining PDI jobs
        10. Chapter 11, Creating Advanced Transformations and Jobs
          1. Pop quiz – deciding the scope of variables
        11. Chapter 12, Developing and Implementing a Simple Datamart
          1. Pop quiz – creating a product type dimension
      27. Index