Learning Pentaho Data Integration 8 CE - Third Edition

Book Description

Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide

About This Book

  • Manipulate your data by exploring, transforming, validating, and integrating it using Pentaho Data Integration 8 CE
  • A comprehensive guide exploring the features of Pentaho Data Integration 8 CE
  • Connect to any database engine, explore the databases, and perform all kinds of operations on relational databases

Who This Book Is For

This book is a must-have for software developers, business intelligence analysts, IT students, and anyone involved or interested in developing ETL solutions. If you plan on using Pentaho Data Integration for any data manipulation task, this book will help you as well. It is also a good starting point for data warehouse designers, architects, and anyone responsible for data warehouse projects who needs to load data into them.

What You Will Learn

  • Explore the features and capabilities of Pentaho Data Integration 8 Community Edition
  • Install and get started with PDI
  • Learn the ins and outs of Spoon, the graphical designer tool
  • Learn to get data from all kinds of data sources, such as plain files, Excel spreadsheets, databases, and XML files
  • Use Pentaho Data Integration to perform CRUD (create, read, update, and delete) operations on relational databases
  • Populate a data mart with Pentaho Data Integration
  • Use Pentaho Data Integration to organize files and folders, run daily processes, deal with errors, and more

In Detail

Pentaho Data Integration (PDI) is an intuitive, graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. This book shows and explains the new interactive features of Spoon, the revamped look and feel, and the newest features of the tool, including Transformation and Job Executors and the invaluable Metadata Injection capability.

We begin with the installation of the PDI software and then move on to cover all the key PDI concepts. Each chapter introduces new features, enabling you to gradually gain practice with the tool. First, you will learn to do all kinds of data manipulation and work with simple plain files. Then, the book teaches you how to work with relational databases inside PDI. Moreover, you will be given a primer on data warehouse concepts and learn how to load data into a data warehouse. Throughout the book, you will become familiar with its intuitive, graphical, drag-and-drop design environment.

By the end of this book, you will know everything you need in order to meet your data manipulation requirements. In addition, you will be given best practices and advice for designing and deploying your projects.

Style and approach

A step-by-step guide filled with practical, real-world scenarios and examples.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Getting Started with Pentaho Data Integration
    1. Pentaho Data Integration and Pentaho BI Suite
      1. Introducing Pentaho Data Integration
      2. Using PDI in real-world scenarios
        1. Loading data warehouses or data marts
        2. Integrating data
        3. Data cleansing
        4. Migrating information
        5. Exporting data
      3. Integrating PDI along with other Pentaho tools
    2. Installing PDI
    3. Launching the PDI Graphical Designer - Spoon
      1. Starting and customizing Spoon
        1. Exploring the Spoon interface
      2. Extending the PDI functionality through the Marketplace
    4. Introducing transformations
      1. The basics about transformations
      2. Creating a Hello World! Transformation
        1. Designing a Transformation
        2. Previewing and running a Transformation
    5. Installing useful related software
    6. Summary
  3. Getting Started with Transformations
    1. Designing and previewing transformations
      1. Getting familiar with editing features
        1. Using the mouseover assistance toolbar
        2. Adding steps and creating hops
        3. Working with grids
      2. Designing transformations
        1. Putting the editing features in practice
        2. Previewing and fixing errors as they appear
        3. Looking at the results in the execution results pane
          1. The Logging tab
          2. The Step Metrics tab
      3. Running transformations in an interactive fashion
    2. Understanding PDI data and metadata
      1. Understanding the PDI rowset
      2. Adding or modifying fields by using different PDI steps
      3. Explaining the PDI data types
    3. Handling errors
      1. Implementing the error handling functionality
      2. Customizing the error handling
    4. Summary
  4. Creating Basic Task Flows
    1. Introducing jobs
      1. Learning the basics about jobs
      2. Creating a Simple Job
    2. Designing and running jobs
      1. Revisiting the Spoon interface and the editing features
      2. Designing jobs
        1. Getting familiar with the job design process
        2. Looking at the results in the Execution results window
          1. The Logging tab
          2. The Job metrics tab
        3. Enriching your work by sending an email
    3. Running transformations from a Job
      1. Using the Transformation Job Entry
    4. Understanding and changing the flow of execution
      1. Changing the flow of execution based on conditions
      2. Forcing a status with an abort Job or success entry
      3. Changing the execution to be synchronous
    5. Managing files
      1. Creating a Job that moves some files
        1. Selecting files and folders
        2. Working with regular expressions
      2. Summarizing the Job entries that deal with files
      3. Customizing the file management
    6. Knowing the basics about Kettle variables
      1. Understanding the kettle.properties file
      2. How and when you can use variables
    7. Summary
  5. Reading and Writing Files
    1. Reading data from files
      1. Reading a simple file
        1. Troubleshooting reading files
      2. Learning to read all kinds of files
        1. Specifying the name and location of the file
          1. Reading several files at the same time
          2. Reading files that are compressed or located on a remote server
          3. Reading a file whose name is known at runtime
        2. Describing the incoming fields
          1. Reading Date fields
          2. Reading Numeric fields
        3. Reading only a subset of the file
      3. Reading the most common kinds of sources
        1. Reading text files
        2. Reading spreadsheets
        3. Reading XML files
        4. Reading JSON files
    2. Outputting data to files
      1. Creating a simple file
      2. Learning to create all kinds of files and write data into them
        1. Providing the name and location of an output file
          1. Creating a file whose name is known only at runtime
          2. Creating several files whose names depend on the content of the file
        2. Describing the content of the output file
          1. Formatting Date fields
          2. Formatting Numeric fields
      3. Creating the most common kinds of files
        1. Creating text files
        2. Creating spreadsheets
        3. Creating XML files
        4. Creating JSON files
    3. Working with Big Data and cloud sources
      1. Reading files from an AWS S3 instance
      2. Writing files to an AWS S3 instance
      3. Getting data from HDFS
      4. Sending data to HDFS
    4. Summary
  6. Manipulating PDI Data and Metadata
    1. Manipulating simple fields
      1. Working with strings
        1. Extracting parts of strings using regular expressions
        2. Searching and replacing using regular expressions
      2. Doing some math with Numeric fields
      3. Operating with dates
        1. Performing simple operations on dates
          1. Subtracting dates with the Calculator step
        2. Getting information relative to the current date
          1. Using the Get System Info step
        3. Performing other useful operations on dates
          1. Getting the month names with a User Defined Java Class step
      4. Modifying the metadata of streams
    2. Working with complex structures
      1. Working with XML
        1. Introducing XML terminology
        2. Getting familiar with the XPath notation
        3. Parsing XML structures with PDI
          1. Reading an XML file with the Get data from XML step
          2. Parsing an XML structure stored in a field
        4. PDI Transformation and Job files
      2. Parsing JSON structures
        1. Introducing JSON terminology
        2. Getting familiar with the JSONPath notation
        3. Parsing JSON structures with PDI
          1. Reading a JSON file with the JSON input step
          2. Parsing a JSON structure stored in a field
    3. Summary
  7. Controlling the Flow of Data
    1. Filtering data
      1. Filtering rows upon conditions
        1. Reading a file and getting the list of words found in it
        2. Filtering unwanted rows with a Filter rows step
        3. Filtering rows by using the Java Filter step
      2. Filtering data based on row numbers
    2. Splitting streams unconditionally
      1. Copying rows
      2. Distributing rows
      3. Introducing partitioning and clustering
    3. Splitting the stream based on conditions
      1. Splitting a stream based on a simple condition
      2. Exploring PDI steps for splitting a stream based on conditions
    4. Merging streams in several ways
      1. Merging two or more streams
      2. Customizing the way of merging streams
    5. Looking up data
      1. Looking up data with a Stream lookup step
    6. Summary
  8. Cleansing, Validating, and Fixing Data
    1. Cleansing data
      1. Cleansing data by example
        1. Standardizing information
        2. Improving the quality of data
      2. Introducing PDI steps useful for cleansing data
      3. Dealing with non-exact matches
        1. Cleansing by doing a fuzzy search
        2. Deduplicating non-exact matches
    2. Validating data
      1. Validating data with PDI
        1. Validating and reporting errors to the log
      2. Introducing common validations and their implementation with PDI
    3. Treating invalid data by splitting and merging streams
      1. Fixing data that doesn't match the rules
    4. Summary
  9. Manipulating Data by Coding
    1. Doing simple tasks with the JavaScript step
      1. Using the JavaScript language in PDI
      2. Inserting JavaScript code using the JavaScript step
        1. Adding fields
        2. Modifying fields
        3. Organizing your code
      3. Controlling the flow using predefined constants
      4. Testing the script using the Test script button
    2. Parsing unstructured files with JavaScript
    3. Doing simple tasks with the Java Class step
      1. Using the Java language in PDI
      2. Inserting Java code using the Java Class step
        1. Learning to insert java code in a Java Class step
        2. Data types equivalence
        3. Adding fields
        4. Modifying fields
        5. Controlling the flow with the putRow() function
      3. Testing the Java Class using the Test class button
    4. Getting the most out of the Java Class step
      1. Receiving parameters
      2. Reading data from additional steps
      3. Redirecting data to different target steps
      4. Parsing JSON structures
    5. Avoiding coding using purpose-built steps
    6. Summary
  10. Transforming the Dataset
    1. Sorting data
      1. Sorting a dataset with the sort rows step
    2. Working on groups of rows
      1. Aggregating data
      2. Summarizing the PDI steps that operate on sets of rows
    3. Converting rows to columns
      1. Converting row data to column data using the Row denormaliser step
      2. Aggregating data with a Row Denormaliser step
    4. Normalizing data
      1. Modifying the dataset with a Row Normaliser step
    5. Going forward and backward across rows
      1. Picking rows backward and forward with the Analytic Query step
    6. Summary
  11. Performing Basic Operations with Databases
    1. Connecting to a database and exploring its content
      1. Connecting with Relational Database Management Systems
      2. Exploring a database with the Database Explorer
    2. Previewing and getting data from a database
      1. Getting data from the database with the Table input step
      2. Using the Table input step to run flexible queries
        1. Adding parameters to your queries
        2. Using Kettle variables in your queries
    3. Inserting, updating, and deleting data
      1. Inserting new data into a database table
      2. Inserting or updating data with the Insert / Update step
      3. Deleting records of a database table with the Delete step
      4. Performing CRUD operations with more flexibility
    4. Verifying a connection, running DDL scripts, and doing other useful tasks
    5. Looking up data in different ways
      1. Doing simple lookups with the Database Value Lookup step
        1. Making a performance difference when looking up data in a database
      2. Performing complex database lookups
        1. Looking for data using a Database join step
        2. Looking for data using a Dynamic SQL row step
    6. Summary
  12. Loading Data Marts with PDI
    1. Preparing the environment
      1. Exploring the Jigsaw database model
      2. Creating the database and configuring the environment
    2. Introducing dimensional modeling
    3. Loading dimensions with data
      1. Learning the basics of dimensions
        1. Understanding dimensions technical details
      2. Loading a time dimension
      3. Introducing and loading Type I slowly changing dimensions
        1. Loading a Type I SCD with a combination lookup/update step
      4. Introducing and loading Type II slowly changing dimension
        1. Loading Type II SCDs with a dimension lookup/update step
          1. Loading a Type II SCD for the first time
          2. Loading a Type II SCD and verifying how history is kept
      5. Explaining and loading Type III SCD and Hybrid SCD
      6. Loading other kinds of dimensions
        1. Loading a mini dimension
        2. Loading junk dimensions
        3. Explaining degenerate dimensions
    4. Loading fact tables
      1. Learning the basics about fact tables
        1. Deciding the level of granularity
      2. Translating the business keys into surrogate keys
        1. Obtaining the surrogate key for Type I SCD
        2. Obtaining the surrogate key for Type II SCD
        3. Obtaining the surrogate key for the junk dimension
        4. Obtaining the surrogate key for a time dimension
      3. Loading a cumulative fact table
      4. Loading a snapshot fact table
        1. Loading a fact table by inserting snapshot data
        2. Loading a fact table by overwriting snapshot data
    5. Summary
  13. Creating Portable and Reusable Transformations
    1. Defining and using Kettle variables
      1. Introducing all kinds of Kettle variables
        1. Explaining predefined variables
        2. Revisiting the kettle.properties file
        3. Defining variables at runtime
          1. Setting a variable with a constant value
          2. Setting a variable with a value unknown beforehand
          3. Setting variables with partial or total results of your flow
        4. Defining and using named parameters
      2. Using variables as fields of your stream
    2. Creating reusable Transformations
      1. Creating and executing sub-transformations
        1. Creating and testing a sub-transformation
        2. Executing a sub-transformation
      2. Introducing more elaborate sub-transformations
    3. Making the data flow between transformations
      1. Transferring data using the copy/get rows mechanism
    4. Executing transformations in an iterative way
      1. Using Transformation executors
      2. Configuring the executors with advanced settings
        1. Getting the results of the execution of the inner transformation
        2. Working with groups of data
        3. Using variables and named parameters
        4. Continuing the flow after executing the inner transformation
    5. Summary
  14. Implementing Metadata Injection
    1. Introducing metadata injection
      1. Explaining how metadata injection works
      2. Creating a template Transformation
      3. Injecting metadata
    2. Discovering metadata and injecting it
    3. Identifying use cases to implement metadata injection
    4. Summary
  15. Creating Advanced Jobs
    1. Enhancing your processes with the use of variables
      1. Running nested jobs
      2. Understanding the scope of variables
      3. Using named parameters
      4. Using variables to create flexible processes
        1. Using variables to name jobs and transformations
        2. Using variables to name Job and Transformation folders
    2. Accessing copied rows for different purposes
      1. Using the copied rows to manage files in advanced ways
      2. Using the copied rows as parameters of a Job or Transformation
    3. Working with filelists
      1. Maintaining a filelist
      2. Using the filelist for different purposes
        1. Attaching files in an email
        2. Copying, moving, and deleting files
        3. Introducing other ways to process the filelist
    4. Executing jobs in an iterative way
      1. Using Job executors
      2. Configuring the executors with advanced settings
        1. Getting the results of the execution of the job
        2. Working with groups of data
        3. Using variables and named parameters
        4. Capturing the result filenames
    5. Summary
  16. Launching Transformations and Jobs from the Command Line
    1. Using the Pan and Kitchen utilities
      1. Running jobs and transformations
      2. Checking the exit code
    2. Supplying named parameters and variables
    3. Using command-line arguments
      1. Deciding between the use of a command-line argument and named parameters
    4. Sending the output of executions to log files
    5. Automating the execution
    6. Summary
  17. Best Practices for Designing and Deploying a PDI Project
    1. Setting up a new project
      1. Setting up the local environment
      2. Defining a folder structure for the project
      3. Dealing with external resources
      4. Defining and adopting a versioning system
    2. Best practices to design jobs and transformations
      1. Styling your work
      2. Making the work portable
      3. Designing and developing reusable jobs and transformations
    3. Maximizing the performance
      1. Analyzing Steps Metrics
      2. Analyzing performance graphs
    4. Deploying the project in different environments
      1. Modifying the Kettle home directory
        1. Modifying the Kettle home in Windows
        2. Modifying the Kettle home in Unix-like systems
    5. Summary