You are previewing Pentaho Data Integration 4 Cookbook.
O'Reilly logo
Pentaho Data Integration 4 Cookbook

Book Description

Over 70 recipes to solve ETL problems using Pentaho Kettle

  • Manipulate your data by exploring, transforming, validating, integrating, and more

  • Work with all kinds of data sources such as databases, plain files, and XML structures among others

  • Use Kettle in integration with other components of the Pentaho Business Intelligence Suite

  • Each recipe is a carefully organized sequence of instructions packed with screenshots, tables, and tips to complete the task as efficiently as possible

In Detail

Pentaho Data Integration (PDI, also called Kettle), one of the data integration tools leaders, is broadly used for all kind of data manipulation such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. Do you need quick solutions to the problems you face while using Kettle?

Pentaho Data Integration 4 Cookbook explains Kettle features in detail through clear and practical recipes that you can quickly apply to your solutions. The recipes cover a broad range of topics including processing files, working with databases, understanding XML structures, integrating with Pentaho BI Suite, and more.

Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to find quick solutions to your needs. The initial chapters explain the details about working with databases, files, and XML structures. Then you will see different ways for searching data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools.

Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks. There are examples and code that are ready for adaptation to individual needs.

Learn to solve data manipulation problems using the Pentaho Data Integration tool Kettle.

Table of Contents

  1. Pentaho Data Integration 4 Cookbook
    1. Pentaho Data Integration 4 Cookbook
    2. Credits
    3. About the Authors
    4. About the Reviewers
    5. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    6. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    7. 1. Working with Databases
      1. Introduction
        1. Sample databases
          1. Pentaho BI platform databases
      2. Connecting to a database
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding creating the same database connection over and over again
          2. Avoiding modifying jobs and transformations every time a connection changes
          3. Specifying advanced connection properties
          4. Connecting to a database not supported by Kettle
          5. Checking the database connection at run-time
      3. Getting data from a database
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Getting data from a database by providing parameters
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Parameters coming in more than one row
          2. Executing the SELECT statement several times, each for a different set of parameters
        5. See also
      5. Getting data from a database by running a query built at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Inserting or updating rows in a table
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Alternative solution if you just want to insert records
            2. Alternative solution if you just want to update rows
            3. Alternative way for inserting and updating
          3. See also
      7. Inserting new rows where a simple primary key has to be generated
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Using the Combination lookup/update for looking up
          3. See also
      8. Inserting new rows where the primary key has to be generated based on stored values
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
          3. See also
      9. Deleting data from a table
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. See also
      10. Creating or altering a database table from PDI (design time)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      11. Creating or altering a database table from PDI (runtime)
        1. How to do it...
        2. How it works...
        3. There's more...
        4. See also
      12. Inserting, deleting, or updating a table depending on a field
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Insert, update, and delete all-in-one
            2. Synchronizing after merge
          3. See also
      13. Changing the database connection at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      14. Loading a parent-child table
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. See also
    8. 2. Reading and Writing Files
      1. Introduction
      2. Reading a simple file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Alternative notation for a separator
          2. About file format and encoding
          3. About data types and formats
          4. Altering the names, order, or metadata of the fields coming from the file
          5. Reading files with fixed width fields
      3. Reading several files at the same time
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      4. Reading unstructured files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Master/detail files
          2. Log files
      5. Reading files having one field by row
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Reading files with some fields occupying two or more rows
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Writing a simple file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Changing headers
          2. Giving the output fields a format
      8. Writing an unstructured file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      9. Providing the name of a file (for reading or writing) dynamically
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Get System Info
          2. Generating several files simultaneously with the same structure, but different names
      10. Using the name of a file (or part of it) as a field
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Reading an Excel file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      12. Getting the value of specific cells in an Excel file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Labels and values horizontally arranged
          2. Looking for a given cell
      13. Writing an Excel file with several sheets
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      14. Writing an Excel file with a dynamic number of sheets
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
    9. 3. Manipulating XML Structures
      1. Introduction
      2. Reading simple XML files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. XML data in a field
          2. XML file name in a field
          3. ECMAScript for XML
        5. See also
      3. Specifying fields by using XPath notation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Getting data from a different path
          2. Getting data selectively
          3. Getting more than one node when the nodes share their XPath notation
          4. Saving time when specifying XPath
      4. Validating well-formed XML files
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. See also
      5. Validating an XML file against DTD definitions
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Validating an XML file against an XSD schema
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Generating a simple XML document
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Generating fields with XML structures
          3. See also
      8. Generating complex XML structures
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. See also
      9. Generating an HTML page using XML and XSL transformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    10. 4. File Management
      1. Introduction
      2. Copying or moving one or more files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Moving files
          2. Detecting the existence of the files before copying them
          3. Creating folders
        5. See also
      3. Deleting one or more files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Figuring out which files have been deleted
        5. See also
      4. Getting files from a remote server
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Specifying files to transfer
          2. Some considerations about connecting to an FTP server
          3. Access via SFTP
          4. Access via FTPS
          5. Getting information about the files being transferred
        5. See also
      5. Putting files on a remote server
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Copying or moving a custom list of files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Deleting a custom list of files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Comparing files and folders
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Comparing folders
      9. Working with ZIP files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding zipping files
          2. Avoiding unzipping files
        5. See also
    11. 5. Looking for Data
      1. Introduction
      2. Looking for values in a database table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Taking some action when the lookup fails
          2. Taking some action when there are too many results
          3. Looking for non-existent data
        5. See also
      3. Looking for values in a database (with complex conditions or multiple tables involved)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Looking for values in a database with extreme flexibility
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Looking for values in a variety of sources
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Looking for alternatives when the Stream Lookup step doesn't meet your needs
          2. Speeding up your transformation
          3. Using the Value Mapper step for looking up from a short list of values
        5. See also
      6. Looking for values by proximity
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      7. Looking for values consuming a web service
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      8. Looking for values over an intranet or Internet
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    12. 6. Understanding Data Flows
      1. Introduction
      2. Splitting a stream into two or more streams based on a condition
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding the use of Dummy steps
          2. Comparing against the value of a Kettle variable
          3. Avoiding the use of nested Filter Rows steps
          4. Overcoming the difficulties of complex conditions
      3. Merging rows of two streams with the same or different structures
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Making sure that the metadata of the streams is the same
          2. Telling Kettle how to merge the rows of your streams
        5. See also
      4. Comparing two streams and generating differences
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Using the differences to keep a table up to date
        5. See also
      5. Generating all possible pairs formed from two datasets
        1. How to do it...
        2. How it works...
        3. There's more...
          1. Getting variables in the middle of the stream
          2. Limiting the number of output rows
        4. See also
      6. Joining two or more streams based on given conditions
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Interspersing new rows between existent rows
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Executing steps even when your stream is empty
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      9. Processing rows differently based on the row number
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Identifying specific rows
          2. Identifying the last row in the stream
          3. Avoiding using an Add sequence step to enumerate the rows
        5. See also
    13. 7. Executing and Reusing Jobs and Transformations
      1. Introduction
        1. Sample transformations
          1. Sample transformation: Hello
          2. Sample transformation: Random list
          3. Sample transformation: Sequence
          4. Sample transformation: File list
        2. Launching jobs and transformations
      2. Executing a job or a transformation by setting static arguments and parameters
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Executing a job or a transformation from a job by setting arguments and parameters dynamically
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Executing a job or a transformation whose name is determined at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Executing part of a job once for every row in a dataset
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Accessing the copied rows from jobs, transformations, and other entries
          2. Executing a transformation once for every row in a dataset
          3. Executing a transformation or part of a job once for every file in a list of files
        5. See also
      6. Executing part of a job several times until a condition is true
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Implementing loops in a job
          2. Using the JavaScript step to control the execution of the entries in your job
        5. See also
      7. Creating a process flow
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Serializing/De-serializing data
            2. Other means for transferring or sharing data between transformations
      8. Moving part of a transformation to a subtransformation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    14. 8. Integrating Kettle and the Pentaho Suite
      1. Introduction
        1. A sample transformation
      2. Creating a Pentaho report with data coming from PDI
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      3. Configuring the Pentaho BI Server for running PDI jobs and transformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Executing a PDI transformation as part of a Pentaho process
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
            1. Specifying the location of the transformation
            2. Supplying values for named parameters, variables and arguments
            3. Keeping things simple when it's time to deliver a plain file
          3. See also
      5. Executing a PDI job from the Pentaho User Console
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Generating files from the PUC with PDI and the CDA plugin
        1. Getting ready
        2. How to do it...
          1. How it works...
          2. There's more...
      7. Populating a CDF dashboard with data coming from a PDI transformation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    15. 9. Getting the Most Out of Kettle
      1. Introduction
      2. Sending e-mails with attached files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Sending logs through an e-mail
          2. Sending e-mails in a transformation
      3. Generating a custom log file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Filtering the log file
          2. Creating a clean log file
          3. Isolating log files for different jobs or transformations
        5. See also
      4. Programming custom functionality
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Data type's equivalence
          2. Generalizing you code
          3. Looking up information with additional steps
          4. Customizing logs
          5. Scripting alternatives to the UDJC step
      5. Generating sample data for testing purposes
        1. How to do it...
        2. How it works...
        3. There's more...
          1. Using Data grid step to generate specific data
          2. Working with subsets of your data
        4. See also
      6. Working with Json files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Reading Json files dynamically
          2. Writing Json files
      7. Getting information about transformations and jobs (file-based)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Transformation XML nodes
          2. Job XML nodes
          3. Steps and entries information
        5. See also
      8. Getting information about transformations and jobs (repository-based)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Transformation tables
          2. Job tables
          3. Database connections tables
    16. A. Data Structures
      1. Book's data structure
        1. Books
        2. Authors
      2. Museum's data structure
        1. Museums
        2. Cities
      3. Outdoor data structure
        1. Products
        2. Categories
      4. Steel Wheels structure