You are previewing Pentaho Data Integration Cookbook Second Edition.
O'Reilly logo
Pentaho Data Integration Cookbook Second Edition

Book Description

Over 100 recipes for building open source ETL solutions with Pentaho Data Integration

  • Intergrate Kettle in integration with other components of the Pentaho Business Intelligence Suite, to build and publish Mondrian schemas,create reports, and populatedashboards

  • This book contains an organized sequence of recipes packed with screenshots, tables, and tips so you can complete the tasks as efficiently as possible

  • Manipulate your data by exploring, transforming, validating, integrating, and performing data analysis

In Detail

Pentaho Data Integration is the premier open source ETL tool, providing easy, fast, and effective ways to move and transform data. While PDI is relatively easy to pick up, it can take time to learn the best practices so you can design your transformations to process data faster and more efficiently. If you are looking for clear and practical recipes that will advance your skills in Kettle, then this is the book for you.

Pentaho Data Integration Cookbook Second Edition guides you through the features of explains the Kettle features in detail and provides easy to follow recipes on file management and databases that can throw a curve ball to even the most experienced developers.

Pentaho Data Integration Cookbook Second Edition provides updates to the material covered in the first edition as well as new recipes that show you how to use some of the key features of PDI that have been released since the publication of the first edition. You will learn how to work with various data sources – from relational and NoSQL databases, flat files, XML files, and more. The book will also cover best practices that you can take advantage of immediately within your own solutions, like building reusable code, data quality, and plugins that can add even more functionality.

Pentaho Data Integration Cookbook Second Edition will provide you with the recipes that cover the common pitfalls that even seasoned developers can find themselves facing. You will also learn how to use various data sources in Kettle as well as advanced features.

Table of Contents

  1. Pentaho Data Integration Cookbook Second Edition
    1. Table of Contents
    2. Pentaho Data Integration Cookbook Second Edition
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Working with Databases
      1. Introduction
        1. Sample databases
        2. Pentaho BI platform databases
      2. Connecting to a database
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding creating the same database connection over and over again
          2. Avoiding modifying jobs and transformations every time a connection changes
          3. Specifying advanced connection properties
          4. Connecting to a database not supported by Kettle
          5. Checking the database connection at runtime
      3. Getting data from a database
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Getting data from a database by providing parameters
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Parameters coming in more than one row
          2. Executing the SELECT statement several times, each for a different set of parameters
        5. See also
      5. Getting data from a database by running a query built at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Inserting or updating rows in a table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Alternative solution if you just want to insert records
          2. Alternative solution if you just want to update rows
          3. Alternative way for inserting and updating
        5. See also
      7. Inserting new rows where a simple primary key has to be generated
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Using the Combination lookup/update for looking up
        5. See also
      8. Inserting new rows where the primary key has to be generated based on stored values
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      9. Deleting data from a table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      10. Creating or altering a database table from PDI (design time)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      11. Creating or altering a database table from PDI (runtime)
        1. How to do it...
        2. How it works...
        3. There's more...
        4. See also
      12. Inserting, deleting, or updating a table depending on a field
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Insert, update, and delete all-in-one
          2. Synchronizing after merge
        5. See also
      13. Changing the database connection at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      14. Loading a parent-child table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      15. Building SQL queries via database metadata
        1. Getting ready
        2. How to do It...
        3. How it works...
        4. See also
      16. Performing repetitive database design tasks from PDI
        1. Getting ready
        2. How to do It...
        3. How it works...
        4. See also
    9. 2. Reading and Writing Files
      1. Introduction
      2. Reading a simple file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Alternative notation for a separator
          2. About file format and encoding
          3. About data types and formats
          4. Altering the names, order, or metadata of the fields coming from the file
          5. Reading files with fixed width fields
      3. Reading several files at the same time
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      4. Reading semi-structured files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Master/detail files
          2. Logfiles
        5. See also
      5. Reading files having one field per row
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Reading files with some fields occupying two or more rows
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Writing a simple file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Changing headers
          2. Giving the output fields a format
      8. Writing a semi-structured file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      9. Providing the name of a file (for reading or writing) dynamically
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Get System Info
          2. Generating several files simultaneously with the same structure, but different names
      10. Using the name of a file (or part of it) as a field
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Reading an Excel file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      12. Getting the value of specific cells in an Excel file
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Looking for a given cell
      13. Writing an Excel file with several sheets
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      14. Writing an Excel file with a dynamic number of sheets
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      15. Reading data from an AWS S3 Instance
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
    10. 3. Working with Big Data and Cloud Sources
      1. Introduction
      2. Loading data into Salesforce.com
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      3. Getting data from Salesforce.com
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      4. Loading data into Hadoop
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Getting data from Hadoop
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Loading data into HBase
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Getting data from HBase
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Loading data into MongoDB
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Getting data from MongoDB
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
    11. 4. Manipulating XML Structures
      1. Introduction
      2. Reading simple XML files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. XML data in a field
          2. XML file name in a field
        5. See also
      3. Specifying fields by using the Path notation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Getting data from a different path
          2. Getting data selectively
          3. Getting more than one node when the nodes share their Path notation
          4. Saving time when specifying Path
      4. Validating well-formed XML files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      5. Validating an XML file against DTD definitions
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Validating an XML file against an XSD schema
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Generating a simple XML document
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Generating fields with XML structures
        5. See also
      8. Generating complex XML structures
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Generating an HTML page using XML and XSL transformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      10. Reading an RSS Feed
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      11. Generating an RSS Feed
        1. Getting ready
        2. How to do it...
        3. How it works
        4. There's more...
        5. See also
    12. 5. File Management
      1. Introduction
      2. Copying or moving one or more files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Moving files
          2. Detecting the existence of the files before copying them
          3. Creating folders
        5. See also
      3. Deleting one or more files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Figuring out which files have been deleted
        5. See also
      4. Getting files from a remote server
        1. How to do it...
        2. How it works...
        3. There's more...
          1. Specifying files to transfer
          2. Some considerations about connecting to an FTP server
          3. Access via SFTP
          4. Access via FTPS
          5. Getting information about the files being transferred
        4. See also
      5. Putting files on a remote server
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Copying or moving a custom list of files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Deleting a custom list of files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Comparing files and folders
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Comparing folders
      9. Working with ZIP files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding zipping files
          2. Avoiding unzipping files
        5. See also
      10. Encrypting and decrypting files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    13. 6. Looking for Data
      1. Introduction
      2. Looking for values in a database table
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Taking some action when the lookup fails
          2. Taking some action when there are too many results
          3. Looking for non-existent data
        5. See also
      3. Looking for values in a database with complex conditions
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Looking for values in a database with dynamic queries
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Looking for values in a variety of sources
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Looking for alternatives when the Stream Lookup step doesn't meet your needs
          2. Speeding up your transformation
          3. Using the Value Mapper step for looking up from a short list of values
        5. See also
      6. Looking for values by proximity
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      7. Looking for values by using a web service
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      8. Looking for values over intranet or the Internet
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      9. Validating data at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    14. 7. Understanding and Optimizing Data Flows
      1. Introduction
      2. Splitting a stream into two or more streams based on a condition
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Avoiding the use of Dummy steps
          2. Comparing against the value of a Kettle variable
          3. Avoiding the use of nested Filter rows steps
          4. Overcoming the difficulties of complex conditions
      3. Merging rows of two streams with the same or different structures
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Making sure that the metadata of the streams is the same
          2. Telling Kettle how to merge the rows of your streams
        5. See also
      4. Adding checksums to verify datasets
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Comparing two streams and generating differences
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Using the differences to keep a table up-to-date
        5. See also
      6. Generating all possible pairs formed from two datasets
        1. How to do it...
        2. How it works...
        3. There's more...
          1. Getting variables in the middle of the stream
          2. Limiting the number of output rows
        4. See also
      7. Joining two or more streams based on given conditions
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      8. Interspersing new rows between existent rows
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Executing steps even when your stream is empty
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      10. Processing rows differently based on the row number
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Identifying specific rows
          2. Identifying the last row in the stream
          3. Avoiding using an Add sequence step to enumerate the rows
        5. See also
      11. Processing data into shared transformations via filter criteria and subtransformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      12. Altering a data stream with Select values
        1. How to do it...
        2. How it works...
      13. Processing multiple jobs or transformations in parallel
        1. How to do it...
        2. How it works...
        3. See also
    15. 8. Executing and Re-using Jobs and Transformations
      1. Introduction
        1. Sample transformations
          1. Sample transformation – hello
          2. Sample transformation – random list
          3. Sample transformation – sequence
          4. Sample transformation – file list
      2. Launching jobs and transformations
        1. How to do it...
        2. How it works...
      3. Executing a job or a transformation by setting static arguments and parameters
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Executing a job or a transformation from a job by setting arguments and parameters dynamically
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Executing a job or a transformation whose name is determined at runtime
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      6. Executing part of a job once for every row in a dataset
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Accessing the copied rows from jobs, transformations, and other entries
          2. Executing a transformation once for every row in a dataset
          3. Executing a transformation or part of a job once for every file in a list of files
        5. See also
      7. Executing part of a job several times until a condition is true
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Implementing loops in a job
          2. Using the JavaScript step to control the execution of the entries in your job
        5. See also
        6. Creating a process flow
        7. Getting ready
        8. How to do it...
        9. How it works...
        10. There's more...
          1. Serializing/De-serializing data
          2. Other means for transferring or sharing data between transformations
      8. Moving part of a transformation to a subtransformation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      9. Using Metadata Injection to re-use transformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    16. 9. Integrating Kettle and the Pentaho Suite
      1. Introduction
        1. A sample transformation
      2. Creating a Pentaho report with data coming from PDI
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      3. Creating a Pentaho report directly from PDI
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Configuring the Pentaho BI Server for running PDI jobs and transformations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Executing a PDI transformation as part of a Pentaho process
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Specifying the location of the transformation
          2. Supplying values for named parameters, variables and arguments
          3. Keeping things simple when it's time to deliver a plain file
        5. See also
      6. Executing a PDI job from the Pentaho User Console
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Generating files from the PUC with PDI and the CDA plugin
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      8. Populating a CDF dashboard with data coming from a PDI transformation
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    17. 10. Getting the Most Out of Kettle
      1. Introduction
      2. Sending e-mails with attached files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Sending logs through an e-mail
          2. Sending e-mails in a transformation
      3. Generating a custom logfile
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Filtering the logfile
          2. Creating a clean logfile
          3. Isolating logfiles for different jobs or transformations
        5. See also
      4. Running commands on another server
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      5. Programming custom functionality
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Data type's equivalence
          2. Generalizing your UDJC code
          3. Looking up information with additional steps
          4. Customizing logs
          5. Scripting alternatives to the UDJC step
      6. Generating sample data for testing purposes
        1. How to do it...
        2. How it works...
        3. There's more...
          1. Using a Data grid step to generate specific data
          2. Working with subsets of your data
        4. See also
      7. Working with JSON files
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Reading JSON files dynamically
          2. Writing JSON files
      8. Getting information about transformations and jobs (file-based)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Job XML nodes
          2. Steps and entries information
        5. See also
      9. Getting information about transformations and jobs (repository-based)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Transformation tables
          2. Job tables
          3. Database connections tables
      10. Using Spoon's built-in optimization tools
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    18. 11. Utilizing Visualization Tools in Kettle
      1. Introduction
      2. Managing plugins with the Marketplace
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      3. Data profiling with DataCleaner
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      4. Visualizing data with AgileBI
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      5. Using Instaview to analyze and visualize data
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    19. 12. Data Analytics
      1. Introduction
      2. Reading data from a SAS datafile
        1. Why read a SAS file?
        2. Getting ready
        3. How to do it...
        4. How it works...
        5. See also
      3. Studying data via stream statistics
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      4. Building a random data sample for Weka
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
    20. A. Data Structures
      1. Books data structure
        1. Books
        2. Authors
      2. museums data structure
        1. museums
        2. cities
      3. outdoor data structure
        1. products
        2. categories
      4. Steel Wheels data structure
      5. Lahman Baseball Database
    21. B. References
      1. Books
      2. Online
    22. Index