Chapter 18. Real-Time Data Integration

In this chapter, we offer a closer look at how real-time data integration can be performed with Kettle. You'll start by exploring the main challenges and requirements to figure out when it will be useful for you to deploy this type of data integration.

After that, we explain why the transformation engine is a good match for streaming real-time BI solutions, and discuss the main pitfalls and considerations you need to keep in mind.

As an example, we include the full code for a (near) real-time Twitter client that continuously updates a database table for further consumption by a dashboard. Finally, we cover third-party software such as database log readers, and we provide guidelines on how to implement your own Java Message Service (JMS) solutions in Kettle.

Introduction to Real-Time ETL

In a typical data integration setting, jobs and transformations run at scheduled times. For example, it's common to have nightly, weekly, or monthly batch runs in place. The term batch run comes from the fact that a whole batch of data or group of work is processed in sequence in one go; this is also referred to as batch processing. Usually the batch is scheduled to run at a time when computing resources are readily available. Most data warehouses, for instance, are updated with batch processing during the night, when there are few users on the operational systems.

Usually it's sufficient to have nightly jobs in place to satisfy your requirements. In fact, the ...
