Adding checksums to verify datasets

While there are many ways to verify that your datasets are valid, a common practice is to compute a checksum from the data and compare it with the checksum of a reference dataset. A checksum is a hash of the data passed to the algorithm that generates it, so two different inputs are very unlikely to produce the same value.

Kettle provides a way to add a checksum to each record in your dataset through the Add a Checksum step.

For this recipe, we will be comparing data between the roller coaster database and a flat file that may have new roller coasters listed in it.
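Outside of Kettle, the idea behind this comparison can be sketched in a few lines of Python. This is only an illustration of the technique, not how the Add a Checksum step is implemented: the MD5 algorithm, the field names, the sample rows, and the helper functions record_checksum and find_new_records are all assumptions made for the example.

```python
import hashlib


def record_checksum(record, fields):
    """Return an MD5 hex digest computed over the chosen fields of one record."""
    # Join the values with a separator that is unlikely to appear in the data,
    # so that ("ab", "c") and ("a", "bc") do not produce the same checksum.
    joined = "\x1f".join(str(record[f]) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_new_records(reference, candidates, fields):
    """Return the candidate records whose checksum is absent from the reference set."""
    known = {record_checksum(r, fields) for r in reference}
    return [r for r in candidates if record_checksum(r, fields) not in known]


# Hypothetical sample data standing in for the database and the flat file.
fields = ["name", "park"]
db_rows = [
    {"name": "Coaster A", "park": "Park X"},
    {"name": "Coaster B", "park": "Park Y"},
]
file_rows = [
    {"name": "Coaster A", "park": "Park X"},
    {"name": "Coaster C", "park": "Park Z"},
]

new_rows = find_new_records(db_rows, file_rows, fields)
print([r["name"] for r in new_rows])  # prints ['Coaster C']
```

Note that a record whose fields changed between the two sources also gets a different checksum, so this sketch reports it as new as well.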

Getting ready

For this recipe, you will need the files associated with this recipe, which can be downloaded from the book's site. More details about the files ...

