Introducing partitioning and clustering

Distributing rows is a good way to take advantage of the available processing power. It improves performance, which is critical when the processing work is heavy or the dataset is huge.

A step further in the distribution of rows is the concept of partitioning. Partitioning also splits the dataset into several smaller datasets, but each row is assigned to one of them according to a rule applied to that row.

The standard partitioning method offered by PDI is Remainder of division. You choose a partitioning field, and PDI divides its value by the number of predefined partitions; the remainder of that division determines the partition to which the row is sent.
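The Remainder of division rule described above can be illustrated outside PDI with a few lines of Python. This is only a minimal sketch of the arithmetic, not PDI's implementation: the function name `partition_rows`, the field name `id`, and the sample rows are hypothetical, and the sketch assumes the partitioning field holds integers.

```python
from collections import defaultdict


def partition_rows(rows, field, num_partitions):
    """Group rows by (value of field) modulo num_partitions.

    Illustrative only: assumes the partitioning field is an integer.
    """
    partitions = defaultdict(list)
    for row in rows:
        # The remainder of the division selects the partition number.
        partition_number = row[field] % num_partitions
        partitions[partition_number].append(row)
    return dict(partitions)


# Hypothetical sample data: seven rows with ids 1 through 7.
rows = [{"id": i} for i in range(1, 8)]
result = partition_rows(rows, "id", 3)

for number in sorted(result):
    print(number, [row["id"] for row in result[number]])
```

Because the partition depends only on the value of the field, rows that share the same value always land in the same partition, which is what distinguishes partitioning from a plain distribution of rows.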

As an example, in our sample Transformation, we can create a partitioning schema with three partitions and choose ...
