Building a random data sample for Weka

Weka is another open source tool officially supported by Pentaho, and it focuses on data mining. Like its cousins R and RapidMiner, Weka provides a library of statistical analysis tools that can be integrated into complex decision-making systems. In this recipe, we will go over how to build a random dataset for Weka using Kettle.

Getting ready

We will be using the baseball player salaries data, which is available from the book's website or from Lahman's Baseball Archive at http://www.seanlahman.com/baseball-archive/statistics/. The code for this recipe is also available on the book's website.

This recipe also takes advantage of the ARFF Output plugin. This is available either via the Marketplace ...
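To make the target concrete, here is a minimal Python sketch, written outside of Kettle, of the two ideas involved: drawing a random sample of rows and writing them in ARFF, Weka's text-based file format. It is an illustration under stated assumptions, not the recipe's transformation. The function names `reservoir_sample` and `to_arff` are hypothetical, the column names are assumed to match the Lahman salaries file, the rows are made-up placeholders rather than real salary data, and reservoir sampling is simply one common way to sample a stream of unknown length.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def to_arff(relation, attributes, rows):
    """Render rows as ARFF text: a header of attributes, then the data lines.

    This sketch does not quote values, so it only suits values without
    spaces, commas, or quote characters.
    """
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed column layout; placeholder rows, NOT real salary data.
    attributes = [
        ("yearID", "numeric"),
        ("teamID", "string"),
        ("playerID", "string"),
        ("salary", "numeric"),
    ]
    rows = [(2000 + n, f"T{n:02d}", f"player{n:02d}", 100000 + n) for n in range(50)]
    print(to_arff("salaries_sample", attributes, reservoir_sample(rows, 5, seed=42)))
```

Passing a fixed `seed` makes the sample reproducible, which helps when comparing Weka results across runs.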
