Getting data from Hadoop

Kettle simplifies loading data into Hadoop, and it makes pulling data back out of the Hadoop File System equally easy. In fact, we can treat it like any other flat-file data source.

Getting ready

For this recipe, we will be using the Baseball Dataset that was loaded into Hadoop in the recipe Loading data into Hadoop (also in this chapter). We recommend completing that recipe before continuing.

We will be focusing on the Salaries.csv and the Master.csv datasets. Let us find out just how much money each player earned over the course of their careers.
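
To make the goal concrete before building anything in Spoon, here is a minimal Python sketch of the calculation we are after: sum each player's salary across all seasons, then look up the player's name. This is not how Kettle does it; it only illustrates the logic. The column names (playerID, salary, nameFirst, nameLast) are assumed from the usual layout of the Baseball Dataset, the function name career_earnings is our own, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def career_earnings(salaries_file, master_file):
    """Return {playerID: (full name, total salary)} for every player with salary rows."""
    # Sum salary per player across all seasons (assumed columns: playerID, salary).
    totals = defaultdict(int)
    for row in csv.DictReader(salaries_file):
        totals[row["playerID"]] += int(row["salary"] or 0)

    # Build a playerID -> name lookup (assumed columns: nameFirst, nameLast).
    names = {}
    for row in csv.DictReader(master_file):
        names[row["playerID"]] = f'{row["nameFirst"]} {row["nameLast"]}'

    # Players missing from the master file fall back to their ID as the name.
    return {pid: (names.get(pid, pid), total) for pid, total in totals.items()}


# Made-up sample rows standing in for Salaries.csv and Master.csv.
SALARIES = """yearID,teamID,lgID,playerID,salary
2001,AAA,NL,doej01,100000
2002,AAA,NL,doej01,150000
2001,BBB,AL,roej01,200000
"""
MASTER = """playerID,nameFirst,nameLast
doej01,John,Doe
roej01,Jane,Roe
"""

result = career_earnings(io.StringIO(SALARIES), io.StringIO(MASTER))
for pid, (name, total) in sorted(result.items()):
    print(pid, name, total)
```

The function accepts any file-like objects, so the same logic would apply to the real files once they have been copied out of Hadoop to a local path and opened with open().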

How to do it...

Perform the following steps to retrieve the baseball data from Hadoop:

  1. Open Spoon and create a new transformation.
  2. In the Design tab, under the
