Reading a file and getting the list of words found in it

Let's start by reading a sample file.

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt. It contains the text of Geological Observations on South America by Charles Darwin (1809-1882), and you can download it from https://archive.org/details/geologicalobserv03620gut.

First, we will read the file and split the text so that each row holds a single word:

  1. Create a new Transformation.
  2. Read your file with the Text file input step. The trick here is to set the Separator to a character you do not expect to find in the file, such as |. That way, every line is read as a single field. Configure the Fields tab with ...
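
To see why the separator trick works, here is a minimal sketch of the same idea outside PDI, written in Python. It is only an illustration, not what the Text file input step does internally. The function names and the sample text are hypothetical, and splitting on whitespace is an assumption, since the step that splits the words is not shown above.

```python
import csv
import io

# Hypothetical sample text standing in for a few lines of smcng10.txt.
SAMPLE = "The Andes rise, abruptly\n\nfrom the plain\n"


def lines_as_single_fields(text, separator="|"):
    """Parse text as delimited data and keep the first field of every line.

    With a separator that never occurs in the text, the first field is the
    whole line. QUOTE_NONE stops quote characters from being treated specially.
    """
    reader = csv.reader(io.StringIO(text), delimiter=separator,
                        quoting=csv.QUOTE_NONE)
    # csv.reader yields an empty list for a blank line, so map it to "".
    return [row[0] if row else "" for row in reader]


def one_word_per_row(lines):
    """Split every line on whitespace (an assumption) into one word per row."""
    return [word for line in lines for word in line.split()]


if __name__ == "__main__":
    for word in one_word_per_row(lines_as_single_fields(SAMPLE)):
        print(word)
```

If the separator were a comma instead, the first sample line would be cut at "rise" and the rest would land in a second field, which is exactly what the unexpected | character avoids.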
