Improving the quality of data

In the Filtering Data section in Chapter 6, Controlling the Flow of Data, you identified the words found in a text file. On that occasion, you already did some cleaning by removing from the text all the characters that weren't part of valid words, such as parentheses and hyphens. Recall that you used the Replace in String step for this.
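As a reminder of what that kind of cleaning amounts to, here is a minimal sketch outside of PDI, written in Python. It is not the Transformation from Chapter 6: the pattern and the function names `clean_line` and `extract_words` are assumptions made for illustration, and the sketch simply treats anything that is not an ASCII letter or whitespace as noise.

```python
import re

# Assumed pattern (not taken from the book): any character that is not an
# ASCII letter or whitespace is considered noise.
NON_WORD_CHARS = re.compile(r"[^A-Za-z\s]")


def clean_line(line):
    """Replace every noise character with a space, as a replace step might."""
    return NON_WORD_CHARS.sub(" ", line)


def extract_words(line):
    """Clean the line, lowercase it, and split it into words."""
    return clean_line(line).lower().split()


if __name__ == "__main__":
    print(extract_words("Rocks (igneous) and semi-arid soils."))
```

Replacing the unwanted characters with a space rather than deleting them keeps hyphenated terms as two separate words instead of gluing them together; whether that is the right choice depends on your data.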

There is more cleansing that you can do on this text. For example, if your intention is to calculate some statistics on geology-related words, you might prefer to discard the many words that are valid in the English language but useless for your work. Let's look at a way to get rid of these:

  1. Open the Transformation from Chapter 6, Controlling the Flow of Data, ...
