The final thing

As we mentioned earlier, one of the interesting additions to Spark 2.0.0 is the ML pipeline. A pipeline is a linear sequence of transformers and estimators: a transformer turns one Dataset into another, and an estimator is fit on a Dataset to produce a transformer. The classes we have been using are all one or the other. We had a decent pipeline for our classification example, as follows:

We started with Passengers, which was the Dataset that we read in.

  • Passengers1 was after the feature extraction.
  • Passengers2 was after StringIndexer.
  • Passengers3 was after the na.drop() function.
  • Passengers4 was after the VectorAssembler() function.
  • The algTree object was the algorithm object.

We would have created a pipeline:

val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))

Then, we would ...
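To make the stages above concrete, here is a minimal, self-contained sketch of how such a pipeline might be assembled and run. The toy rows, the column names (Pclass, Sex, Gender, Age, Survived), and the choice of DecisionTreeClassifier for algTree are assumptions made for illustration, not the book's actual data or settings. The fit and transform calls are the standard Pipeline API, not a reconstruction of the truncated text above.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object TreePipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TreePipelineSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy rows standing in for the passengers Dataset.
    val passengers = Seq(
      (1, "female", 38.0, 1.0),
      (3, "male", 22.0, 0.0),
      (3, "female", 26.0, 1.0),
      (1, "male", 54.0, 0.0),
      (2, "female", 14.0, 1.0),
      (3, "male", 35.0, 0.0)
    ).toDF("Pclass", "Sex", "Age", "Survived")

    // Estimator: learns a string-to-index mapping for the Sex column.
    val indexer = new StringIndexer()
      .setInputCol("Sex")
      .setOutputCol("Gender")

    // Transformer: packs the numeric columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("Pclass", "Gender", "Age"))
      .setOutputCol("features")

    // Estimator: the classification algorithm (assumed to be a decision tree).
    val algTree = new DecisionTreeClassifier()
      .setLabelCol("Survived")
      .setFeaturesCol("features")

    val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))

    // fit() runs the stages in order and returns a PipelineModel;
    // transform() then applies every fitted stage to a Dataset.
    val model = treePipeline.fit(passengers)
    model.transform(passengers).select("Survived", "prediction").show()

    spark.stop()
  }
}
```

Because the pipeline owns every stage, we no longer need to name the intermediate Datasets ourselves; the same fitted model can be applied to any Dataset with the same input columns.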
