Machine learning pipelines in Apache Spark

To close this chapter, we will look at how Apache Spark can be used to implement the algorithms discussed previously, by examining how its machine learning library, MLlib, works under the hood. MLlib provides a suite of tools designed to make machine learning accessible, scalable, and easy to deploy.

Note that as of Spark 2.0, the MLlib RDD-based API is in maintenance mode. The examples in this book will use the DataFrame-based API, which is now the primary API for MLlib. For more information, please visit https://spark.apache.org/docs/latest/ml-guide.html.

At a high level, the typical implementation of machine learning models can be thought of as an ordered pipeline of algorithms, ...
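To make the idea of an ordered pipeline concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use MLlib's API. The class names (`Tokenize`, `KeywordEstimator`, `KeywordModel`, `ToyPipeline`, `ToyPipelineModel`) and the toy data are hypothetical, chosen only for illustration. The sketch loosely mirrors the distinction MLlib draws between transformers (stages that only transform data) and estimators (stages that are fitted to data and yield a model), and shows the stages being applied in order.

```python
# Toy illustration only: plain Python, not the MLlib API.
# All class names and data below are hypothetical.

class Tokenize:
    """Transformer-like stage: stateless, adds a 'words' field to each row."""
    def transform(self, rows):
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


class KeywordModel:
    """Fitted model: flags a row as positive if it contains a learned keyword."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if self.keywords & set(r["words"]) else 0.0}
            for r in rows
        ]


class KeywordEstimator:
    """Estimator-like stage: fit() learns words seen only in positive rows."""
    def fit(self, rows):
        positive, negative = set(), set()
        for r in rows:
            (positive if r["label"] == 1.0 else negative).update(r["words"])
        return KeywordModel(positive - negative)


class ToyPipelineModel:
    """Result of fitting a pipeline: every stage is now a plain transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Runs stages in order; stages with fit() are fitted before transforming."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [
    {"text": "spark makes pipelines easy", "label": 1.0},
    {"text": "the weather is nice", "label": 0.0},
    {"text": "spark is fast", "label": 1.0},
]
unseen = [{"text": "I like Spark"}, {"text": "nice weather today"}]

model = ToyPipeline([Tokenize(), KeywordEstimator()]).fit(train)
predictions = model.transform(unseen)
for row in predictions:
    print(row["text"], "->", row["prediction"])
```

The point of the sketch is the ordering: each stage consumes the output of the previous one, and fitting the pipeline produces a model that replays the same sequence on new data.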
