Chapter 9. Machine Learning

Machine learning computations aim to derive predictive models from current and historical data. The premise is that a learned algorithm improves with more training data or experience; in particular, machine learning algorithms can achieve highly effective results in narrow domains when their models are trained on large datasets.

Most machine learning algorithms therefore involve computation at scale, which makes them well suited to a distributed computing framework like Spark that can leverage large training sets to produce meaningful results. This chapter introduces Spark's built-in machine learning library, Spark MLlib, which provides many common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as a newer "ML pipeline" framework, spark.ml, which offers a uniform set of high-level APIs for creating and tuning practical machine learning pipelines.

Scalable Machine Learning with Spark

In Chapter 4, we introduced Spark as an in-memory distributed computing engine that runs on a Hadoop cluster. The Spark platform also ships with several built-in components that use Spark's processing engine to enable other types of analytical workloads, all of which benefit from Spark's computational optimizations. In this chapter, we'll take a closer look at Spark's built-in machine ...
