Chapter 9. Practical Machine Learning with Spark

In the previous chapter, we covered the main data-processing functionality of Spark. In this chapter, we focus on data science with Spark, applied to a real data problem. You will learn about the following topics:

How to share variables across a cluster's nodes
How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, thus making the preprocessing extremely easy
How to handle missing data in the dataset
Which algorithms are available out of the box in Spark for feature engineering and how to use them in a real-world scenario
Which learners are available ...

Python: Real World Machine Learning by Prateek Joshi, John Hearty, Bastiaan Sjardin, Luca Massaron, Alberto Boschetti