Chapter 9. Practical Machine Learning with Spark

In the previous chapter, we covered the main data processing functionality of Spark. In this chapter, we focus on data science with Spark, working through a real data problem. You will learn:

  • How to share variables across a cluster's nodes
  • How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
  • How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, which greatly simplifies preprocessing
  • How to handle missing data in the dataset
  • Which feature engineering algorithms Spark provides out of the box and how to use them in a real-world scenario
  • Which learners are available ...

