Chapter 6. Introducing the ML Package

In the previous chapter, we worked with the MLlib package in Spark that operated strictly on RDDs. In this chapter, we move to the ML part of Spark that operates strictly on DataFrames. Also, according to the Spark documentation, the primary machine learning API for Spark is now the DataFrame-based set of models contained in the spark.ml package.

So, let's get to it!

Note

In this chapter, we will reuse a portion of the dataset we played within the previous chapter. The data can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz.

In this chapter, you will learn how to do the following:

  • Prepare transformers, estimators, and pipelines
  • Predict the chances of infant survival using ...

Get Learning PySpark now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.