O'Reilly live online training

Machine Learning with Spark ML Libraries


Spark and the Spark machine-learning libraries together provide a scalable enterprise-level framework for implementing an end-to-end machine-learning pipeline from data ingestion to model tuning. This framework aims to bridge the gap between building prototypes in R and Python and implementing production-level software.

Join Vartika Singh to learn how to apply components of the Spark machine-learning framework to your datasets and start extracting value. Through hands-on exercises of increasing complexity, you’ll learn how to do feature engineering in Spark and produce a Scala/Python implementation that leverages Spark in cluster mode to process data.

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • The general concepts behind Spark ML Pipelines
  • The feature engineering components of the Spark machine-learning framework
  • A selection of basic Spark machine-learning algorithms

And you’ll be able to:

  • Construct pipelines in Spark
  • Do feature engineering in Spark
  • Apply a few basic machine-learning algorithms to sample datasets

This training course is for you because...

  • You are a data scientist or engineer who works in R or Python, has general experience with Spark, and would like to start working with Spark’s machine-learning libraries


Prerequisites:

  • A basic understanding of machine learning and Spark

Materials and downloads needed:

  • A machine with IntelliJ and Spark 2.1 installed

Recommended Preparation:

Thoughtful Machine Learning with Python

Learning Spark

Scalable Machine Learning

Learning Path: Machine Learning

About your instructor

  • Vartika Singh is a solutions architect at Cloudera working primarily in Spark and machine learning. Vartika has more than 15 years of experience with applied machine learning. Previously, she was the team lead for data science at Digilant.


The time frames are only estimates and may vary according to how the class is progressing.

Day One

  • Spark and Spark ML library: Overview of Spark, drivers and executors, and general configuration parameters (20 minutes)

  • ML Pipelines: Overview of ML Pipelines, extractors and transformers, and storage and retrieval (5 minutes)

  • Transformers: StringIndexer, IndexToString, OneHotEncoder, VectorIndexer, Tokenizer, and StopWordsRemover (30 minutes)

  • Extractors: TF-IDF, Word2Vec, and CountVectorizer (20 minutes)

  • Exercise (10 minutes) 

  • Break (20 minutes) 

  • Model selection and tuning: Cross-validation and train-validation split (10 minutes)

  • Evaluation metrics (10 minutes)

  • Basic statistics (10 minutes)

  • Exercise and Q&A (20 minutes) 

  • Feature selection: ChiSqSelector, other selectors, and dimensionality reduction (PCA) (15 minutes) 

  • Q&A (10 minutes)

Day Two

  • Clustering: k-means, LDA, and the Gaussian mixture model (20 minutes)
  • Exercise and Q&A (15 minutes)
  • Classification: Logistic regression, random forest classifier, naive Bayes, and multilayer perceptron classifier (20 minutes)
  • Exercise and Q&A (15 minutes)
  • Deep Learning on Spark (25 minutes)

  • Break (20 minutes)

  • Regression: Linear regression and decision tree regression (15 minutes)
  • Collaborative filtering (15 minutes)
  • Optimization and other topics (15 minutes)
  • Exercise (10 minutes)
  • Q&A (10 minutes)