Live Online Training

Introduction to Apache Spark 2.1

Introduction to Spark for Data Processing, Analytics, and Machine Learning

Adam Breindel

Learn modern best practices, using the latest Spark features, for high-performance analytics, processing, and modeling on large-scale datasets. The course uses elementary Scala and is accessible to anyone with basic Scala or Python knowledge. It introduces the broad functionality of Spark 2.1, with examples and hands-on activities to follow along with in a notebook environment.

What you'll learn, and how you can apply it

  • How Spark executes queries and jobs over heterogeneous, distributed data
  • How Spark applications and clusters operate
  • Parallel data processing
  • How Spark analyzes queries or computations and executes them in a distributed cluster
  • Using the newest Spark APIs, features, and best practices, which are missing from the large body of online Spark material based on earlier versions of Spark

Participants will be able to:

  • Author data processing and transformation scripts
  • Query and analyze data
  • Train, evaluate, and deploy machine learning (predictive analytics) models

This training course is for you because...

  1. You are a data analyst with a SQL background and you need to implement reports or analytic queries over large, heterogeneous datasets.
  2. You are a data engineer with a programming or scripting background and you need to plan or operate data processing clusters and pipelines.
  3. You are a data scientist with a background in Python and you need to train models on large-scale datasets, or apply an existing model to large datasets.

Prerequisites

  • Elementary programming skill in Scala or Python
  • Basic familiarity with the Java Virtual Machine (JVM) is helpful but not required
  • Previous knowledge of Spark is not necessary

Recommendations for downloads of various open-source software will be provided to enrollees before the start of the course.

Recommended Preparation:

Introduction to Apache Spark

About your instructor

  • Presented by Adam Breindel

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1:

  • Welcome, Intro to Spark (15 minutes)
  • Spark, Distributed Data Basics (45 minutes)
  • Programming Spark with SQL, DataFrame, and Dataset APIs (2 hours)

Day 2:

  • Identifying some problems and fixes using the Spark Web UI (30 minutes)
  • RDDs vs. DataFrame/Dataset (30 minutes)
  • Spark Streaming Basics (1 hour 30 minutes)

Day 3:

  • Spark Machine Learning Intro (1 hour 30 minutes)
  • Spark clustering and deployment options, Q&A (1 hour)