Introduction to PySpark

Video Description

In this Introduction to PySpark training course, expert author Alex Robbins teaches the Spark Python API, from installation through advanced topics. The course is designed for users who already have a basic working knowledge of Python.

You will start by learning how to install Spark, then move on to the Spark fundamentals. From there, Alex will teach you about transformations, including filter, pipe, repartition, and distinct. The course also covers actions, key-value pair RDDs, input and output, performance, and running on a cluster. Finally, you will learn advanced topics, including Spark Streaming, DataFrames and SQL, and MLlib.

Once you have completed this computer-based training course, you will have a working knowledge of PySpark. Working files are included, so you can follow along with the author throughout the lessons.

Table of Contents

  1. Introduction
    1. Introduction And Course Overview 00:02:01
    2. About The Author 00:01:02
    3. Installing Python 00:04:38
    4. Installing iPython And Using Notebooks 00:06:28
    5. How To Access Your Working Files 00:01:15
  2. Installing Spark
    1. Download And Setup 00:03:24
    2. Running The Spark Shell 00:05:35
    3. Running The Spark Shell With iPython 00:06:38
  3. Spark Fundamentals
    1. What Is A Resilient Distributed Dataset - RDD? 00:04:54
    2. Reading A Text File 00:03:34
    3. Actions 00:02:13
    4. Transformations 00:02:30
    5. Persisting Data 00:04:11
  4. Transformations
    1. Map 00:03:04
    2. Filter 00:03:56
    3. Flatmap 00:03:16
    4. MapPartitions 00:04:07
    5. MapPartitionsWithIndex 00:01:51
    6. Sample 00:02:36
    7. Union 00:01:11
    8. Intersection 00:01:28
    9. Distinct 00:02:02
    10. Cartesian 00:03:17
    11. Pipe 00:03:40
    12. Coalesce 00:02:12
    13. Repartition 00:02:29
    14. RepartitionAndSortWithinPartitions 00:03:58
  5. Actions
    1. Reduce 00:04:19
    2. Collect 00:01:56
    3. Count 00:03:05
    4. First 00:01:20
    5. Take 00:01:05
    6. TakeSample 00:03:03
    7. TakeOrdered 00:02:10
    8. SaveAsTextFile 00:04:09
    9. CountByKey 00:02:40
    10. ForEach 00:03:11
  6. Key-Value Pair RDDs
    1. GroupByKey 00:02:31
    2. ReduceByKey 00:03:30
    3. AggregateByKey 00:03:44
    4. SortByKey 00:02:47
    5. Join 00:04:16
    6. CoGroup 00:02:09
  7. Input And Output
    1. WholeTextFile 00:03:15
    2. Pickle Files 00:03:59
    3. HadoopInputFormat 00:05:35
    4. HadoopOutputFormat 00:05:31
  8. Performance
    1. Broadcast Variables 00:04:17
    2. Accumulators 00:05:08
    3. Using A Custom Accumulator 00:04:52
    4. Partitioning 00:07:56
  9. Running On A Cluster
    1. Spark Standalone Cluster 00:04:26
    2. Mesos 00:03:38
    3. Yarn 00:02:28
    4. Client Versus Cluster Mode 00:02:41
  10. Advanced Spark
    1. Spark Streaming 00:04:21
    2. Dataframes And SQL 00:03:28
    3. MLlib 00:04:29
  11. Conclusion
    1. Resources And Where To Go From Here 00:01:02
    2. Wrap Up 00:01:28