Learning Path: Getting Up and Running with Apache Spark

Video Description

Spark is a powerful distributed computing engine for big data that has emerged as a leading tool in the industry, with a focus on improving efficiency and usability. The tutorials and sessions in this Learning Path teach you about the Spark 2.0 libraries, offer tips and tricks for deploying Spark in production and at scale, and show you how to get up and running with Spark so you can write your own Spark applications.

Table of Contents

  1. The state of Spark and where it is going in 2016 - Reynold Xin (Databricks) 00:39:02
  2. Overview
    1. Overview 00:02:06
  3. Spark Datasets and Structured Streaming
    1. Spark Overview 00:02:11
    2. Spark Wordcount Using RDD Example 00:05:01
    3. Spark Wordcount Using Scala Example 00:02:37
    4. Spark and Datasets 00:01:56
    5. Spark Wordcount Using Datasets Example 00:03:06
    6. Joining Data Using Spark Datasets 00:03:32
    7. Structured Streaming Overview 00:03:18
    8. Spark Structured Streaming Wordcount Example 00:03:20
  4. Spark Structured Streaming
    1. Spark Structured Streaming 00:00:46
    2. Netcat Socket Structured Streaming Example 00:02:27
    3. Socket Structured Streaming Example 00:02:55
    4. Spark Structured Streaming Parsing Data 00:02:56
    5. Constructing Columns in Structured Streaming 00:02:47
    6. Selecting and Filtering Columns Using Structured Streaming 00:02:07
    7. GroupBy and Aggregation in Structured Streaming 00:03:33
    8. Joining Structured Stream with Datasets 00:03:39
    9. SQL Queries in Spark Structured Streaming 00:02:19
  5. DStream Comparison
    1. Comparing Structured Streaming with DStream 00:03:39
    2. Custom Receivers in Spark DStream 00:02:18
    3. Iterative Wordcount Using Spark DStream 00:03:30
    4. Cumulative Wordcount using Spark DStream 00:06:31
    5. Benefits of Spark Tungsten 00:04:43
    6. Tungsten Performance Benefit Demonstration 00:02:58
    7. Benefits of Spark Catalyst 00:03:18
    8. Viewing Query Plans in Spark Shell 00:01:36
    9. Visualizing Query Stages in Spark UI Viewer 00:00:51
    10. Viewing Spark Catalyst-Optimized Physical Plans 00:02:56
  6. Standalone Spark Streaming Applications
    1. Writing Standalone Spark Streaming Applications 00:01:03
    2. Two Environments for Running Spark 00:01:57
    3. Spark Streaming Standalone Code - Meetup Events Example 00:07:37
    4. Scala Build Tool (SBT) and Spark 00:06:01
    5. Compiling and Building a Standalone Spark Application 00:04:29
    6. Spark Twitter Streaming Example 00:03:54
    7. Not your father's database: How to use Apache Spark properly in your big data architecture - Vida Ha (Databricks) 00:38:20
    8. Beyond shuffling: Tips and tricks for scaling Spark jobs - Holden Karau (IBM) 00:41:25
    9. Top five mistakes when writing Spark applications - Ted Malaska (Cloudera) and Mark Grover (Cloudera) 00:38:56
  7. Introduction
    1. Introduction and Course Overview 00:04:10
    2. About the Author 00:00:35
    3. Spark’s concepts and approach 00:06:04
    4. Resilient Distributed Datasets (RDD) 00:05:03
    5. Creating a Project in IDEA 00:02:54
    6. How To Access Your Working Files 00:01:15
  8. Spark Core API & Best practices
    1. Base RDD 00:06:46
    2. Transformations 00:05:35
    3. Actions - Part 1 00:01:40
    4. Actions - Part 2 00:02:42
    5. Hadoop Combiners In Spark 00:04:52
    6. Directed Acyclic Graph And Lazy Evaluation 00:07:20
    7. Joins 00:06:15
  9. Closure serialization
    1. How the magic of Spark works 00:07:30
    2. Serializers and how to change them 00:04:10
  10. Shared variables and performance
    1. Broadcast 00:04:07
    2. Accumulators 00:05:05
    3. Caching & Persistence 00:09:22
  11. Spark SQL
    1. Spark SQL 00:12:32
    2. Inferring A Schema 00:07:38
    3. Applying A Schema 00:06:27
    4. Loading And Writing 00:06:07
    5. SQL Caching And UDF 00:08:48
  12. Spark MLLib
    1. Spark MLLib And Supervised Example - SVM 00:10:02
    2. Unsupervised With Iris Dataset - KMeans 00:08:54
  13. Spark GraphX
    1. Graph Construction 00:07:06
    2. Graph Algorithms 00:06:52
  14. Spark Streaming
    1. Streaming And The Microbatch 00:13:57
    2. Mutable Transformations And Checkpointing 00:09:07
    3. Windows And RDD Transformations 00:08:43
    4. Streaming With Spark SQL, MLLib And Core 00:12:28
  15. Deployment and Infrastructure
    1. Cluster Managers And Submission - Standalone, Mesos And Yarn 00:13:20
  16. Conclusion
    1. Resources And Where To Go From Here 00:04:06
  17. Introduction
    1. Introduction And Course Overview 00:02:01
    2. About The Author 00:01:02
    3. Installing Python 00:04:38
    4. Installing iPython And Using Notebooks 00:06:28
    5. How To Access Your Working Files 00:01:15
  18. Installing Spark
    1. Download And Setup 00:03:24
    2. Running The Spark Shell 00:05:35
    3. Running The Spark Shell With iPython 00:06:38
  19. Spark Fundamentals
    1. What Is A Resilient Distributed Dataset - RDD? 00:04:54
    2. Reading A Text File 00:03:34
    3. Actions 00:02:13
    4. Transformations 00:02:30
    5. Persisting Data 00:04:11
  20. Transformations
    1. Map 00:03:04
    2. Filter 00:03:56
    3. Flatmap 00:03:16
    4. MapPartitions 00:04:07
    5. MapPartitionsWithIndex 00:01:51
    6. Sample 00:02:36
    7. Union 00:01:11
    8. Intersection 00:01:28
    9. Distinct 00:02:02
    10. Cartesian 00:03:17
    11. Pipe 00:03:40
    12. Coalesce 00:02:12
    13. Repartition 00:02:29
    14. RepartitionAndSortWithinPartitions 00:03:58
  21. Actions
    1. Reduce 00:04:19
    2. Collect 00:01:56
    3. Count 00:03:05
    4. First 00:01:20
    5. Take 00:01:05
    6. TakeSample 00:03:03
    7. TakeOrdered 00:02:10
    8. SaveAsTextFile 00:04:09
    9. CountByKey 00:02:40
    10. ForEach 00:03:11
  22. Key-Value Pair RDDs
    1. GroupByKey 00:02:31
    2. ReduceByKey 00:03:30
    3. AggregateByKey 00:03:44
    4. SortByKey 00:02:47
    5. Join 00:04:16
    6. CoGroup 00:02:09
  23. Input And Output
    1. WholeTextFile 00:03:15
    2. Pickle Files 00:03:59
    3. HadoopInputFormat 00:05:35
    4. HadoopOutputFormat 00:05:31
  24. Performance
    1. Broadcast Variables 00:04:17
    2. Accumulators 00:05:08
    3. Using A Custom Accumulator 00:04:52
    4. Partitioning 00:07:56
  25. Running On A Cluster
    1. Spark Standalone Cluster 00:04:26
    2. Mesos 00:03:38
    3. Yarn 00:02:28
    4. Client Versus Cluster Mode 00:02:41
  26. Advanced Spark
    1. Spark Streaming 00:04:21
    2. Dataframes And SQL 00:03:28
    3. MLlib 00:04:29
  27. Conclusion
    1. Resources And Where To Go From Here 00:01:02
    2. Wrap Up 00:01:28
  28. Introduction
    1. Welcome To The Course 00:00:45
    2. About The Author 00:00:28
    3. Course Curriculum Overview 00:00:33
  29. Overview Of Spark 2.0
    1. What Is Spark 00:01:08
    2. Why Spark 2.0 DataFrames 00:00:29
  30. DataFrame Basics
    1. Jupyter Notebook Overview 00:02:57
    2. Python Review Part One 00:08:11
    3. Python Review Part Two 00:08:09
    4. Creating A DataFrame 00:02:29
    5. Data Input 00:03:01
    6. Data Output 00:03:00
    7. Getting DataFrame Information 00:02:17
    8. Selecting Columns And Rows 00:03:00
    9. Creating and Renaming Columns 00:03:57
    10. Using SQL With DataFrames 00:02:27
    11. Filtering The Data 00:04:52
  31. Spark DataFrame Dates And Timestamps
    1. Introduction To Date And Timestamps 00:00:15
    2. Working With Dates 00:04:22
    3. Working With Timestamps 00:04:03
  32. Spark DataFrame Aggregate Operations
    1. Introduction To Aggregate And GroupBy Concepts 00:00:24
    2. Spark GroupBy Method 00:03:08
    3. Spark Built In Aggregate Methods 00:03:25
    4. Sorting And Ordering 00:01:34
  33. Spark DataFrame Working With Missing Data
    1. Introduction To Missing Data 00:00:21
    2. Dropping Data 00:04:25
    3. Filling Missing Data 00:02:46
  34. Spark DataFrame Exercises
    1. Introduction To Exercises 00:00:34
    2. Exercise Solutions 00:04:05
  35. Thank You
    1. What Is Next And Where To Go From Here 00:00:24
  36. Introduction
    1. Welcome To The Course 00:01:32
    2. About The Author 00:01:32
  37. Introducing Apache Spark 2.0
    1. What Is Apache Spark 00:07:40
    2. Getting Started With Apache Spark 00:03:03
    3. Spark Jobs And APIs 00:06:23
  38. Spark 2.0 Simplicity: Unifying Datasets And Dataframes
    1. Unified API And Spark Session 00:06:38
    2. Spark MLlib - A Primer On ML Pipelines 00:07:58
  39. Spark 2.0 Speed: Tungsten Phase 2
    1. Improving Spark Performance With The Push Toward Whole-Stage Code Generation 00:05:51
  40. Spark 2.0 Intelligence: Structured Streaming
    1. Quick Refresh Of Spark Streaming 00:07:08
    2. Introducing Structured Streaming 00:04:40
  41. Conclusion
    1. Wrap Up And Thank You 00:01:02