Learning Path: Data Science With Apache Spark 2

Video Description

Get started with Spark for data processing and data science

In Detail

Spark is one of the most widely used large-scale data processing engines, and it runs extremely fast. Its tools are equally useful to application developers and data scientists.

This Learning Path begins with an introduction to Apache Spark. We first cover the basics of Spark and introduce SparkR, then look at the charting and plotting features of Python in conjunction with Spark data processing, and finally turn to Spark's data processing libraries. We then develop a real-world Spark application. Next, we help you become comfortable and confident working with Spark for data science by exploring its data science libraries on a dataset of tweets.

Begin your journey into fast, large-scale, and distributed data processing using Spark with this Learning Path.

Prerequisites: Basic knowledge of either Python or R

Resources: Code downloads and errata:

  • Apache Spark 2 for Beginners

  • Data Science with Spark

Path Products

This path navigates across the following products (in sequential order):

  • Apache Spark 2 for Beginners (5h 38m)

  • Data Science with Spark (3h 20m)

Table of Contents

    1. Chapter 1: Apache Spark 2 for Beginners
      1. The Course Overview 00:04:30
      2. An Overview of Apache Hadoop 00:05:50
      3. Understanding Apache Spark 00:05:14
      4. Installing Spark on Your Machines 00:13:49
      5. Functional Programming with Spark and Understanding Spark RDD 00:08:45
      6. Data Transformations and Actions with RDDs 00:05:22
      7. Monitoring with Spark 00:04:02
      8. The Basics of Programming with Spark 00:20:30
      9. Creating RDDs from Files and Understanding the Spark Library Stack 00:06:39
      10. Understanding the Structure of Data and the Need of Spark SQL 00:09:39
      11. Anatomy of Spark SQL 00:05:09
      12. DataFrame Programming 00:12:01
      13. Understanding Aggregations and Multi-Datasource Joining with SparkSQL 00:08:33
      14. Introducing Datasets and Understanding Data Catalogs 00:07:53
      15. The Need for Spark and the Basics of the R Language 00:08:09
      16. DataFrames in R and Spark 00:02:57
      17. Spark DataFrame Programming with R 00:04:43
      18. Understanding Aggregations and Multi-Datasource Joins in SparkR 00:04:12
      19. Charting and Plotting Libraries and Setting Up a Dataset 00:04:00
      20. Charts, Plots, and Histograms 00:05:36
      21. Bar Chart and Pie Chart 00:07:46
      22. Scatter Plot and Line Graph 00:04:53
      23. Data Stream Processing and Micro Batch Data Processing 00:08:36
      24. A Log Event Processor 00:16:22
      25. Windowed Data Processing and More Processing Options 00:07:27
      26. Kafka Stream Processing 00:10:44
      27. Spark Streaming Jobs in Production 00:09:09
      28. Understanding Machine Learning and the Need of Spark for it 00:06:22
      29. Wine Quality Prediction and Model Persistence 00:10:44
      30. Wine Classification 00:05:58
      31. Spam Filtering 00:07:08
      32. Feature Algorithms and Finding Synonyms 00:06:54
      33. Understanding Graphs with Their Usage 00:04:35
      34. The Spark GraphX Library 00:10:09
      35. Graph Processing and Graph Structure Processing 00:09:45
      36. Tennis Tournament Analysis 00:05:34
      37. Applying PageRank Algorithm 00:03:30
      38. Connected Component Algorithm 00:04:39
      39. Understanding GraphFrames and Its Queries 00:09:31
      40. Lambda Architecture 00:04:47
      41. Micro Blogging with Lambda Architecture 00:07:13
      42. Implementing Lambda Architecture and Working with Spark Applications 00:08:19
      43. Coding Style, Setting Up the Source Code, and Understanding Data Ingestion 00:09:09
      44. Generating Purposed Views and Queries 00:05:53
      45. Understanding Custom Data Processes 00:06:12
    2. Chapter 2: Data Science with Spark
      1. The Course Overview 00:03:55
      2. Spark: Origins and Ecosystem for Big Data Scientists, the Scala, Python, and R flavors 00:04:41
      3. Install Spark on Your Laptop with Docker, or Scale Fast in the Cloud 00:04:41
      4. Apache Zeppelin, a Web-Based Notebook for Spark with matplotlib and ggplot2 00:03:08
      5. Manipulating Data with the Core RDD API 00:08:16
      6. Using Dataframe, Dataset, and SQL – Natural and Easy! 00:06:36
      7. Manipulating Rows and Columns 00:04:50
      8. Dealing with File Format 00:02:17
      9. Visualizing More – ggplot2, matplotlib, and Angular.js at the Rescue 00:03:32
      10. Discovering spark.ml and spark.mllib - and Other Libraries 00:08:02
      11. Wrapping Up Basic Statistics and Linear Algebra 00:09:58
      12. Cleansing Data and Engineering the Features 00:05:04
      13. Reducing the Dimensionality 00:04:09
      14. Pipeline for a Life 00:03:58
      15. Streaming Tweets to Disk 00:05:37
      16. Streaming Tweets on a Map 00:04:05
      17. Cleansing and Building Your Reference Dataset 00:05:13
      18. Querying and Visualizing Tweets with SQL 00:04:16
      19. Indicators, Correlations, and Sampling 00:07:17
      20. Validating Statistical Relevance 00:03:32
      21. Running SVD and PCA 00:04:04
      22. Extending the Basic Statistics for Your Needs 00:04:19
      23. Analyzing Free Text from the Tweets 00:07:23
      24. Dealing with Stemming, Syntax, Idioms and Hashtags 00:05:24
      25. Detecting Tweet Sentiment 00:03:28
      26. Identifying Topics with LDA 00:03:06
      27. Word Cloudify Your Dataset 00:05:31
      28. Locating Users and Displaying Heatmaps with GeoHash 00:04:15
      29. Collaborating on the Same Note with Peers 00:04:57
      30. Create Visual Dashboards for Your Business Stakeholders 00:03:56
      31. Building the Training and Test Datasets 00:07:25
      32. Training a Logistic Regression Model 00:03:55
      33. Evaluating Your Classifier 00:05:32
      34. Selecting Your Model 00:05:19
      35. Clustering Users by Followers and Friends 00:05:12
      36. Clustering Users by Location 00:02:48
      37. Running KMeans on a Stream 00:02:30
      38. Recommending Similar Users 00:05:11
      39. Analyzing Mentions with GraphX 00:06:22
      40. Where to Go from Here 00:06:21