Data Analytics with Spark Using Python, First edition

Book Description

Spark for Data Professionals introduces and solidifies the concepts behind Spark 2.x, teaching working developers, architects, and data professionals exactly how to build practical Spark solutions. Jeffrey Aven covers all aspects of Spark development, from basic programming to Spark SQL, SparkR, Spark Streaming, messaging, NoSQL, and Hadoop integration. Each chapter presents practical exercises for deploying Spark to your local or cloud environment, plus programming exercises for building real applications. Unlike other Spark guides, Spark for Data Professionals explains crucial concepts step by step, assuming no extensive background as an open source developer. It provides a complete foundation for quickly progressing to more advanced data science and machine learning topics. This guide will help you:

  • Understand Spark basics that will make you a better programmer and cluster “citizen”
  • Master Spark programming techniques that maximize your productivity
  • Choose the right approach for each problem
  • Make the most of built-in platform constructs, including broadcast variables, accumulators, effective partitioning, caching, and checkpointing
  • Leverage powerful tools for managing streaming, structured, semi-structured, and unstructured data

Table of Contents

  1. Cover Page
  2. Title Page
  3. Copyright Page
  4. Contents at a Glance
  5. Table of Contents
  6. About This E-Book
  7. Preface
  8. Introduction
  9. I: Spark Foundations
    1. 1 Introducing Big Data, Hadoop, and Spark
      1. Introduction to Big Data, Distributed Computing, and Hadoop
        1. A Brief History of Big Data and Hadoop
        2. Hadoop Explained
      2. Introduction to Apache Spark
        1. Apache Spark Background
        2. Uses for Spark
        3. Programming Interfaces to Spark
        4. Submission Types for Spark Programs
        5. Input/Output Types for Spark Applications
        6. The Spark RDD
        7. Spark and Hadoop
      3. Functional Programming Using Python
        1. Data Structures Used in Functional Python Programming
        2. Python Object Serialization
        3. Python Functional Programming Basics
      4. Summary
    2. 2 Deploying Spark
      1. Spark Deployment Modes
        1. Local Mode
        2. Spark Standalone
        3. Spark on YARN
        4. Spark on Mesos
      2. Preparing to Install Spark
      3. Getting Spark
      4. Installing Spark on Linux or Mac OS X
      5. Installing Spark on Windows
      6. Exploring the Spark Installation
      7. Deploying a Multi-Node Spark Standalone Cluster
      8. Deploying Spark in the Cloud
        1. Amazon Web Services (AWS)
        2. Google Cloud Platform (GCP)
        3. Databricks
      9. Summary
    3. 3 Understanding the Spark Cluster Architecture
      1. Anatomy of a Spark Application
        1. Spark Driver
        2. Spark Workers and Executors
        3. The Spark Master and Cluster Manager
      2. Spark Applications Using the Standalone Scheduler
        1. Spark Applications Running on YARN
      3. Deployment Modes for Spark Applications Running on YARN
        1. Client Mode
        2. Cluster Mode
        3. Local Mode Revisited
      4. Summary
    4. 4 Learning Spark Programming Basics
      1. Introduction to RDDs
      2. Loading Data into RDDs
        1. Creating an RDD from a File or Files
        2. Methods for Creating RDDs from a Text File or Files
        3. Creating an RDD from an Object File
        4. Creating an RDD from a Data Source
        5. Creating RDDs from JSON Files
        6. Creating an RDD Programmatically
      3. Operations on RDDs
        1. Key RDD Concepts
        2. Basic RDD Transformations
        3. Basic RDD Actions
        4. Transformations on PairRDDs
        5. MapReduce and Word Count Exercise
        6. Join Transformations
        7. Joining Datasets in Spark
        8. Transformations on Sets
        9. Transformations on Numeric RDDs
      4. Summary
  10. II: Beyond the Basics
    1. 5 Advanced Programming Using the Spark Core API
      1. Shared Variables in Spark
        1. Broadcast Variables
        2. Accumulators
        3. Exercise: Using Broadcast Variables and Accumulators
      2. Partitioning Data in Spark
        1. Partitioning Overview
        2. Controlling Partitions
        3. Repartitioning Functions
        4. Partition-Specific or Partition-Aware API Methods
      3. RDD Storage Options
        1. RDD Lineage Revisited
        2. RDD Storage Options
        3. RDD Caching
        4. Persisting RDDs
        5. Choosing When to Persist or Cache RDDs
        6. Checkpointing RDDs
        7. Exercise: Checkpointing RDDs
      4. Processing RDDs with External Programs
      5. Data Sampling with Spark
      6. Understanding Spark Application and Cluster Configuration
        1. Spark Environment Variables
        2. Spark Configuration Properties
      7. Optimizing Spark
        1. Filter Early, Filter Often
        2. Optimizing Associative Operations
        3. Understanding the Impact of Functions and Closures
        4. Considerations for Collecting Data
        5. Configuration Parameters for Tuning and Optimizing Applications
        6. Avoiding Inefficient Partitioning
        7. Diagnosing Application Performance Issues
      8. Summary
    2. 6 SQL and NoSQL Programming with Spark
      1. Introduction to Spark SQL
        1. Introduction to Hive
        2. Spark SQL Architecture
        3. Getting Started with DataFrames
        4. Using DataFrames
        5. Caching, Persisting, and Repartitioning DataFrames
        6. Saving DataFrame Output
        7. Accessing Spark SQL
        8. Exercise: Using Spark SQL
      2. Using Spark with NoSQL Systems
        1. Introduction to NoSQL
        2. Using Spark with HBase
        3. Exercise: Using Spark with HBase
        4. Using Spark with Cassandra
        5. Using Spark with DynamoDB
        6. Other NoSQL Platforms
      3. Summary
    3. 7 Stream Processing and Messaging Using Spark
      1. Introducing Spark Streaming
        1. Spark Streaming Architecture
        2. Introduction to DStreams
        3. Exercise: Getting Started with Spark Streaming
        4. State Operations
        5. Sliding Window Operations
      2. Structured Streaming
        1. Structured Streaming Data Sources
        2. Structured Streaming Data Sinks
        3. Output Modes
        4. Structured Streaming Operations
      3. Using Spark with Messaging Platforms
        1. Apache Kafka
        2. Exercise: Using Spark with Kafka
        3. Amazon Kinesis
      4. Summary
    4. 8 Introduction to Data Science and Machine Learning Using Spark
      1. Spark and R
        1. Introduction to R
        2. Using Spark with R
        3. Exercise: Using RStudio with SparkR
      2. Machine Learning with Spark
        1. Machine Learning Primer
        2. Machine Learning Using Spark MLlib
        3. Exercise: Implementing a Recommender Using Spark MLlib
        4. Machine Learning Using Spark ML
      3. Using Notebooks with Spark
        1. Using Jupyter (IPython) Notebooks with Spark
        2. Using Apache Zeppelin Notebooks with Spark
      4. Summary
  11. Index
  12. Code Snippets