Learning Spark

Book Description

Data in all domains is getting bigger. How can you work with it efficiently? This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. Written by the developers of Spark, this book will have data scientists and engineers up and running in no time.

Table of Contents

  1. Foreword
  2. Preface
    1. Audience
    2. How This Book Is Organized
    3. Supporting Books
    4. Conventions Used in This Book
    5. Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments
  3. 1. Introduction to Data Analysis with Spark
    1. What Is Apache Spark?
    2. A Unified Stack
      1. Spark Core
      2. Spark SQL
      3. Spark Streaming
      4. MLlib
      5. GraphX
      6. Cluster Managers
    3. Who Uses Spark, and for What?
      1. Data Science Tasks
      2. Data Processing Applications
    4. A Brief History of Spark
    5. Spark Versions and Releases
    6. Storage Layers for Spark
  4. 2. Downloading Spark and Getting Started
    1. Downloading Spark
    2. Introduction to Spark’s Python and Scala Shells
    3. Introduction to Core Spark Concepts
    4. Standalone Applications
      1. Initializing a SparkContext
      2. Building Standalone Applications
    5. Conclusion
  5. 3. Programming with RDDs
    1. RDD Basics
    2. Creating RDDs
    3. RDD Operations
      1. Transformations
      2. Actions
      3. Lazy Evaluation
    4. Passing Functions to Spark
      1. Python
      2. Scala
      3. Java
    5. Common Transformations and Actions
      1. Basic RDDs
      2. Converting Between RDD Types
    6. Persistence (Caching)
    7. Conclusion
  6. 4. Working with Key/Value Pairs
    1. Motivation
    2. Creating Pair RDDs
    3. Transformations on Pair RDDs
      1. Aggregations
      2. Grouping Data
      3. Joins
      4. Sorting Data
    4. Actions Available on Pair RDDs
    5. Data Partitioning (Advanced)
      1. Determining an RDD’s Partitioner
      2. Operations That Benefit from Partitioning
      3. Operations That Affect Partitioning
      4. Example: PageRank
      5. Custom Partitioners
    6. Conclusion
  7. 5. Loading and Saving Your Data
    1. Motivation
    2. File Formats
      1. Text Files
      2. JSON
      3. Comma-Separated Values and Tab-Separated Values
      4. SequenceFiles
      5. Object Files
      6. Hadoop Input and Output Formats
      7. File Compression
    3. Filesystems
      1. Local/“Regular” FS
      2. Amazon S3
      3. HDFS
    4. Structured Data with Spark SQL
      1. Apache Hive
      2. JSON
    5. Databases
      1. Java Database Connectivity
      2. Cassandra
      3. HBase
      4. Elasticsearch
    6. Conclusion
  8. 6. Advanced Spark Programming
    1. Introduction
    2. Accumulators
      1. Accumulators and Fault Tolerance
      2. Custom Accumulators
    3. Broadcast Variables
      1. Optimizing Broadcasts
    4. Working on a Per-Partition Basis
    5. Piping to External Programs
    6. Numeric RDD Operations
    7. Conclusion
  9. 7. Running on a Cluster
    1. Introduction
    2. Spark Runtime Architecture
      1. The Driver
      2. Executors
      3. Cluster Manager
      4. Launching a Program
      5. Summary
    3. Deploying Applications with spark-submit
    4. Packaging Your Code and Dependencies
      1. A Java Spark Application Built with Maven
      2. A Scala Spark Application Built with sbt
      3. Dependency Conflicts
    5. Scheduling Within and Between Spark Applications
    6. Cluster Managers
      1. Standalone Cluster Manager
      2. Hadoop YARN
      3. Apache Mesos
      4. Amazon EC2
    7. Which Cluster Manager to Use?
    8. Conclusion
  10. 8. Tuning and Debugging Spark
    1. Configuring Spark with SparkConf
    2. Components of Execution: Jobs, Tasks, and Stages
    3. Finding Information
      1. Spark Web UI
      2. Driver and Executor Logs
    4. Key Performance Considerations
      1. Level of Parallelism
      2. Serialization Format
      3. Memory Management
      4. Hardware Provisioning
    5. Conclusion
  11. 9. Spark SQL
    1. Linking with Spark SQL
    2. Using Spark SQL in Applications
      1. Initializing Spark SQL
      2. Basic Query Example
      3. SchemaRDDs
      4. Caching
    3. Loading and Saving Data
      1. Apache Hive
      2. Parquet
      3. JSON
      4. From RDDs
    4. JDBC/ODBC Server
      1. Working with Beeline
      2. Long-Lived Tables and Queries
    5. User-Defined Functions
      1. Spark SQL UDFs
      2. Hive UDFs
    6. Spark SQL Performance
      1. Performance Tuning Options
    7. Conclusion
  12. 10. Spark Streaming
    1. A Simple Example
    2. Architecture and Abstraction
    3. Transformations
      1. Stateless Transformations
      2. Stateful Transformations
    4. Output Operations
    5. Input Sources
      1. Core Sources
      2. Additional Sources
      3. Multiple Sources and Cluster Sizing
    6. 24/7 Operation
      1. Checkpointing
      2. Driver Fault Tolerance
      3. Worker Fault Tolerance
      4. Receiver Fault Tolerance
      5. Processing Guarantees
    7. Streaming UI
    8. Performance Considerations
      1. Batch and Window Sizes
      2. Level of Parallelism
      3. Garbage Collection and Memory Usage
    9. Conclusion
  13. 11. Machine Learning with MLlib
    1. Overview
    2. System Requirements
    3. Machine Learning Basics
      1. Example: Spam Classification
    4. Data Types
      1. Working with Vectors
    5. Algorithms
      1. Feature Extraction
      2. Statistics
      3. Classification and Regression
      4. Clustering
      5. Collaborative Filtering and Recommendation
      6. Dimensionality Reduction
      7. Model Evaluation
    6. Tips and Performance Considerations
      1. Preparing Features
      2. Configuring Algorithms
      3. Caching RDDs to Reuse
      4. Recognizing Sparsity
      5. Level of Parallelism
    7. Pipeline API
    8. Conclusion
  14. Index