O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Spark: The Definitive Guide

Book Description

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.

You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.

  • Get a gentle overview of big data and Spark
  • Learn about DataFrames, SQL, and Datasets—Spark’s core APIs—through worked examples
  • Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
  • Understand how Spark runs on a cluster
  • Debug, monitor, and tune Spark clusters and applications
  • Learn the power of Structured Streaming, Spark’s stream-processing engine
  • Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Table of Contents

  1. Preface
    1. About the Authors
    2. Who This Book Is For
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Safari
    6. How to Contact Us
    7. Acknowledgments
  2. I. Gentle Overview of Big Data and Spark
  3. 1. What Is Apache Spark?
    1. Apache Spark’s Philosophy
    2. Context: The Big Data Problem
    3. History of Spark
    4. The Present and Future of Spark
    5. Running Spark
      1. Downloading Spark Locally
      2. Launching Spark’s Interactive Consoles
      3. Running Spark in the Cloud
      4. Data Used in This Book
  4. 2. A Gentle Introduction to Spark
    1. Spark’s Basic Architecture
      1. Spark Applications
    2. Spark’s Language APIs
    3. Spark’s APIs
    4. Starting Spark
    5. The SparkSession
    6. DataFrames
      1. Partitions
    7. Transformations
      1. Lazy Evaluation
    8. Actions
    9. Spark UI
    10. An End-to-End Example
      1. DataFrames and SQL
    11. Conclusion
  5. 3. A Tour of Spark’s Toolset
    1. Running Production Applications
    2. Datasets: Type-Safe Structured APIs
    3. Structured Streaming
    4. Machine Learning and Advanced Analytics
    5. Lower-Level APIs
    6. SparkR
    7. Spark’s Ecosystem and Packages
    8. Conclusion
  6. II. Structured APIs—DataFrames, SQL, and Datasets
  7. 4. Structured API Overview
    1. DataFrames and Datasets
    2. Schemas
    3. Overview of Structured Spark Types
      1. DataFrames Versus Datasets
      2. Columns
      3. Rows
      4. Spark Types
    4. Overview of Structured API Execution
      1. Logical Planning
      2. Physical Planning
      3. Execution
    5. Conclusion
  8. 5. Basic Structured Operations
    1. Schemas
    2. Columns and Expressions
      1. Columns
      2. Expressions
    3. Records and Rows
      1. Creating Rows
    4. DataFrame Transformations
      1. Creating DataFrames
      2. select and selectExpr
      3. Converting to Spark Types (Literals)
      4. Adding Columns
      5. Renaming Columns
      6. Reserved Characters and Keywords
      7. Case Sensitivity
      8. Removing Columns
      9. Changing a Column’s Type (cast)
      10. Filtering Rows
      11. Getting Unique Rows
      12. Random Samples
      13. Random Splits
      14. Concatenating and Appending Rows (Union)
      15. Sorting Rows
      16. Limit
      17. Repartition and Coalesce
      18. Collecting Rows to the Driver
    5. Conclusion
  9. 6. Working with Different Types of Data
    1. Where to Look for APIs
    2. Converting to Spark Types
    3. Working with Booleans
    4. Working with Numbers
    5. Working with Strings
      1. Regular Expressions
    6. Working with Dates and Timestamps
    7. Working with Nulls in Data
      1. Coalesce
      2. ifnull, nullIf, nvl, and nvl2
      3. drop
      4. fill
      5. replace
    8. Ordering
    9. Working with Complex Types
      1. Structs
      2. Arrays
      3. split
      4. Array Length
      5. array_contains
      6. explode
      7. Maps
    10. Working with JSON
    11. User-Defined Functions
    12. Conclusion
  10. 7. Aggregations
    1. Aggregation Functions
      1. count
      2. countDistinct
      3. approx_count_distinct
      4. first and last
      5. min and max
      6. sum
      7. sumDistinct
      8. avg
      9. Variance and Standard Deviation
      10. skewness and kurtosis
      11. Covariance and Correlation
      12. Aggregating to Complex Types
    2. Grouping
      1. Grouping with Expressions
      2. Grouping with Maps
    3. Window Functions
    4. Grouping Sets
      1. Rollups
      2. Cube
      3. Grouping Metadata
      4. Pivot
    5. User-Defined Aggregation Functions
    6. Conclusion
  11. 8. Joins
    1. Join Expressions
    2. Join Types
    3. Inner Joins
    4. Outer Joins
    5. Left Outer Joins
    6. Right Outer Joins
    7. Left Semi Joins
    8. Left Anti Joins
    9. Natural Joins
    10. Cross (Cartesian) Joins
    11. Challenges When Using Joins
      1. Joins on Complex Types
      2. Handling Duplicate Column Names
    12. How Spark Performs Joins
      1. Communication Strategies
    13. Conclusion
  12. 9. Data Sources
    1. The Structure of the Data Sources API
      1. Read API Structure
      2. Basics of Reading Data
      3. Write API Structure
      4. Basics of Writing Data
    2. CSV Files
      1. CSV Options
      2. Reading CSV Files
      3. Writing CSV Files
    3. JSON Files
      1. JSON Options
      2. Reading JSON Files
      3. Writing JSON Files
    4. Parquet Files
      1. Reading Parquet Files
      2. Writing Parquet Files
    5. ORC Files
      1. Reading Orc Files
      2. Writing Orc Files
    6. SQL Databases
      1. Reading from SQL Databases
      2. Query Pushdown
      3. Writing to SQL Databases
    7. Text Files
      1. Reading Text Files
      2. Writing Text Files
    8. Advanced I/O Concepts
      1. Splittable File Types and Compression
      2. Reading Data in Parallel
      3. Writing Data in Parallel
      4. Writing Complex Types
      5. Managing File Size
    9. Conclusion
  13. 10. Spark SQL
    1. What Is SQL?
    2. Big Data and SQL: Apache Hive
    3. Big Data and SQL: Spark SQL
      1. Spark’s Relationship to Hive
    4. How to Run Spark SQL Queries
      1. Spark SQL CLI
      2. Spark’s Programmatic SQL Interface
      3. SparkSQL Thrift JDBC/ODBC Server
    5. Catalog
    6. Tables
      1. Spark-Managed Tables
      2. Creating Tables
      3. Creating External Tables
      4. Inserting into Tables
      5. Describing Table Metadata
      6. Refreshing Table Metadata
      7. Dropping Tables
      8. Caching Tables
    7. Views
      1. Creating Views
      2. Dropping Views
    8. Databases
      1. Creating Databases
      2. Setting the Database
      3. Dropping Databases
    9. Select Statements
      1. case…when…then Statements
    10. Advanced Topics
      1. Complex Types
      2. Functions
      3. Subqueries
    11. Miscellaneous Features
      1. Configurations
      2. Setting Configuration Values in SQL
    12. Conclusion
  14. 11. Datasets
    1. When to Use Datasets
    2. Creating Datasets
      1. In Java: Encoders
      2. In Scala: Case Classes
    3. Actions
    4. Transformations
      1. Filtering
      2. Mapping
    5. Joins
    6. Grouping and Aggregations
    7. Conclusion
  15. III. Low-Level APIs
  16. 12. Resilient Distributed Datasets (RDDs)
    1. What Are the Low-Level APIs?
      1. When to Use the Low-Level APIs?
      2. How to Use the Low-Level APIs?
    2. About RDDs
      1. Types of RDDs
      2. When to Use RDDs?
      3. Datasets and RDDs of Case Classes
    3. Creating RDDs
      1. Interoperating Between DataFrames, Datasets, and RDDs
      2. From a Local Collection
      3. From Data Sources
    4. Manipulating RDDs
    5. Transformations
      1. distinct
      2. filter
      3. map
      4. sort
      5. Random Splits
    6. Actions
      1. reduce
      2. count
      3. first
      4. max and min
      5. take
    7. Saving Files
      1. saveAsTextFile
      2. SequenceFiles
      3. Hadoop Files
    8. Caching
    9. Checkpointing
    10. Pipe RDDs to System Commands
      1. mapPartitions
      2. foreachPartition
      3. glom
    11. Conclusion
  17. 13. Advanced RDDs
    1. Key-Value Basics (Key-Value RDDs)
      1. keyBy
      2. Mapping over Values
      3. Extracting Keys and Values
      4. lookup
      5. sampleByKey
    2. Aggregations
      1. countByKey
      2. Understanding Aggregation Implementations
      3. Other Aggregation Methods
    3. CoGroups
    4. Joins
      1. Inner Join
      2. zips
    5. Controlling Partitions
      1. coalesce
      2. repartition
      3. repartitionAndSortWithinPartitions
      4. Custom Partitioning
    6. Custom Serialization
    7. Conclusion
  18. 14. Distributed Shared Variables
    1. Broadcast Variables
    2. Accumulators
      1. Basic Example
      2. Custom Accumulators
    3. Conclusion
  19. IV. Production Applications
  20. 15. How Spark Runs on a Cluster
    1. The Architecture of a Spark Application
      1. Execution Modes
    2. The Life Cycle of a Spark Application (Outside Spark)
      1. Client Request
      2. Launch
      3. Execution
      4. Completion
    3. The Life Cycle of a Spark Application (Inside Spark)
      1. The SparkSession
      2. Logical Instructions
      3. A Spark Job
      4. Stages
      5. Tasks
    4. Execution Details
      1. Pipelining
      2. Shuffle Persistence
    5. Conclusion
  21. 16. Developing Spark Applications
    1. Writing Spark Applications
      1. A Simple Scala-Based App
      2. Writing Python Applications
      3. Writing Java Applications
    2. Testing Spark Applications
      1. Strategic Principles
      2. Tactical Takeaways
      3. Connecting to Unit Testing Frameworks
      4. Connecting to Data Sources
    3. The Development Process
    4. Launching Applications
      1. Application Launch Examples
    5. Configuring Applications
      1. The SparkConf
      2. Application Properties
      3. Runtime Properties
      4. Execution Properties
      5. Configuring Memory Management
      6. Configuring Shuffle Behavior
      7. Environmental Variables
      8. Job Scheduling Within an Application
    6. Conclusion
  22. 17. Deploying Spark
    1. Where to Deploy Your Cluster to Run Spark Applications
      1. On-Premises Cluster Deployments
      2. Spark in the Cloud
    2. Cluster Managers
      1. Standalone Mode
      2. Spark on YARN
      3. Configuring Spark on YARN Applications
      4. Spark on Mesos
      5. Secure Deployment Configurations
      6. Cluster Networking Configurations
      7. Application Scheduling
    3. Miscellaneous Considerations
    4. Conclusion
  23. 18. Monitoring and Debugging
    1. The Monitoring Landscape
    2. What to Monitor
      1. Driver and Executor Processes
      2. Queries, Jobs, Stages, and Tasks
    3. Spark Logs
    4. The Spark UI
      1. Spark REST API
      2. Spark UI History Server
    5. Debugging and Spark First Aid
      1. Spark Jobs Not Starting
      2. Errors Before Execution
      3. Errors During Execution
      4. Slow Tasks or Stragglers
      5. Slow Aggregations
      6. Slow Joins
      7. Slow Reads and Writes
      8. Driver OutOfMemoryError or Driver Unresponsive
      9. Executor OutOfMemoryError or Executor Unresponsive
      10. Unexpected Nulls in Results
      11. No Space Left on Disk Errors
      12. Serialization Errors
    6. Conclusion
  24. 19. Performance Tuning
    1. Indirect Performance Enhancements
      1. Design Choices
      2. Object Serialization in RDDs
      3. Cluster Configurations
      4. Scheduling
      5. Data at Rest
      6. Shuffle Configurations
      7. Memory Pressure and Garbage Collection
    2. Direct Performance Enhancements
      1. Parallelism
      2. Improved Filtering
      3. Repartitioning and Coalescing
      4. User-Defined Functions (UDFs)
      5. Temporary Data Storage (Caching)
      6. Joins
      7. Aggregations
      8. Broadcast Variables
    3. Conclusion
  25. V. Streaming
  26. 20. Stream Processing Fundamentals
    1. What Is Stream Processing?
      1. Stream Processing Use Cases
      2. Advantages of Stream Processing
      3. Challenges of Stream Processing
    2. Stream Processing Design Points
      1. Record-at-a-Time Versus Declarative APIs
      2. Event Time Versus Processing Time
      3. Continuous Versus Micro-Batch Execution
    3. Spark’s Streaming APIs
      1. The DStream API
      2. Structured Streaming
    4. Conclusion
  27. 21. Structured Streaming Basics
    1. Structured Streaming Basics
    2. Core Concepts
      1. Transformations and Actions
      2. Input Sources
      3. Sinks
      4. Output Modes
      5. Triggers
      6. Event-Time Processing
    3. Structured Streaming in Action
    4. Transformations on Streams
      1. Selections and Filtering
      2. Aggregations
      3. Joins
    5. Input and Output
      1. Where Data Is Read and Written (Sources and Sinks)
      2. Reading from the Kafka Source
      3. Writing to the Kafka Sink
      4. How Data Is Output (Output Modes)
      5. When Data Is Output (Triggers)
    6. Streaming Dataset API
    7. Conclusion
  28. 22. Event-Time and Stateful Processing
    1. Event Time
    2. Stateful Processing
    3. Arbitrary Stateful Processing
    4. Event-Time Basics
    5. Windows on Event Time
      1. Tumbling Windows
      2. Handling Late Data with Watermarks
    6. Dropping Duplicates in a Stream
    7. Arbitrary Stateful Processing
      1. Time-Outs
      2. Output Modes
      3. mapGroupsWithState
      4. flatMapGroupsWithState
    8. Conclusion
  29. 23. Structured Streaming in Production
    1. Fault Tolerance and Checkpointing
    2. Updating Your Application
      1. Updating Your Streaming Application Code
      2. Updating Your Spark Version
      3. Sizing and Rescaling Your Application
    3. Metrics and Monitoring
      1. Query Status
      2. Recent Progress
      3. Spark UI
    4. Alerting
    5. Advanced Monitoring with the Streaming Listener
    6. Conclusion
  30. VI. Advanced Analytics and Machine Learning
  31. 24. Advanced Analytics and Machine Learning Overview
    1. A Short Primer on Advanced Analytics
      1. Supervised Learning
      2. Recommendation
      3. Unsupervised Learning
      4. Graph Analytics
      5. The Advanced Analytics Process
    2. Spark’s Advanced Analytics Toolkit
      1. What Is MLlib?
    3. High-Level MLlib Concepts
    4. MLlib in Action
      1. Feature Engineering with Transformers
      2. Estimators
      3. Pipelining Our Workflow
      4. Training and Evaluation
      5. Persisting and Applying Models
    5. Deployment Patterns
    6. Conclusion
  32. 25. Preprocessing and Feature Engineering
    1. Formatting Models According to Your Use Case
    2. Transformers
    3. Estimators for Preprocessing
      1. Transformer Properties
    4. High-Level Transformers
      1. RFormula
      2. SQL Transformers
      3. VectorAssembler
    5. Working with Continuous Features
      1. Bucketing
      2. Scaling and Normalization
      3. StandardScaler
    6. Working with Categorical Features
      1. StringIndexer
      2. Converting Indexed Values Back to Text
      3. Indexing in Vectors
      4. One-Hot Encoding
    7. Text Data Transformers
      1. Tokenizing Text
      2. Removing Common Words
      3. Creating Word Combinations
      4. Converting Words into Numerical Representations
      5. Word2Vec
    8. Feature Manipulation
      1. PCA
      2. Interaction
      3. Polynomial Expansion
    9. Feature Selection
      1. ChiSqSelector
    10. Advanced Topics
      1. Persisting Transformers
    11. Writing a Custom Transformer
    12. Conclusion
  33. 26. Classification
    1. Use Cases
    2. Types of Classification
      1. Binary Classification
      2. Multiclass Classification
      3. Multilabel Classification
    3. Classification Models in MLlib
      1. Model Scalability
    4. Logistic Regression
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
      4. Example
      5. Model Summary
    5. Decision Trees
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
    6. Random Forest and Gradient-Boosted Trees
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
    7. Naive Bayes
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
    8. Evaluators for Classification and Automating Model Tuning
    9. Detailed Evaluation Metrics
    10. One-vs-Rest Classifier
    11. Multilayer Perceptron
    12. Conclusion
  34. 27. Regression
    1. Use Cases
    2. Regression Models in MLlib
      1. Model Scalability
    3. Linear Regression
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
      4. Training Summary
    4. Generalized Linear Regression
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
      4. Example
      5. Training Summary
    5. Decision Trees
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
    6. Random Forests and Gradient-Boosted Trees
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
    7. Advanced Methods
      1. Survival Regression (Accelerated Failure Time)
      2. Isotonic Regression
    8. Evaluators and Automating Model Tuning
    9. Metrics
    10. Conclusion
  35. 28. Recommendation
    1. Use Cases
    2. Collaborative Filtering with Alternating Least Squares
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
      4. Example
    3. Evaluators for Recommendation
    4. Metrics
      1. Regression Metrics
      2. Ranking Metrics
    5. Frequent Pattern Mining
    6. Conclusion
  36. 29. Unsupervised Learning
    1. Use Cases
    2. Model Scalability
    3. k-means
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
      4. k-means Metrics Summary
    4. Bisecting k-means
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
      4. Bisecting k-means Summary
    5. Gaussian Mixture Models
      1. Model Hyperparameters
      2. Training Parameters
      3. Example
      4. Gaussian Mixture Model Summary
    6. Latent Dirichlet Allocation
      1. Model Hyperparameters
      2. Training Parameters
      3. Prediction Parameters
      4. Example
    7. Conclusion
  37. 30. Graph Analytics
    1. Building a Graph
    2. Querying the Graph
      1. Subgraphs
    3. Motif Finding
    4. Graph Algorithms
      1. PageRank
      2. In-Degree and Out-Degree Metrics
      3. Breadth-First Search
      4. Connected Components
      5. Strongly Connected Components
      6. Advanced Tasks
    5. Conclusion
  38. 31. Deep Learning
    1. What Is Deep Learning?
    2. Ways of Using Deep Learning in Spark
    3. Deep Learning Libraries
      1. MLlib Neural Network Support
      2. TensorFrames
      3. BigDL
      4. TensorFlowOnSpark
      5. DeepLearning4J
      6. Deep Learning Pipelines
    4. A Simple Example with Deep Learning Pipelines
      1. Setup
      2. Images and DataFrames
      3. Transfer Learning
      4. Applying Popular Models
    5. Conclusion
  39. VII. Ecosystem
  40. 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
    1. PySpark
      1. Fundamental PySpark Differences
      2. Pandas Integration
    2. R on Spark
      1. SparkR
      2. sparklyr
    3. Conclusion
  41. 33. Ecosystem and Community
    1. Spark Packages
      1. An Abridged List of Popular Packages
      2. Using Spark Packages
      3. External Packages
    2. Community
      1. Spark Summit
      2. Local Meetups
    3. Conclusion
  42. Index