High Performance Spark

Book Description

If you’ve successfully used Apache Spark to solve medium-sized problems, but still struggle to realize the "Spark promise" of unparalleled performance on big data, this book is for you. High Performance Spark shows you how to take advantage of Spark at scale, so you can grow beyond the novice level. It’s ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications.

Table of Contents

  1. Preface
    1. Who Is This Book For?
    2. Early Release Note
    3. Supporting Books & Materials
    4. Conventions Used in this Book
    5. Using Code Examples
    6. O’Reilly Safari
    7. How to Contact the Authors
    8. How to Contact Us
    9. Acknowledgments
  2. 1. Introduction to High Performance Spark
    1. What is Spark and Why Performance Matters
    2. What You Can Expect to Get from This Book
    3. Spark Versions
    4. Why Scala?
    5. Conclusion
  3. 2. How Spark Works
    1. How Spark Fits into the Big Data Ecosystem
      1. Spark Components
    2. Spark Model of Parallel Computing: RDDs
      1. Lazy Evaluation
      2. In-Memory Persistence and Memory Management
      3. Immutability and the RDD Interface
      4. Types of RDDs
      5. Functions on RDDs: Transformations vs. Actions
      6. Wide vs. Narrow Dependencies
    3. Spark Job Scheduling
      1. Resource Allocation Across Applications
      2. The Spark Application
    4. The Anatomy of a Spark Job
      1. The DAG
      2. Jobs
      3. Stages
      4. Tasks
    5. Conclusion
  4. 3. DataFrames, Datasets & Spark SQL
    1. Getting Started with the SparkSession (or HiveContext or SQLContext)
    2. Spark SQL Dependencies
      1. Managing Spark Dependencies
      2. Avoiding Hive JARs
    3. Basics of Schemas
    4. DataFrame API
      1. Transformations
      2. Multi DataFrame Transformations
      3. Plain Old SQL Queries and Interacting with Hive Data
    5. Data Representation in DataFrames & Datasets
      1. Tungsten
    6. Data Loading and Saving Functions
      1. DataFrameWriter and DataFrameReader
      2. Formats
      3. Save Modes
      4. Partitions (Discovery and Writing)
    7. Datasets
      1. Interoperability with RDDs, DataFrames, and Local Collections
      2. Compile Time Strong Typing
      3. Easier Functional (RDD “like”) Transformations
      4. Relational Transformations
      5. Multi-Dataset Relational Transformations
      6. Grouped Operations on Datasets
    8. Extending with User Defined Functions & Aggregate Functions (UDFs, UDAFs)
    9. Query Optimizer
      1. Logical and Physical Plans
      2. Code Generation
      3. Large Query Plans and Iterative algorithms
    10. Debugging Spark SQL Queries
    11. JDBC/ODBC Server
    12. Conclusion
  5. 4. Joins (SQL & Core)
    1. Core Spark Joins
      1. Choosing a Join Type
      2. Choosing an Execution Plan
    2. Spark SQL Joins
      1. DataFrame Joins
      2. Dataset Joins
    3. Conclusion
  6. 5. Effective Transformations
    1. Narrow vs. Wide Transformations
    2. What Type of RDD Does Your Transformation Return?
    3. Minimizing Object Creation
      1. Reusing Existing Objects
      2. Using Smaller Data Structures
    4. Iterator-to-Iterator Transformations with mapPartitions
      1. What Is an Iterator-To-Iterator Transformation?
      2. Space and Time Advantages
      3. An Example
    5. Set Operations
    6. Reducing Setup Overhead
      1. Shared Variables
      2. Broadcast Variables
      3. Accumulators
    7. Reusing RDDs
      1. Cases For Reuse
      2. Deciding if Recompute is Inexpensive Enough
      3. Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
      4. Tachyon
      5. LRU Caching
      6. Noisy Cluster Considerations
      7. Interaction with Accumulators
    8. Conclusion
  7. 6. Working with Key/Value Data
    1. The Goldilocks Example
      1. Goldilocks Version 0: Iterative Solution
      2. How to Use PairRDDFunctions and OrderedRDDFunctions
    2. Actions on Key/Value Pairs
    3. What’s So Dangerous About the groupByKey Function
      1. Goldilocks Version 1: groupByKey Solution
    4. Choosing an Aggregation Operation
      1. Dictionary of Aggregation Operations with Performance Considerations
    5. Multiple RDD Operations
      1. Co-Grouping
    6. Partitioners and Key/Value Data
      1. Using the Spark Partitioner Object
      2. Hash Partitioning
      3. Range Partitioning
      4. Custom Partitioning
      5. Preserving Partitioning Information Across Transformations
      6. Leveraging Co-Located and Co-Partitioned RDDs
      7. Dictionary of Mapping and Partitioning Functions PairRDDFunctions
    7. Dictionary of OrderedRDDOperations
      1. Sorting by Two Keys with SortByKey
    8. Secondary Sort and repartitionAndSortWithinPartitions
      1. Leveraging repartitionAndSortWithinPartitions for a Group By Key and Sort Values Function
      2. How Not to Sort By Two Orderings
      3. Goldilocks Version 2: Secondary Sort
    9. A Different Approach to Goldilocks
      1. Goldilocks Version 3: Sort on Cell Values
    10. Straggler Detection and Unbalanced Data
      1. Back to Goldilocks (Again)
      2. Goldilocks Version 4: Reduce to Distinct on Each Partition
    11. Conclusion
  8. 7. Going Beyond Scala
    1. Beyond Scala within the JVM
    2. Beyond Scala, and Beyond the JVM
      1. How PySpark Works
      2. How SparkR Works
      3. Spark.jl (Julia Spark)
      4. How Eclair JS Works
      5. Spark on the Common Language Runtime (CLR) - C# and Friends
    3. Calling other languages from Spark
      1. Using Pipe and Friends
      2. JNI
      3. Java Native Access (JNA)
      4. Underneath Everything is FORTRAN
      5. Getting to the GPU
    4. The Future
    5. Conclusion
  9. 8. Testing & Validation
    1. Unit Testing
      1. General Spark Unit Testing
      2. Mocking RDDs
    2. Getting Test Data
      1. Generating Large Data Sets
      2. Sampling
    3. Property Checking with ScalaCheck
      1. Computing RDD Difference
    4. Integration Testing
      1. Choosing Your Integration Testing Environment
    5. Verifying Performance
      1. Spark Counters for Verifying Performance
      2. Projects for Verifying Performance
    6. Job Validation
    7. Conclusion
  10. 9. Spark MLlib and ML
    1. Choosing between Spark MLlib and Spark ML
    2. Working with MLlib
      1. Getting started with MLlib (organization and imports)
      2. MLlib Feature Encoding and Data Preparation
      3. Feature Scaling and Selection
      4. MLlib model training
      5. Predicting
      6. Serving and Persistence
      7. Model evaluation
    3. Working with Spark ML
      1. Spark ML organization and imports
      2. Pipeline Stages
      3. Explain Params
      4. Data Encoding
      5. Data Cleaning
      6. Spark ML Models
      7. Putting it all together in a pipeline
      8. Training a Pipeline
      9. Accessing individual stages
      10. Data Persistence and Spark ML
      11. Extending Spark ML Pipelines with Your Own Algorithms
      12. Model and Pipeline Persistence and Serving with Spark ML
    4. General Serving considerations
    5. Conclusion
  11. 10. Spark Components and Packages
    1. Stream Processing with Spark
      1. Sources and Sinks
      2. Batch Intervals
      3. Data Checkpoint Intervals
      4. Considerations for DStreams
      5. Considerations for Structured Streaming
      6. High Availability Mode (or Handling Driver Failure or Checkpointing)
    2. GraphX
    3. Using Community Packages and libraries
      1. Creating a Spark Package
    4. Conclusion
  12. A. Appendix
    1. Spark Tuning and Cluster Sizing
      1. How to Adjust Spark Settings
      2. How to Determine the Relevant Information About Your Cluster
    2. Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
      1. How Large to Make the Spark Driver
      2. A Few Large Executors or Many Small Executors?
      3. Many Large Executors
      4. Allocating Cluster Resources and Dynamic Allocation
      5. Dividing the Space Within One Executor
      6. Number and Size of Partitions
    3. Serialization Options
      1. Kryo
    4. Some additional Debugging Techniques
  13. Index