High Performance Spark

Book Description

If you’ve successfully used Apache Spark to solve medium-sized problems but still struggle to realize the "Spark promise" of unparalleled performance on big data, this book is for you. High Performance Spark shows you how to take advantage of Spark at scale, so you can grow beyond the novice level. It’s ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications.

Table of Contents

  Preface
    1. Who Is This Book For?
    2. Early Release Note
    3. Supporting Books & Materials
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact the Authors
    8. How to Contact Us
    9. Acknowledgments
  1. Introduction to High Performance Spark
    1. What Is Spark and Why Performance Matters
    2. What You Can Expect to Get from This Book
    3. Spark Versions
    4. Why Scala?
    5. Conclusion
  2. How Spark Works
    1. How Spark Fits into the Big Data Ecosystem
      1. Spark Components
    2. Spark Model of Parallel Computing: RDDs
      1. Lazy Evaluation
      2. In-Memory Persistence and Memory Management
      3. Immutability and the RDD Interface
      4. Types of RDDs
      5. Functions on RDDs: Transformations vs. Actions
      6. Wide vs. Narrow Dependencies
    3. Spark Job Scheduling
      1. Resource Allocation Across Applications
      2. The Spark Application
    4. The Anatomy of a Spark Job
      1. The DAG
      2. Jobs
      3. Stages
      4. Tasks
    5. Conclusion
  3. DataFrames, Datasets & Spark SQL
    1. Getting Started with the SparkSession (or HiveContext or SQLContext)
    2. Spark SQL Dependencies
      1. Avoiding Hive JARs
    3. Basics of Schemas
    4. DataFrame API
      1. Transformations
      2. Multi-DataFrame Transformations
      3. Plain Old SQL Queries and Interacting with Hive Data
    5. Data Representation in DataFrames & Datasets
      1. Tungsten
    6. Data Loading and Saving Functions
      1. DataFrameWriter and DataFrameReader
      2. Formats
      3. Save Modes
      4. Partitions (Discovery and Writing)
    7. Datasets
      1. Interoperability with RDDs, DataFrames, and Local Collections
      2. Compile-Time Strong Typing
      3. Easier Functional (RDD “like”) Transformations
      4. Relational Transformations
      5. Multi-Dataset Relational Transformations
      6. Grouped Operations on Datasets
    8. Extending with User-Defined Functions & Aggregate Functions (UDFs, UDAFs)
    9. Query Optimizer
      1. Logical and Physical Plans
      2. Code Generation
      3. Large Query Plans and Iterative Algorithms
    10. Debugging Spark SQL Queries
    11. JDBC/ODBC Server
    12. Conclusion
  4. Joins (SQL & Core)
    1. Core Spark Joins
      1. Choosing a Join Type
      2. Choosing an Execution Plan
    2. Spark SQL Joins
      1. DataFrame Joins
      2. Dataset Joins
    3. Conclusion
  5. Effective Transformations
    1. Narrow vs. Wide Transformations
    2. What Type of RDD Does Your Transformation Return?
    3. Minimizing Object Creation
      1. Reusing Existing Objects
      2. Using Smaller Data Structures
    4. Iterator-to-Iterator Transformations with mapPartitions
      1. What Is an Iterator-to-Iterator Transformation?
      2. Space and Time Advantages
      3. An Example
    5. Set Operations
    6. Reducing Setup Overhead
      1. Shared Variables
      2. Broadcast Variables
      3. Accumulators
    7. Reusing RDDs
      1. Cases for Reuse
      2. Deciding If Recompute Is Inexpensive Enough
      3. Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
      4. Tachyon
      5. LRU Caching
      6. Noisy Cluster Considerations
      7. Interaction with Accumulators
    8. Conclusion
  6. Working with Key/Value Data
    1. The Goldilocks Example
      1. Goldilocks Solution Version 0: Iterative Solution
      2. How to Use PairRDDFunctions and OrderedRDDFunctions
    2. Actions on Key/Value Pairs
    3. What’s So Dangerous About the groupByKey Function
      1. Goldilocks Version 1: groupByKey Solution
    4. Choosing an Aggregation Operation
      1. Dictionary of Aggregation Operations with Performance Considerations
    5. Multiple RDD Operations
      1. Co-Grouping
    6. Partitioners and Key/Value Data
      1. Using the Spark Partitioner Object
      2. Hash Partitioning
      3. Range Partitioning
      4. Custom Partitioning
      5. Preserving Partitioning Information Across Transformations
      6. Leveraging Co-Located and Co-Partitioned RDDs
      7. Dictionary of Mapping and Partitioning Functions in PairRDDFunctions
    7. Dictionary of OrderedRDDOperations
      1. Sorting by Two Keys with sortByKey
    8. Secondary Sort and repartitionAndSortWithinPartitions
      1. Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
      2. How Not to Sort by Two Orderings
      3. Goldilocks Version 2: Secondary Sort
    9. A Different Approach to Goldilocks
      1. Goldilocks Version 3: Sort on Cell Values
    10. Straggler Detection and Unbalanced Data
      1. Back to Goldilocks (Again)
      2. Goldilocks Version 4: Reduce to Distinct on Each Partition
    11. Conclusion
  7. Going Beyond Scala
    1. Beyond Scala Within the JVM
    2. Beyond Scala, and Beyond the JVM
      1. How PySpark Works
      2. How SparkR Works
      3. Spark.jl (Julia Spark)
      4. How EclairJS Works
      5. Spark on the CLR (C# and Friends)
    3. Calling Other Languages from Spark
      1. Using Pipe and Friends
      2. JNI
      3. Java Native Access (JNA)
      4. Underneath Everything Is FORTRAN
      5. Getting to the GPU
    4. Memory Overhead Errors
    5. The Future
    6. Conclusion