High Performance Spark

Book Description

If you’ve successfully used Apache Spark to solve medium-sized problems, but still struggle to realize the "Spark promise" of unparalleled performance on big data, this book is for you. High Performance Spark shows you how to take advantage of Spark at scale, so you can grow beyond the novice level. It’s ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications.

Table of Contents

  1. Preface
    1. Who Is This Book For?
    2. Early Release Note
    3. Supporting Books & Materials
    4. Conventions Used in this Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact the Authors
    8. How to Contact Us
    9. Acknowledgments
  2. 1. Introduction to High Performance Spark
    1. Spark Versions
    2. What is Spark and Why Performance Matters
    3. What You Can Expect to Get from This Book
    4. Conclusion
  3. 2. How Spark Works
    1. How Spark Fits into the Big Data Ecosystem
      1. Spark Components
    2. Spark Model of Parallel Computing: RDDs
      1. Lazy Evaluation
      2. In Memory Storage and Memory Management
      3. Immutability and the RDD Interface
      4. Types of RDDs
      5. Functions on RDDs: Transformations vs. Actions
      6. Wide vs. Narrow Dependencies
    3. Spark Job Scheduling
      1. Resource Allocation Across Applications
      2. The Spark application
    4. The Anatomy of a Spark Job
      1. The DAG
      2. Jobs
      3. Stages
      4. Tasks
    5. Conclusion
  4. 3. DataFrames, Datasets & Spark SQL
    1. Getting Started with the HiveContext (or SQLContext)
    2. Basics of Schemas
    3. DataFrame API
      1. Transformations
      2. Multi DataFrame Transformations
      3. Plain Old SQL Queries and Interacting with Hive Data
    4. Data Representation in DataFrames & Datasets
      1. Tungsten
    5. Data Loading and Saving Functions
      1. DataFrameWriter and DataFrameReader
      2. Formats
      3. Save Modes
      4. Partitions (Discovery and Writing)
    6. Datasets
      1. Interoperability with RDDs, DataFrames, and Local Collections
      2. Compile Time Strong Typing
      3. Easier Functional (RDD “like”) Transformations
      4. Relational Transformations
      5. Multi-Dataset Relational Transformations
      6. Grouped Operations on Datasets
    7. Extending with User Defined Functions & Aggregate Functions (UDFs, UDAFs)
    8. Query Optimizer
      1. Logical and Physical Plans
      2. Code Generation
    9. JDBC/ODBC Server
    10. Conclusion
  5. 4. Joins (SQL & Core)
    1. Core Spark Joins
      1. Choosing a Join Type
      2. Choosing an Execution Plan
    2. Spark SQL Joins
      1. DataFrame Joins
      2. Dataset Joins
    3. Conclusion