Learning Spark

Book description

The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open-source cluster computing system that makes data analytics fast to run and fast to write. With Spark, your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.

Table of Contents

  1. Preface
    1. Audience
    2. How This Book is Organized
    3. Supporting Books
    4. Code Examples
    5. Early Release Status and Feedback
  2. Chapter 1: Introduction to Data Analysis with Spark
    1. What is Apache Spark?
    2. A Unified Stack
      1. Spark Core
      2. Spark SQL
      3. Spark Streaming
      4. MLlib
      5. GraphX
      6. Cluster Managers
    3. Who Uses Spark, and For What?
      1. Data Science Tasks
      2. Data Processing Applications
    4. A Brief History of Spark
    5. Spark Versions and Releases
    6. Spark and Hadoop
  3. Chapter 2: Downloading and Getting Started
    1. Downloading Spark
    2. Introduction to Spark’s Python and Scala Shells
    3. Introduction to Core Spark Concepts
    4. Standalone Applications
      1. Initializing a SparkContext
    5. Conclusion
  4. Chapter 3: Programming with RDDs
    1. RDD Basics
    2. Creating RDDs
    3. RDD Operations
      1. Transformations
      2. Actions
      3. Lazy Evaluation
    4. Passing Functions to Spark
      1. Python
      2. Scala
      3. Java
    5. Common Transformations and Actions
      1. Basic RDDs
        1. Transformations
        2. Element-wise transformations
        3. Pseudo Set Operations
        4. Actions
      2. Converting Between RDD Types
        1. Scala
        2. Java
        3. Python
    6. Persistence (Caching)
    7. Conclusion
  5. Chapter 4: Working with Key-Value Pairs
    1. Motivation
    2. Creating Pair RDDs
    3. Transformations on Pair RDDs
      1. Aggregations
        1. Tuning the Level of Parallelism
      2. Grouping Data
      3. Joins
      4. Sorting Data
    4. Actions Available on Pair RDDs
    5. Data Partitioning
      1. Determining an RDD’s Partitioner
      2. Operations that Benefit from Partitioning
      3. Operations that Affect Partitioning
      4. Example: PageRank
      5. Custom Partitioners
    6. Conclusion
  6. Chapter 5: Loading and Saving Your Data
    1. Motivation
    2. Choosing a Format
    3. Formats
      1. Text Files
      2. JSON
      3. CSV (Comma Separated Values) / TSV (Tab Separated Values)
      4. Sequence Files
      5. Object Files
      6. Hadoop Input and Output Formats
        1. Protocol Buffers
      7. Hive and Parquet
    4. File Systems
      1. Local/“Regular” FS
        1. Amazon S3
      2. HDFS
    5. Compression
    6. Databases
      1. Elasticsearch
      2. Mongo
      3. Cassandra
      4. HBase
      5. Java Database Connectivity (JDBC)
    7. Conclusion
  7. About the Authors
  8. Copyright