Fast Data Processing with Spark - Second Edition

Book Description

Perform real-time analytics using Spark in a fast, distributed, and scalable way

In Detail

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API. It integrates with Hadoop and includes built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), so it can be used interactively to process and query big datasets quickly.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

What You Will Learn

  • Install and set up Spark on your cluster

  • Prototype distributed applications with Spark's interactive shell

  • Learn different ways to interact with Spark's distributed representation of data (RDDs)

  • Query Spark with a SQL-like query syntax

  • Effectively test your distributed software

  • Recognize how Spark works with big data

  • Implement machine learning systems with highly scalable algorithms

    Table of Contents

    1. Fast Data Processing with Spark Second Edition
      1. Table of Contents
      2. Fast Data Processing with Spark Second Edition
      3. Credits
      4. About the Authors
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Installing Spark and Setting up your Cluster
        1. Directory organization and convention
        2. Installing prebuilt distribution
        3. Building Spark from source
          1. Downloading the source
          2. Compiling the source with Maven
          3. Compilation switches
          4. Testing the installation
        4. Spark topology
        5. A single machine
        6. Running Spark on EC2
          1. Running Spark on EC2 with the scripts
          2. Deploying Spark on Elastic MapReduce
        7. Deploying Spark with Chef (Opscode)
        8. Deploying Spark on Mesos
        9. Spark on YARN
        10. Spark Standalone mode
        11. Summary
      9. 2. Using the Spark Shell
        1. Loading a simple text file
        2. Using the Spark shell to run logistic regression
        3. Interactively loading data from S3
          1. Running Spark shell in Python
        4. Summary
      10. 3. Building and Running a Spark Application
        1. Building your Spark project with sbt
        2. Building your Spark job with Maven
        3. Building your Spark job with something else
        4. Summary
      11. 4. Creating a SparkContext
        1. Scala
        2. Java
        3. SparkContext – metadata
        4. Shared Java and Scala APIs
        5. Python
        6. Summary
      12. 5. Loading and Saving Data in Spark
        1. RDDs
        2. Loading data into an RDD
        3. Saving your data
        4. Summary
      13. 6. Manipulating your RDD
        1. Manipulating your RDD in Scala and Java
          1. Scala RDD functions
          2. Functions for joining PairRDDs
          3. Other PairRDD functions
          4. Double RDD functions
          5. General RDD functions
          6. Java RDD functions
            1. Spark Java function classes
            2. Common Java RDD functions
            3. Methods for combining JavaRDDs
            4. Functions on JavaPairRDDs
        2. Manipulating your RDD in Python
          1. Standard RDD functions
          2. PairRDD functions
        3. Summary
      14. 7. Spark SQL
        1. The Spark SQL architecture
          1. Spark SQL how-to in a nutshell
          2. Spark SQL programming
            1. SQL access to a simple data table
            2. Handling multiple tables with Spark SQL
            3. Aftermath
        2. Summary
      15. 8. Spark with Big Data
        1. Parquet – an efficient and interoperable big data format
          1. Saving files to the Parquet format
          2. Loading Parquet files
          3. Saving processed RDD in the Parquet format
        2. Querying Parquet files with Impala
        3. HBase
          1. Loading from HBase
          2. Saving to HBase
          3. Other HBase operations
        4. Summary
      16. 9. Machine Learning Using Spark MLlib
        1. The Spark machine learning algorithm table
        2. Spark MLlib examples
          1. Basic statistics
          2. Linear regression
          3. Classification
          4. Clustering
          5. Recommendation
        3. Summary
      17. 10. Testing
        1. Testing in Java and Scala
          1. Making your code testable
          2. Testing interactions with SparkContext
        2. Testing in Python
        3. Summary
      18. 11. Tips and Tricks
        1. Where to find logs
        2. Concurrency limitations
          1. Memory usage and garbage collection
          2. Serialization
          3. IDE integration
        3. Using Spark with other languages
        4. A quick note on security
        5. Community developed packages
        6. Mailing lists
        7. Summary
      19. Index