Learning Real-time Processing with Spark Streaming

Book Description

Building scalable and fault-tolerant streaming applications made easy with Spark Streaming

About This Book

  • Process live data streams more efficiently with better fault recovery using Spark Streaming

  • Implement and deploy real-time log file analysis

  • Learn about integration with advanced Spark libraries: GraphX, Spark SQL, and MLlib

Who This Book Is For

This book is intended for big data developers with basic knowledge of Scala but no knowledge of Spark. It will help you grasp the basics of developing real-time applications with Spark and understand efficient programming of core elements and applications.

What You Will Learn

  • Install and configure Spark and Spark Streaming to execute applications

  • Explore the architecture and components of Spark and Spark Streaming to use them as a base for other libraries

  • Process distributed log files in real-time to load data from distributed sources

  • Apply transformation functions to streaming data

  • Integrate Apache Spark with advanced libraries such as MLlib and GraphX

  • Apply production deployment scenarios to deploy your application

In Detail

Using practical examples with easy-to-follow steps, this book will teach you how to build real-time applications with Spark Streaming.

Starting with installing and setting up the required environment, you will write and execute your first Spark Streaming program. You will then explore the architecture and components of Spark Streaming, along with an overview of the libraries and functions exposed by Spark. Next, you will learn about the various client APIs for coding in Spark through the use case of distributed log file processing. You will then apply various functions to transform and enrich streaming data, and learn how to cache and persist datasets. Moving on, you will integrate Apache Spark with its other libraries and components, such as MLlib, GraphX, and Spark SQL. Finally, you will learn about deploying your application in scenarios ranging from standalone mode to distributed mode using Mesos, YARN, and private data centers or cloud infrastructure.

Style and approach

A step-by-step approach to learning Spark Streaming in a structured manner, with detailed explanations of basic and advanced features in an easy-to-follow style. Each topic is explained sequentially and supported by real-world examples and executable code snippets that suit readers with a wide range of experience.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Learning Real-time Processing with Spark Streaming
      1. Table of Contents
      2. Learning Real-time Processing with Spark Streaming
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Installing and Configuring Spark and Spark Streaming
        1. Installation of Spark
          1. Hardware requirements
            1. CPU
            2. RAM
            3. Disk
            4. Network
            5. Operating system
          2. Software requirements
            1. Spark
            2. Java
            3. Scala
            4. Eclipse
          3. Installing Spark extensions – Spark Streaming
        2. Configuring and running the Spark cluster
        3. Your first Spark program
          1. Coding Spark jobs in Scala
          2. Coding Spark jobs in Java
        4. Tools and utilities for administrators/developers
          1. Cluster management
          2. Submitting Spark jobs
        5. Troubleshooting
          1. Configuring port numbers
          2. Classpath issues – class not found exception
          3. Other common exceptions
        6. Summary
      9. 2. Architecture and Components of Spark and Spark Streaming
        1. Batch versus real-time data processing
          1. Batch processing
          2. Real-time data processing
        2. Architecture of Spark
          1. Spark versus Hadoop
          2. Layered architecture – Spark
        3. Architecture of Spark Streaming
          1. What is Spark Streaming?
          2. High-level architecture – Spark Streaming
        4. Your first Spark Streaming program
          1. Coding Spark Streaming jobs in Scala
          2. Coding Spark Streaming jobs in Java
          3. The client application
          4. Packaging and deploying a Spark Streaming job
        5. Summary
      10. 3. Processing Distributed Log Files in Real Time
        1. Spark packaging structure and client APIs
          1. Spark Core
            1. SparkContext and Spark Config – Scala APIs
            2. SparkContext and Spark Config – Java APIs
            3. RDD – Scala APIs
            4. RDD – Java APIs
            5. Other Spark Core packages
          2. Spark libraries and extensions
            1. Spark Streaming
            2. Spark MLlib
            3. Spark SQL
            4. Spark GraphX
        2. Resilient distributed datasets and discretized streams
          1. Resilient distributed datasets
            1. Motivation behind RDD
            2. Fault tolerance
            3. Transformations and actions
            4. RDD storage
            5. RDD persistence
            6. Shuffling in RDD
          2. Discretized streams
        3. Data loading from distributed and varied sources
          1. Flume architecture
          2. Installing and configuring Flume
          3. Configuring Spark to consume Flume events
          4. Packaging and deploying a Spark Streaming job
          5. Overall architecture of distributed log file processing
        4. Summary
      11. 4. Applying Transformations to Streaming Data
        1. Understanding and applying transformation functions
          1. Simulating log streaming
          2. Functional operations
          3. Transform operations
          4. Windowing operations
        2. Performance tuning
          1. Partitioning and parallelism
          2. Serialization
          3. Spark memory tuning
            1. Garbage collection
            2. Object sizes
            3. Executor memory and caching RDDs
        3. Summary
      12. 5. Persisting Log Analysis Data
        1. Output operations in Spark Streaming
        2. Integration with Cassandra
          1. Installing and configuring Apache Cassandra
          2. Configuring Spark for integration with Cassandra
          3. Coding Spark jobs for persisting streaming web logs in Cassandra
        3. Summary
      13. 6. Integration with Advanced Spark Libraries
        1. Querying streaming data in real time
          1. Understanding Spark SQL
          2. Integrating Spark SQL with streams
        2. Graph analysis – Spark GraphX
          1. Introduction to the GraphX API
          2. Integration with Spark Streaming
        3. Summary
      14. 7. Deploying in Production
        1. Spark deployment models
          1. Deploying on Apache Mesos
            1. Installing and configuring Apache Mesos
            2. Integrating and executing Spark applications on Apache Mesos
          2. Deploying on Hadoop or YARN
        2. High availability and fault tolerance
          1. High availability in the standalone mode
          2. High availability in Mesos or YARN
          3. Fault tolerance
            1. Fault tolerance in Spark Streaming
        3. Monitoring streaming jobs
          1. Application or job UI
          2. Integration with other monitoring tools
        4. Summary
      15. Index