Learning Spark Streaming

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Building analytics tools that deliver faster insights requires processing data in real time, which means moving from batch processing to stream processing. Fortunately, Apache Spark, the in-memory data processing framework, includes an extension devoted to fault-tolerant stream processing: Spark Streaming.

If you're familiar with Apache Spark and want to learn how to implement it for streaming jobs, this practical book is a must.

  • Understand how Spark Streaming fits in the big picture
  • Learn core concepts such as Spark RDDs, Spark Streaming clusters, and the fundamentals of a DStream
  • Discover how to create a robust deployment
  • Dive into streaming algorithmics
  • Learn how to tune, measure, and monitor Spark Streaming

Table of Contents

  1. Introducing Spark Streaming
    1. Large-scale data analytics and Apache Spark
    2. More than MapReduce: how the model came about and how Spark extends it
      1. A Fault-tolerant MapReduce cluster
      2. A distributed file system
      3. Two higher-order functions
    3. Optimizations in a reduce operation
      1. Associativity: a necessary condition
      2. Shuffling
      3. Map-side combiner
    4. To learn more about MapReduce
      1. The Spark ecosystem, approach and polyglot APIs
      2. Multiple frameworks, and a framework scheduler
      3. A Data Processing engine
      4. A polyglot API
      5. A MapReduce extension
      6. A SQL interface, expanding into a DataFrame interface
      7. A Real Time processing engine
      8. In-memory computing, with impact on processing speed and latency
      9. MapReduce and memory legacy
      10. Spark’s Memory Usage
      11. A customizable cache
      12. Operation Latency
    5. How Spark Streaming fits in the Big Picture
      1. Micro-batching
      2. A strong Streaming characteristic
      3. A minimal delay
      4. Throughput-oriented tasks
    6. Why you would want to use Spark Streaming
      1. Building a pipeline
      2. Productive deployment of pipelines
      3. Productive implementation of data analysis
    7. To learn more about Spark
    8. Conclusion
    9. Bibliography
  2. Core Spark Streaming concepts
    1. Apache Spark RDDs
      1. Resilient Distributed Datasets
      2. Transformations and Actions
      3. The Shuffle
      4. Partitions
      5. Debugging RDDs
      6. Witnessing caching
    2. Spark Streaming Clusters
      1. The Standalone Spark cluster
      2. Yet Another Resource Negotiator (YARN)
      3. Apache Mesos
      4. Spark Streaming: a delicate deployment
    3. To learn more about running Spark on a cluster
    4. Fundamentals of a DStream
      1. A Bulk-synchronous model
      2. The Spark Streaming Context
      3. Representing regular updates to a fixed window of data
      4. The Receiver Model
      5. Receiver parallelism
    5. Conclusion
    6. Bibliography
  3. Streaming application design
    1. Starting with an example: Twitter analysis
      1. The Spark Notebook
      2. Creating a Streaming Application
      3. Creating a Stream
      4. Transformations
      5. Actions and Dataflow
      6. Expressing a Dataflow
      7. Starting the Spark Streaming Context
      8. Summary
    2. Windowed Streams
      1. Windowed Streams
      2. A word on changing the batch interval
      3. Slicing your Stream
    3. Other Data Sources and Connectors
      1. Apache Kafka
      2. Apache Flume
      3. Kinesis
      4. Apache Bahir
      5. How to write a quick stream generator for testing: SocketStream, FileStream, QueueStream
    4. The Lambda Architecture
      1. The evolution of ideas, rather than products
      2. A classical but difficult example
      3. Batch processing and a program’s life time
      4. A Streaming improvement
      5. A fundamental difficulty: back to the Lambda Architecture?
    5. Saving Streams
      1. Stream Output and other operations
      2. A word on content selection
      3. Reasons for saving a stream and scaling into real-time
      4. How to Save Streams with DataFrames
    6. Bibliography
  4. Creating robust deployments
    1. Using spark-submit
    2. Thinking about reliability in Spark Streaming: Closures and Function-Passing Style
    3. Spark’s Reliability primitives
    4. Spark’s Fault Tolerance Guarantees
      1. The External shuffle service
      2. Cluster-mode deployment
      3. Checkpointing
      4. A hot-swappable master through Zookeeper
    5. Fault-tolerance in Spark Streaming: the context of the Receiver model
    6. Spark Streaming’s Zero Data Loss guarantees
    7. Cluster managers and driver restart
    8. Comparing cluster managers
    9. Job stability: A time budget question
      1. Batch interval and processing delay
      2. Going deeper: scheduling delay and processing delay
      3. Fixed-rate throttling
    10. Backpressure
      1. Why backpressure
      2. Dynamic throttling
      3. Tuning the backpressure PID
    11. Fault tolerance in Spark Streaming
      1. Planning for side effect stutter in transformations
      2. Idempotent side effects for exactly once processing
      3. Checkpointing and its importance
    12. The Reliable Receiver and the Write-Ahead Log
    13. Apache Kafka and the DirectKafkaReceiver
      1. The Kafka model and its Receiver
    14. Parallel consumers
      1. The Receiver model vs. reliable receivers
    15. Bibliography
  5. Streaming Programming API
    1. Basic Stream transformations
      1. Element-centric DStream Operations
      2. RDD-centric DStream Operations
      3. Counting
    2. Output Operations
      1. foreachRDD
      2. 3rd Party Output Operations
    3. Spark SQL and Spark Streaming
    4. Spark SQL
      1. Accessing Spark SQL Functions From Spark Streaming
      2. Dealing with Data at Rest
      3. Join Optimizations
      4. Updating Reference Data
    5. Stateful Streaming Computation
      1. updateStateByKey
      2. Statefulness at the scale of a stream
      3. updateStateByKey and its limitations
      4. mapWithState
      5. Using mapWithState
      6. Event-time Stream computation with mapWithState
    6. Dynamic Windows
      1. reduceByWindow
      2. Invertible Aggregations
    7. Caching
    8. Measuring and Monitoring
      1. The Streaming UI
      2. The Monitoring API
      3. Conclusion
    9. Bibliography