Streaming Systems

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Streaming data is a big deal in big data these days, and for good reason. Businesses crave ever more timely data, and streaming is a good way to achieve lower latency. Plus, streaming is a much easier way to tame the massive, unbounded data sets that are increasingly common today.

Expanded from co-author Tyler Akidau’s popular series of blog posts "Streaming 101" and "Streaming 102", this practical book shows data engineers, data scientists, and developers how to work with streaming or event-time data in a conceptual and platform-agnostic way. You’ll go from "101"-level understanding of stream processing to a nuanced grasp of the what, where, when, and how of processing real-time data streams.

Dive deep into topics including watermarks and windowing, as well as state and timers in the context of stream processing. Although the book uses Apache Beam code snippets to make examples concrete, it presents a general and broad explanation of streaming that's not tied to a specific framework.

Table of Contents

  1. Preface or: what are you getting yourself into here?
    1. Navigating this book
      1. Takeaways
    2. Conventions Used in This Book
    3. Online Resources
      1. Figures
      2. Code snippets
    4. O’Reilly Safari
    5. How to Contact Us
    6. Acknowledgements
  2. 1. Streaming 101
    1. Terminology: what is streaming?
      1. On The Greatly Exaggerated Limitations of Streaming
      2. Event Time vs. Processing Time
    2. Data Processing Patterns
      1. Bounded Data
      2. Unbounded data: batch
      3. Unbounded data: streaming
    3. Summary
  3. 2. The What, Where, When, and How of Data Processing
    1. Roadmap
    2. Batch Foundations: What & Where
      1. What: Transformations
      2. Where: Windowing
    3. Going Streaming: When & How
      1. When: The wonderful thing about triggers, is triggers are wonderful things!
      2. When: Watermarks
      3. When: early/on-time/late triggers FTW!
      4. When: Allowed Lateness (i.e., Garbage Collection)
      5. How: Accumulation
    4. Summary
  4. 3. Watermarks
    1. Definition
    2. Source Watermark Creation
      1. Perfect watermark creation
      2. Heuristic watermark creation
    3. Watermark Propagation
      1. Understanding Watermark Propagation
      2. Watermark propagation and output timestamps
      3. The tricky case of overlapping windows
    4. Percentile Watermarks
    5. Processing-Time Watermarks
    6. Case studies
      1. Case Study: Watermarks in Google Cloud Dataflow
      2. Case Study: Watermarks in Apache Flink
      3. Case Study: Source Watermarks for Google Cloud Pub/Sub
    7. Summary
  5. 4. Advanced windowing
    1. When/where: processing-time windows
      1. Event-time windowing
      2. Processing-time windowing via triggers
      3. Processing-time windowing via ingress time
    2. Where: session windows
    3. Where: custom windowing
      1. Variations on fixed windows
      2. Variations on session windows
      3. One size does not fit all
    4. Summary
  6. 5. Exactly-once & side effects
    1. Why exactly once matters
    2. Accuracy vs completeness
      1. Side effects
      2. Problem definition
    3. Ensuring exactly once in shuffle
    4. Addressing determinism
    5. Performance
      1. Graph optimization
      2. Bloom filters
      3. Garbage collection
    6. Exactly once in sources
    7. Exactly once in sinks
    8. Use cases
      1. Example source: Cloud Pub/Sub
      2. Example sink: files
      3. Example sink: Google BigQuery
    9. Other systems
      1. Apache Spark Streaming
      2. Apache Flink
    10. Summary
  7. 6. Streams & tables
    1. Stream & table basics or: a special theory of stream & table relativity
      1. Toward a general theory of stream & table relativity
    2. Batch processing vs Streams & Tables
      1. A streams & tables analysis of MapReduce
      2. Summary
    3. What, where, when, & how in a streams/tables world
      1. What: transformations
      2. Where: windowing
      3. When: triggers
      4. How: accumulation
      5. A holistic view of streams & tables in the Beam model
    4. A general theory of stream & table relativity
    5. Summary
  8. 7. The Practicalities of Persistent State
    1. Motivation
      1. The inevitability of failure
      2. Correctness and efficiency
    2. Implicit state
      1. Raw grouping
      2. Incremental combining
    3. Generalized state
      1. Case study: conversion attribution
      2. Conversion attribution with Apache Beam
    4. Summary
  9. 8. Streaming SQL
    1. What is streaming SQL?
      1. Relational algebra
      2. Time-varying relations
      3. Streams & tables
    2. Looking backwards: stream & table biases
      1. The Beam model: a stream-biased approach
      2. The SQL model: a table-biased approach
    3. Looking forwards: towards robust streaming SQL
      1. Stream/table selection
      2. Temporal operators
    4. Summary
  10. 9. Streaming Joins
    1. All your joins are belong to streaming
    2. Unwindowed joins
      1. FULL OUTER
      2. LEFT OUTER
      3. RIGHT OUTER
      4. INNER
      5. ANTI
      6. SEMI
    3. Windowed joins
      1. Fixed windows
      2. Temporal validity
    4. Summary
  11. 10. The Evolution of Large-Scale Data Processing
    1. MapReduce
    2. Hadoop
    3. Flume
    4. Storm
    5. Spark
    6. MillWheel
    7. Kafka
    8. Cloud Dataflow
    9. Flink
    10. Beam
    11. Summary
  12. Index