Streaming Systems

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Streaming data is a big deal in big data these days, and for good reason. Businesses crave ever more timely data, and streaming is a good way to achieve lower latency. Plus, streaming is a much easier way to tame the massive, unbounded data sets that are increasingly common today.

Expanded from co-author Tyler Akidau’s popular series of blog posts "Streaming 101" and "Streaming 102", this practical book shows data engineers, data scientists, and developers how to work with streaming or event-time data in a conceptual and platform-agnostic way. You’ll go from "101"-level understanding of stream processing to a nuanced grasp of the what, where, when, and how of processing real-time data streams.

Dive deep into topics including watermarks and windowing, as well as state and timers in the context of stream processing. Although the book uses Apache Beam code snippets to make examples concrete, it presents a general and broad explanation of streaming that's not tied to a specific framework.

Table of Contents

  1. Streaming 101
    1. Terminology: what is streaming?
      1. On The Greatly Exaggerated Limitations of Streaming
      2. Event Time vs. Processing Time
    2. Data Processing Patterns
      1. Bounded Data
      2. Unbounded data: batch
      3. Unbounded data: streaming
    3. Summary
  2. The What, Where, When, and How of Data Processing
    1. Roadmap
    2. Batch Foundations: What & Where
      1. What: Transformations
      2. Where: Windowing
    3. Going Streaming: When & How
      1. When: The wonderful thing about triggers, is triggers are wonderful things!
      2. When: Watermarks
      3. When: early/on-time/late triggers FTW!
      4. When: Allowed Lateness (i.e., Garbage Collection)
      5. How: Accumulation
    4. Summary
  3. Watermarks
    1. Definition
    2. Source Watermark Creation
      1. Perfect watermark creation
      2. Heuristic watermark creation
    3. Watermark Propagation
      1. Watermark Propagation Example
    4. Watermarks, Windows, and Output Timestamps
      1. Output timestamps
      2. The tricky case of overlapping windows
    5. Percentile Watermarks
    6. Processing-Time Watermarks
    7. Case Study: Watermarks in Google Cloud Dataflow
    8. Section Title
    9. Summary
  4. Advanced windowing
    1. When/where: processing-time windows 
      1. Event-time windowing
      2. Processing-time windowing via triggers
      3. Processing-time windowing via ingress time
      4. Where: session windows
    2. Where: custom windowing
      1. Variations on fixed windows
      2. Variations on session windows
      3. One size does not fit all
    3. Summary
  5. Exactly-once & side effects
    1. Why exactly once matters
    2. Accuracy vs. completeness
      1. Side effects
      2. Problem definition
    3. Ensuring exactly once in shuffle
    4. Addressing determinism
    5. Performance
      1. Graph optimization
      2. Bloom filters
      3. Garbage collection
    6. Exactly once in sources
    7. Exactly once in sinks
    8. Use cases
      1. Example source: Cloud Pub/Sub
      2. Example sink: files
      3. Example sink: Google Cloud BigQuery
    9. Summary
  6. Streams & tables
    1. Stream & table basics or: a special theory of stream & table relativity
      1. Toward a general theory of stream & table relativity
    2. Batch processing vs. Streams & Tables
      1. A streams & tables analysis of MapReduce
      2. Summary
    3. What, where, when, & how in a streams/tables world
      1. What: transformations
      2. Where: windowing
      3. When: triggers
      4. How: accumulation
      5. A holistic view of streams & tables in the Beam model
    4. A general theory of stream & table relativity
    5. Summary
  7. The Practicalities of Persistent State
    1. Motivation
      1. Section Title
      2. Correctness and efficiency
    2. Implicit state
      1. Raw grouping
      2. Incremental combining
    3. Generalized state
      1. Case study: conversion attribution
      2. Characteristics of a general state API
      3. Persistent state in Apache Beam
    4. Summary
  8. Streaming SQL
    1. What is streaming SQL?
      1. Relational algebra
      2. Time-varying relations
      3. Streams & tables
    2. Looking backwards: stream & table biases
      1. The Beam model: a stream-biased approach
      2. The SQL model: a table-biased approach
    3. Looking forwards: towards robust streaming SQL
      1. Stream/table selection
      2. Temporal operators
    4. Summary
  9. Streaming Joins
    1. All your joins are belong to streaming
    2. Unwindowed joins
      1. FULL OUTER
      2. LEFT OUTER
      3. RIGHT OUTER
      4. INNER
      5. ANTI
      6. SEMI
      7. Summary: unwindowed joins 
    3. Windowed joins
      1. Temporal validity windows
      2. Temporal validity joins
    4. Summary
  10. The Evolution of Large-Scale Data Processing
    1. MapReduce
    2. Hadoop
    3. Flume
    4. Storm
    5. Spark
    6. MillWheel
    7. Kafka
    8. Cloud Dataflow
    9. Flink
    10. Beam
    11. Summary