Streaming Systems

Book Description

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.

Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax.

You’ll explore:

  • How streaming and batch data processing patterns compare
  • The core principles and concepts behind robust out-of-order data processing
  • How watermarks track progress and completeness in infinite datasets
  • How exactly-once data processing techniques ensure correctness
  • How the concepts of streams and tables form the foundations of both batch and streaming data processing
  • The practical motivations behind a powerful persistent state mechanism, driven by a real-world example
  • How time-varying relations provide a link between stream processing and the world of SQL and relational algebra

Table of Contents

  1. Preface Or: What Are You Getting Yourself Into Here?
    1. Navigating This Book
      1. Takeaways
    2. Conventions Used in This Book
    3. Online Resources
      1. Figures
      2. Code Snippets
    4. O’Reilly Safari
    5. How to Contact Us
    6. Acknowledgments
  2. I. The Beam Model
  3. 1. Streaming 101
    1. Terminology: What Is Streaming?
      1. On the Greatly Exaggerated Limitations of Streaming
      2. Event Time Versus Processing Time
    2. Data Processing Patterns
      1. Bounded Data
      2. Unbounded Data: Batch
      3. Unbounded Data: Streaming
    3. Summary
  4. 2. The What, Where, When, and How of Data Processing
    1. Roadmap
    2. Batch Foundations: What and Where
      1. What: Transformations
      2. Where: Windowing
    3. Going Streaming: When and How
      1. When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!
      2. When: Watermarks
      3. When: Early/On-Time/Late Triggers FTW!
      4. When: Allowed Lateness (i.e., Garbage Collection)
      5. How: Accumulation
    4. Summary
  5. 3. Watermarks
    1. Definition
    2. Source Watermark Creation
      1. Perfect Watermark Creation
      2. Heuristic Watermark Creation
    3. Watermark Propagation
      1. Understanding Watermark Propagation
      2. Watermark Propagation and Output Timestamps
      3. The Tricky Case of Overlapping Windows
    4. Percentile Watermarks
    5. Processing-Time Watermarks
    6. Case Studies
      1. Case Study: Watermarks in Google Cloud Dataflow
      2. Case Study: Watermarks in Apache Flink
      3. Case Study: Source Watermarks for Google Cloud Pub/Sub
    7. Summary
  6. 4. Advanced Windowing
    1. When/Where: Processing-Time Windows
      1. Event-Time Windowing
      2. Processing-Time Windowing via Triggers
      3. Processing-Time Windowing via Ingress Time
    2. Where: Session Windows
    3. Where: Custom Windowing
      1. Variations on Fixed Windows
      2. Variations on Session Windows
      3. One Size Does Not Fit All
    4. Summary
  7. 5. Exactly-Once and Side Effects
    1. Why Exactly Once Matters
    2. Accuracy Versus Completeness
      1. Side Effects
      2. Problem Definition
    3. Ensuring Exactly Once in Shuffle
    4. Addressing Determinism
    5. Performance
      1. Graph Optimization
      2. Bloom Filters
      3. Garbage Collection
    6. Exactly Once in Sources
    7. Exactly Once in Sinks
    8. Use Cases
      1. Example Source: Cloud Pub/Sub
      2. Example Sink: Files
      3. Example Sink: Google BigQuery
    9. Other Systems
      1. Apache Spark Streaming
      2. Apache Flink
    10. Summary
  8. II. Streams and Tables
  9. 6. Streams and Tables
    1. Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity
      1. Toward a General Theory of Stream and Table Relativity
    2. Batch Processing Versus Streams and Tables
      1. A Streams and Tables Analysis of MapReduce
      2. Reconciling with Batch Processing
    3. What, Where, When, and How in a Streams and Tables World
      1. What: Transformations
      2. Where: Windowing
      3. When: Triggers
      4. How: Accumulation
      5. A Holistic View of Streams and Tables in the Beam Model
    4. A General Theory of Stream and Table Relativity
    5. Summary
  10. 7. The Practicalities of Persistent State
    1. Motivation
      1. The Inevitability of Failure
      2. Correctness and Efficiency
    2. Implicit State
      1. Raw Grouping
      2. Incremental Combining
    3. Generalized State
      1. Case Study: Conversion Attribution
      2. Conversion Attribution with Apache Beam
    4. Summary
  11. 8. Streaming SQL
    1. What Is Streaming SQL?
      1. Relational Algebra
      2. Time-Varying Relations
      3. Streams and Tables
    2. Looking Backward: Stream and Table Biases
      1. The Beam Model: A Stream-Biased Approach
      2. The SQL Model: A Table-Biased Approach
    3. Looking Forward: Toward Robust Streaming SQL
      1. Stream and Table Selection
      2. Temporal Operators
    4. Summary
  12. 9. Streaming Joins
    1. All Your Joins Are Belong to Streaming
    2. Unwindowed Joins
      1. FULL OUTER
      2. LEFT OUTER
      3. RIGHT OUTER
      4. INNER
      5. ANTI
      6. SEMI
    3. Windowed Joins
      1. Fixed Windows
      2. Temporal Validity
    4. Summary
  13. 10. The Evolution of Large-Scale Data Processing
    1. MapReduce
    2. Hadoop
    3. Flume
    4. Storm
    5. Spark
    6. MillWheel
    7. Kafka
    8. Cloud Dataflow
    9. Flink
    10. Beam
    11. Summary
  14. Index