Okay party people, it’s time to get concrete!
The last two chapters focused on three main areas:

- Terminology, defining precisely what I mean when I use overloaded terms like “streaming”
- Batch versus streaming, comparing the theoretical capabilities of the two types of systems, and postulating that only two things are necessary to take streaming systems beyond their batch counterparts: correctness and tools for reasoning about time
- Data processing patterns, looking at the conceptual approaches taken with both batch and streaming systems when processing bounded and unbounded data
In this chapter, we’re going to revisit the data processing patterns from Chapter 2, but this time in greater detail and within the context of concrete examples. By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly take you beyond classic batch processing.
To give you a sense of what things look like in action, I’ll use snippets of Apache Beam code, coupled with time-lapse diagrams to provide a visual representation of the concepts. Apache Beam is a unified programming model and portability layer for batch and stream processing, with a set of concrete SDKs in various languages (e.g., Java and Python). Pipelines written with Apache Beam can then be run portably on any of the supported ...