We will now shift from discussing programming models and APIs to the systems that implement them. A model and an API allow users to describe what they want to compute. Actually running that computation accurately at scale requires a system - usually a distributed system.
In this chapter, we will focus on how a system can correctly implement the Beam model to produce accurate results. Streaming systems often talk about “exactly-once processing” - ensuring that every record is processed exactly one time. We will explain what we mean by this and how it might be implemented.
Google Cloud Dataflow is a hosted service that runs Beam pipelines. In this chapter, we will examine some techniques used by Dataflow to efficiently and accurately process records through Beam pipelines exactly once.
It almost goes without saying that for many users, any risk of dropped records or data loss in their data-processing pipelines is unacceptable. Even so, historically many general-purpose streaming systems made no guarantees about record processing - all processing was “best effort” only. Other systems provided at-least-once guarantees, ensuring that records were always processed at least once, but records might be duplicated (and thus result in inaccurate aggregations); in practice, many such at-least-once systems performed aggregations in memory, and thus their aggregations could still be lost when machines crashed.
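To make the duplication problem concrete, here is a minimal sketch - plain Java, not Beam or Dataflow code, with invented class and record names - of a naive in-memory counter fed by an at-least-once source that redelivers a record after a failure:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of at-least-once delivery feeding an
// in-memory aggregation; not an actual streaming-system implementation.
public class AtLeastOnceCount {
  public static void main(String[] args) {
    // The source timed out waiting for an acknowledgment of "user-42"
    // and retried it, so the record is delivered twice.
    String[] delivered = {"user-17", "user-42", "user-42", "user-99"};

    Map<String, Long> counts = new HashMap<>();
    for (String key : delivered) {
      // Without deduplication, the retry is counted as a distinct event.
      counts.merge(key, 1L, Long::sum);
    }

    // Prints a count of 2 for user-42 even though only one event occurred;
    // and because the map lives only in memory, the entire aggregation is
    // lost if this process crashes before persisting it.
    System.out.println(counts);
  }
}
```

Both failure modes in this sketch - overcounting from retries and losing in-memory state on a crash - are exactly what the guarantees discussed in this chapter are meant to rule out.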
As a result, ...