We will now shift from discussing programming models and APIs to the systems that implement them. A model and an API allow users to describe what they want to compute. Actually running the computation accurately at scale requires a system, usually a distributed one.
In this chapter we will focus on how a system can correctly implement the Beam model to produce accurate results. Streaming systems often talk about “exactly-once processing”: ensuring that every record is processed exactly one time. We will explain what we mean by this, and how it might be implemented.
As a motivating example, this chapter will focus on techniques used by Google Cloud Dataflow to efficiently guarantee exactly-once processing of records. Towards the end of the chapter, we will also look at techniques used by some other popular streaming systems to guarantee exactly-once processing.
It almost goes without saying that for many users, any risk of dropped records or data loss in their data-processing pipelines is unacceptable. Even so, historically many general-purpose streaming systems made no guarantees about record processing: all processing was “best effort” only. Other systems provided at-least-once guarantees, ensuring that records were always processed at least once, but records might be duplicated (and thus result in inaccurate aggregations); in practice, many such at-least-once systems performed aggregations in memory, and thus ...