Chapter 3 introduced watermarks as part of the answer to the fundamental question of when in processing time results are materialized. This chapter takes a deeper look, demonstrating how watermarks connect the event-time and processing-time domains. We will discuss how watermarks are created at the point of data ingress and then propagated through a data processing pipeline, all while preserving the guarantees needed to answer that question of when in the presence of unbounded data.
Consider any pipeline that ingests data continuously and outputs results continuously. We wish to solve the problem of when it is safe to call an event-time window closed, meaning that the window does not expect any more data. To do so, we would like to characterize the progress that the pipeline is making relative to its unbounded input.
One naive approach to the event-time windowing problem would be to base our event-time windows on the current processing time. As we saw in Chapter 1, we quickly run into trouble: data processing and transport are not instantaneous, so processing time and event time are almost never equal. Any hiccup or spike in our pipeline may cause us to assign messages to the wrong windows. Ultimately, this strategy fails because it gives us no robust way to make any guarantees about such windows.
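To make the failure concrete, the following is a minimal sketch of how windowing by processing time can misassign a delayed message. The 60-second fixed windows, the event records, and the helper names `window_start` and `assign_windows` are all hypothetical and chosen only for illustration; they do not correspond to any particular system's API.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed-window size, in seconds


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def assign_windows(events, time_field):
    """Group event ids by the fixed window that `time_field` falls into."""
    windows = defaultdict(list)
    for event in events:
        windows[window_start(event[time_field])].append(event["id"])
    return dict(windows)


# Hypothetical events; all times are seconds since an arbitrary origin.
events = [
    {"id": "a", "event_time": 10, "processing_time": 12},
    {"id": "b", "event_time": 55, "processing_time": 58},
    {"id": "c", "event_time": 59, "processing_time": 95},  # delayed in transit
    {"id": "d", "event_time": 61, "processing_time": 63},
]

# Windows the events actually belong to, judged by when they occurred.
by_event_time = assign_windows(events, "event_time")

# Windows the naive approach produces, judged by when the events arrived.
by_processing_time = assign_windows(events, "processing_time")

# Events whose processing-time window differs from their event-time window.
misassigned = [
    e["id"]
    for e in events
    if window_start(e["event_time"]) != window_start(e["processing_time"])
]

print(by_event_time)       # {0: ['a', 'b', 'c'], 60: ['d']}
print(by_processing_time)  # {0: ['a', 'b'], 60: ['c', 'd']}
print(misassigned)         # ['c']
```

In this sketch a single transport delay is enough to move event `c` into the wrong window, and nothing in the processing-time view reveals that a mistake was made.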
Another intuitive, but ultimately incorrect, approach would be to consider the rate of messages processed by the pipeline. ...