Chapter 2. Stream-Processing Model

In this chapter, we bridge the notion of a data stream—a source of data “on the move”—with the programming language primitives and constructs that allow us to express stream processing.

We describe the simple, fundamental concepts first, before moving on to how Apache Spark represents them. Specifically, we cover the following components of stream processing:

  • Data sources

  • Stream-processing pipelines

  • Data sinks

We then show how those concepts map to the specific stream-processing model implemented by Apache Spark.
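To make that mapping concrete before we get there, here is a minimal sketch of the three components in Structured Streaming's Scala API. The rate source and console sink used here are built into Spark; the application name and the rows-per-second setting are arbitrary choices for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder
      .appName("sources-pipelines-sinks")   // arbitrary name for this sketch
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Data source: the built-in rate source emits (timestamp, value) rows.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    // Stream-processing pipeline: ordinary DataFrame transformations.
    val evens = source.filter($"value" % 2 === 0)

    // Data sink: the console sink prints each micro-batch to standard output.
    val query = evens.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()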

Next, we characterize stateful stream processing, a type of stream processing in which producing results for new data requires keeping some intermediate state from past computations. Finally, we consider streams of timestamped events and the basic notions involved in addressing concerns such as “what do I do if the order and timeliness of arriving events do not match my expectations?”
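To preview both ideas, the following continues the earlier sketch. Counting events per event-time window is stateful: Spark must remember partial counts between micro-batches. The watermark tells the engine how long to wait for late events before finalizing a window; the one-minute and five-minute durations are arbitrary choices:

    // Stateful, event-time aggregation over the rate source defined earlier.
    // The watermark bounds how late an event may arrive and still be counted.
    val countsPerWindow = source
      .withWatermark("timestamp", "1 minute")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

With the watermark in place, Spark can discard state for windows that can no longer receive events, which keeps the bookkeeping bounded.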

Sources and Sinks

As we mentioned earlier, Apache Spark, in each of its two streaming systems—Structured Streaming and Spark Streaming—is a programming framework with APIs in the Scala, Java, Python, and R programming languages. It can operate only on data that enters the runtime of programs using this framework, and it ceases to operate on that data as soon as the data is sent to another system.

This is a concept that you are probably already familiar with in the context of data at rest: to operate on data, we first need to read it from some form of storage, and once we have produced results, we need to write them out somewhere. In stream processing, the point where data enters the system is called a source, and the point where processed data leaves it is called a sink.
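For comparison, a batch counterpart of the earlier streaming sketch might look as follows; the input and output paths are placeholders, not paths from this book:

    // Data at rest: read from storage, transform, write results back out.
    val batchIn  = spark.read.parquet("/path/to/input")    // placeholder path
    val batchOut = batchIn.filter($"value" % 2 === 0)
    batchOut.write.parquet("/path/to/output")              // placeholder path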
