Distributed streaming platform

So far in this book, we have been performing batch processing: we have been given bounded raw data files and have processed that data as a group. As we saw in Chapter 1, The Big Data Ecosystem, stream processing differs from batch processing in that data is processed as and when individual units, or streams, of data arrive. We also saw there how Apache Kafka, as a distributed streaming platform, allows us to move real-time data between systems and applications in a fault-tolerant and reliable manner via a logical streaming architecture comprising the following components:

  • Producers: Applications that generate and send messages
  • Consumers: Applications that ...
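
To make the producer and consumer roles concrete, here is a toy, in-memory sketch of the idea. It is not Kafka's API and involves no broker or network: the class names InMemoryTopic, Producer, and Consumer, and the methods send and poll, are hypothetical names chosen for illustration. It assumes only that a producer appends messages to a shared, ordered log and that a consumer reads from that log while remembering how far it has read.

```python
class InMemoryTopic:
    """Hypothetical stand-in for a topic: an append-only, ordered list of messages."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # Store the message and return its position (offset) in the log.
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset):
        # Return every message at or after the given offset.
        return self._log[offset:]


class Producer:
    """Hypothetical producer: generates and sends messages to a topic."""

    def __init__(self, topic):
        self._topic = topic

    def send(self, message):
        return self._topic.append(message)


class Consumer:
    """Hypothetical consumer: reads messages and tracks its own read position."""

    def __init__(self, topic):
        self._topic = topic
        self._offset = 0  # next position this consumer has not yet read

    def poll(self):
        # Fetch unread messages, then advance the offset past them.
        messages = self._topic.read_from(self._offset)
        self._offset += len(messages)
        return messages


topic = InMemoryTopic()
producer = Producer(topic)
consumer = Consumer(topic)

producer.send("reading-1")
producer.send("reading-2")
first_batch = consumer.poll()   # messages sent so far

producer.send("reading-3")
second_batch = consumer.poll()  # only the message sent since the last poll
```

Because the consumer keeps its own offset, each call to poll returns only messages it has not yet seen, which is the sense in which data is processed as it arrives rather than as one bounded batch. A real deployment would use a Kafka client library and a running cluster instead of these in-memory classes.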
