Distributed streaming

Imagine processing the data stored in a traditional spreadsheet or text-based delimited files such as a CSV file. The type of processing that you will typically execute when using these types of data stores is referred to as batch processing. In batch processing, data is collated into some sort of group, in this case, the collection of lines in our spreadsheet or CSV file, and processed together as a group at some future time and date. Typically, these spreadsheets or CSV files will be refreshed with updated data at some juncture, at which point the same, or similar, processing will be undertaken, potentially all managed by some sort of schedule or timer. Traditionally, data processing systems would have been developed ...

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.