Chapter 5. Real-Time Micro-Batch Processing in Azure

In the previous chapter, we explored the tuple-at-a-time options in Azure for processing real-time streaming data. In this chapter, we focus on the options that take a micro-batch approach to data processing (see Figure 5-1).

Micro-Batch Processing in Azure

In Azure, there are three approaches that process telemetry streams, such as those coming from an Event Hub or IoT Hub, in small batches. Two of these options (Spark Streaming and Storm) run on managed HDInsight clusters; the third (Azure Stream Analytics) is a fully managed service that leaves you with no infrastructure to manage.
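To make the idea of processing a stream "in small batches" concrete, here is a minimal sketch in plain Python. It is not the API of any of the three services. The event shape, the function names (micro_batches, average_per_device), and the two-second batch interval are all assumptions chosen for illustration. The sketch groups telemetry events into fixed-length intervals by timestamp and then runs one aggregation per batch, rather than reacting to each event individually.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp_seconds, device_id, temperature).
Event = Tuple[float, str, float]


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[Event]]:
    """Assign each event to a fixed-length batch based on its timestamp."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        # Batch 0 covers [0, interval), batch 1 covers [interval, 2 * interval), ...
        batch_index = int(event[0] // interval)
        batches[batch_index].append(event)
    return dict(batches)


def average_per_device(batch: List[Event]) -> Dict[str, float]:
    """Aggregate one whole batch at once: mean temperature per device."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for _, device_id, temperature in batch:
        sums[device_id] += temperature
        counts[device_id] += 1
    return {device_id: sums[device_id] / counts[device_id] for device_id in sums}


# Made-up sample telemetry, for illustration only.
events = [
    (0.5, "dev-a", 20.0),
    (1.2, "dev-b", 30.0),
    (1.9, "dev-a", 22.0),
    (2.1, "dev-a", 24.0),
    (3.7, "dev-b", 32.0),
]

# One result per two-second batch, instead of one result per event.
results = {
    index: average_per_device(batch)
    for index, batch in sorted(micro_batches(events, interval=2.0).items())
}
```

The batch interval is the main tuning knob in this style of processing: a longer interval amortizes per-batch overhead across more events, while a shorter one lowers the delay before results appear.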

Spark Streaming on HDInsight

Apache Spark is a fast, general-purpose engine for in-memory, distributed computing, with APIs for the Scala, Java, Python, and R languages. Its distinguishing value is the set of higher-level frameworks it layers above the main functionality (referred to as Spark Core) for performing structured and SQL-based data processing (Spark SQL), machine learning (MLlib and SparkML), graph processing (GraphX), and stream processing (Spark Streaming). While there are many solutions in the wild that perform each of these functions individually, Spark stands out in how it lets you combine the frameworks to achieve your goals. For example, you can write a single streaming application that uses Spark Streaming as the data processing framework that internally uses SQL queries ...

