The Spark ecosystem

Apache Spark powers a number of tools, both as a library and as an execution engine.

Spark Streaming

Spark Streaming (documented at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the core Spark API that allows scalable ingestion of data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (time windows of a configurable, fixed duration known as the batch interval), which are then processed by the Spark core engine to generate the final stream of results, also in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStreams support two kinds of operations: transformations, which produce a new DStream, and output operations, which write data out to an external system.
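To make the model concrete, the following is a minimal sketch of a streaming word count in Scala. It reads lines from a TCP socket, applies DStream transformations, and finishes with an output operation; the host localhost and port 9999, the application name, and the 10-second batch interval are illustrative choices, not values taken from the text:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it
    val conf = new SparkConf()
      .setAppName("StreamingWordCount")
      .setMaster("local[2]")

    // The batch interval: each DStream batch covers 10 seconds of input
    val ssc = new StreamingContext(conf, Seconds(10))

    // Ingest a live stream of text lines from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: each call yields a new DStream,
    // itself implemented as a sequence of RDDs
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Output operation: triggers the computation for every batch
    counts.print()

    ssc.start()             // begin receiving and processing data
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}

Note that nothing executes until an output operation such as print() is declared and the context is started; transformations alone only define the lineage of each batch's RDDs.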
