Chapter 12. Streaming in the Cloud with Apache Spark

The venerable MapReduce computation framework, part of Apache Hadoop from the beginning, is falling out of favor now that newer, more flexible solutions are available. The original MapReduce implementation, built on job trackers and task trackers, has been superseded by YARN, which scales better and can support distributed workloads beyond MapReduce jobs.

One of the most popular alternatives to MapReduce is Apache Spark, which supports a wide variety of algorithms, including mapping and reducing, and also manages the chaining of distributed computations. Much as Hive caters to users who are familiar with relational data, Spark caters to developers, letting them focus on the algorithmic features of the jobs they write rather than hammering them into the MapReduce mold.
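
As a quick illustration, here is a minimal sketch of a Spark job in Scala that chains a map and a reduce over a text file. The application name and input path are placeholders, not part of any cluster built in this book:

import org.apache.spark.sql.SparkSession

object WordLengthSum {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession; the master is normally supplied by spark-submit.
    val spark = SparkSession.builder.appName("WordLengthSum").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; point this at a file that exists on your cluster.
    val lines = sc.textFile("hdfs:///tmp/input.txt")

    // Chain the transformations: split lines into words, map each word to
    // its length, then reduce the lengths to a single sum.
    val totalLength = lines
      .flatMap(_.split("\\s+"))
      .map(_.length)
      .reduce(_ + _)

    println(s"Total characters in all words: $totalLength")
    spark.stop()
  }
}

The same pipeline expressed as classic MapReduce would require separate mapper, reducer, and driver classes; in Spark the whole chain is a few method calls.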

The content in this chapter starts off with installing Spark in a cloud cluster. The instructions assume that you have a cluster set up in the configuration developed in Chapter 9 but, as usual, you should be able to adapt the instructions to your specific situation. Later on, the instructions cover running Hive on Spark, and it’s expected that your cluster is set up for Hive as described in Chapter 11.

Planning for Spark in the Cloud

Spark running in a cluster can use any of several execution engines, including its own “standalone” manager and worker processes that can run in an integrated fashion with a Hadoop cluster. However, Spark can use YARN ...
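
Whichever engine manages the executors, the choice surfaces as the master URL handed to the application. The sketch below, with a hypothetical standalone manager hostname, shows both forms; in practice you would more often pass the master through spark-submit's --master option than hard-code it:

import org.apache.spark.sql.SparkSession

// A minimal sketch of selecting the cluster manager programmatically.
// "yarn" defers resource management to YARN (and requires HADOOP_CONF_DIR
// to point at your cluster's configuration); a spark:// URL targets Spark's
// own standalone manager. The hostname below is hypothetical.
val spark = SparkSession.builder
  .appName("ClusterManagerExample")
  .master("yarn") // or .master("spark://manager-host:7077") for standalone
  .getOrCreate()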
