Chapter 3. A Framework for Python and Hadoop Streaming

The current version of Hadoop MapReduce (MapReduce 2, which runs on YARN) is the native distributed processing framework that ships with Hadoop: a software framework for composing jobs that process large amounts of data in parallel on a cluster. The framework exposes a Java API that allows developers to specify input and output locations on HDFS, map and reduce functions, and other job parameters as a job configuration. Jobs are compiled and packaged into a JAR, which the job client submits to the ResourceManager, usually from the command line. The ResourceManager then schedules tasks across the cluster, monitors them, and reports status back to the client.

Typically, a MapReduce application is composed of three Java classes: a Job, a Mapper, and a Reducer. Mappers and reducers handle the details of computation on key/value pairs and are connected through a shuffle and sort phase. The Job configures the input and output data formats by specifying the InputFormat and OutputFormat classes that describe how data is serialized to and from HDFS. All of these classes must extend abstract base classes or implement MapReduce interfaces. Needless to say, developing a Java MapReduce application is verbose.
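
To make this data flow concrete, the following is a minimal, pure-Python simulation of the map, shuffle and sort, and reduce phases. No Hadoop is involved, and the function names and sample input are illustrative only; the point is the contract the Java classes implement: mappers emit key/value pairs, the framework sorts and groups them by key, and reducers fold each group into a result.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Fold all counts that share a key into a single total.
    yield word, sum(counts)

lines = ["a b a", "b c"]

# The shuffle and sort phase: collect every intermediate pair and
# order by key so that equal keys become adjacent groups.
pairs = sorted(pair for line in lines for pair in mapper(line))

for word, group in groupby(pairs, key=itemgetter(0)):
    for result in reducer(word, (count for _, count in group)):
        print(result)  # ('a', 2), then ('b', 2), then ('c', 1)

In Hadoop proper, the same flow runs at cluster scale, with the InputFormat and OutputFormat classes handling serialization at either end.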

However, Java is not the only option for using the MapReduce framework! For example, C++ developers can use Hadoop Pipes, which provides an API for using both HDFS and MapReduce. But what is of most interest to data scientists is Hadoop Streaming, a utility written in Java that allows mappers and reducers to be written as any executable or script that reads key/value pairs from stdin and writes them to stdout.
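
As a preview of the pattern this chapter develops, here is a minimal word count written against that contract. The file names mapper.py and reducer.py are illustrative; all that is assumed is what Streaming defines: each script reads lines from stdin, writes tab-delimited key/value pairs to stdout, and the reducer receives its input sorted by key.

#!/usr/bin/env python
# mapper.py: emit a tab-delimited (word, 1) pair for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t%d\n" % (word, 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word. Because Streaming delivers
# input sorted by key, a running total can be flushed whenever the key changes.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))

Assuming both scripts are executable, a job of this shape is submitted with the streaming JAR that ships with Hadoop, along the lines of hadoop jar hadoop-streaming-*.jar -input indir -output outdir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, where the JAR's path and the HDFS input and output directories depend on the installation.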
