Writing MapReduce programs

In this chapter, we will focus on batch workloads: given a set of historical data, we will examine properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.

Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>
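Before running any jobs against the file, it is worth confirming that the upload succeeded. The commands below are a sketch assuming tweets.txt was placed in your HDFS home directory; substitute whatever destination path you used with -put:

```shell
# List the uploaded file to confirm it exists and check its size
# (assumes tweets.txt landed in your HDFS home directory)
hdfs dfs -ls tweets.txt

# Spot-check the end of the file; -tail prints the last kilobyte
hdfs dfs -tail tweets.txt
```

If -ls reports the expected file size and -tail shows recognizable tweet text, the dataset is ready for the MapReduce examples that follow.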

Tip

Note that until now we have been working only with the text of tweets. In the remainder of ...
