Writing MapReduce programs

In this chapter, we will focus on batch workloads: given a set of historical data, we will examine properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.

Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>
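Before running any jobs against the file, it is worth confirming that the upload succeeded. The commands below are a sketch assuming tweets.txt was placed in your HDFS home directory; substitute whatever destination path you used with -put:

```shell
# List the uploaded file to confirm it exists and check its size
# (assumes tweets.txt landed in your HDFS home directory)
hdfs dfs -ls tweets.txt

# Spot-check the end of the file; -tail prints the last kilobyte
hdfs dfs -tail tweets.txt
```

If -ls reports the expected file size and -tail shows recognizable tweet text, the dataset is ready for the MapReduce examples that follow.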

Tip

Note that until now we have been working only with the text of tweets. In the remainder of ...
