Writing MapReduce programs

In this chapter, we focus on batch workloads: given a set of historical data, we will examine the properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.

Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt
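As a reminder of where this data comes from, the following is only a minimal sketch of what a script like stream.py might do, assuming the tweepy library (3.x API); the credentials are placeholders, and the actual stream.py from Chapter 1 also parses the -t (text only) and -n (tweet count) options rather than hard-coding them:

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class TextOnlyListener(StreamListener):
    # Print the text of each tweet and stop after a fixed number.
    def __init__(self, limit):
        super(TextOnlyListener, self).__init__()
        self.limit = limit
        self.count = 0

    def on_status(self, status):
        print(status.text)
        self.count += 1
        # Returning False disconnects the stream once enough tweets are collected.
        return self.count < self.limit

auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")        # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")  # placeholder credentials
Stream(auth, TextOnlyListener(limit=1000)).sample()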

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>
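Assuming <destination> is a directory in HDFS, we can confirm that the upload succeeded by listing that directory and inspecting the last kilobyte of the file:

$ hdfs dfs -ls <destination>
$ hdfs dfs -tail <destination>/tweets.txt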

Tip

Note that until now we have been working only with the text of tweets. In the remainder of ...
