In this chapter, we will focus on batch workloads: given a set of historical data, we will examine properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the
stream.py script, as shown in Chapter 1, Introduction:
$ python stream.py -t -n 1000 > tweets.txt
We can then copy the dataset into HDFS, where <destination> is the target path in HDFS:
$ hdfs dfs -put tweets.txt <destination>
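Before uploading, it can be useful to verify that the collection produced the expected number of records. The following is a minimal sketch in Python, assuming stream.py writes one tweet per line; the count_tweets helper and the stand-in sample file are illustrative, not part of the original toolchain:

```python
def count_tweets(path):
    """Return the number of non-empty lines (tweets) in a file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Demonstration with a small stand-in file; in practice this would be
# the tweets.txt produced by stream.py.
sample = ["first tweet", "second tweet", "", "third tweet"]
with open("sample_tweets.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample))

print(count_tweets("sample_tweets.txt"))  # 3
```

A count of 1,000 would confirm that the -n 1000 option took effect before the file is copied into HDFS.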
Note that until now we have been working only with the text of tweets. In the remainder of ...