Hadoop is an open-source framework for large-scale data storage and distributed computing, built on the MapReduce programming model. Doug Cutting originally created Hadoop as a component of the Nutch web crawler. It became its own project in 2006 and graduated to a top-level Apache project in 2008. Since then, Hadoop has seen widespread adoption.
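To give you a feel for the MapReduce model before we dig into Hadoop itself, here is a minimal sketch in plain R (no Hadoop involved): a map phase emits key/value pairs, a shuffle groups values by key, and a reduce phase combines each group. The word-count task and all of the variable names are my own illustration, not anything from Hadoop's API.

	# Input: a handful of lines of text
	lines <- c("the quick brown fox", "the lazy dog", "the fox")

	# Map phase: emit a (word, 1) pair for every word in every line
	pairs <- unlist(
	  lapply(strsplit(lines, " "), function(words) {
	    lapply(words, function(w) list(key = w, value = 1))
	  }),
	  recursive = FALSE
	)

	# Shuffle: group the values by key
	grouped <- split(
	  vapply(pairs, `[[`, numeric(1), "value"),
	  vapply(pairs, `[[`, character(1), "key")
	)

	# Reduce phase: sum the values for each key
	counts <- vapply(grouped, sum, numeric(1))
	print(counts)

Hadoop performs these same steps, but distributes the map and reduce work across many machines and handles the shuffle between them for you.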
One of Hadoop’s strengths is that it is a general framework, applicable to a variety of domains and programming languages. One use case, and the common thread of the book’s remaining chapters, is to drive large R jobs.
This chapter explains some basics of MapReduce and Hadoop. It may feel a little out of place, since it isn't specific to R, but the content is too important to hide in an appendix.
Have no fear: I don’t dive into deep details here. There is a lot more to MapReduce and Hadoop than I could possibly cover in this book, let alone in a single chapter. I’ll provide just enough guidance to set you on your way. For a more thorough exploration, I encourage you to read Google’s MapReduce paper, as well as Hadoop: The Definitive Guide by Tom White (O’Reilly).
If you already have a grasp of MapReduce and Hadoop, feel free to skip to the next chapter.
When people think “Apache Hadoop,” they often think about churning through terabytes of input across clusters made of tens or hundreds of machines, or nodes. Logfile processing is such an oft-cited use case, in fact, ...