Chapter 7. RHIPE

This chapter is a guide to Saptarshi Guha’s RHIPE package, the R and Hadoop Integrated Processing Environment. RHIPE’s development history dates back to 2009 and it is still actively maintained by the original author.

Compared to R+Hadoop, RHIPE shields you from raw Hadoop, but you still need to understand the MapReduce model.

Because the previous two chapters covered MapReduce and Hadoop in detail, this chapter takes a very short route to the examples.

Quick Look

Motivation: You like the power of MapReduce, as explained in the previous chapter, but you want something a little more R-centric.

Solution: Use the RHIPE R package as your Hadoop emissary. Even though you’ll still have to understand MapReduce, you won’t have to directly touch Hadoop.

Good because: You get Hadoop’s power without leaving the comfy confines of R’s language and interactive shell. (RHIPE even includes tools to work with HDFS.) This means you can MapReduce through a mountain of data during an interactive session of exploratory analysis.

How It Works

RHIPE sits between you and Hadoop. You write your Map and Reduce functions as R code, and RHIPE handles the scut work of invoking Hadoop commands.
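
Before looking at RHIPE itself, it may help to see the division of labor in miniature. The following is a minimal sketch in base R only: it does not use RHIPE or Hadoop, and every function name in it (map_fn, reduce_fn, run_local_mapreduce) is hypothetical, invented for this illustration. The two short functions stand in for the Map and Reduce code you would write, and the driver imitates, on a single machine, the grouping and dispatch work that a framework takes off your hands.

```r
# Local, base-R imitation of the MapReduce flow (no Hadoop, no RHIPE).
# All function names here are hypothetical and exist only for this sketch.

# Map: turn one input line into a list of (key, value) pairs, one per word.
map_fn <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings from the split
  lapply(words, function(w) list(key = w, value = 1L))
}

# Reduce: collapse all values that share a key into a single count.
reduce_fn <- function(key, values) {
  sum(unlist(values))
}

# Driver: the part a framework would normally do for you.
run_local_mapreduce <- function(lines, map_fn, reduce_fn) {
  # Map phase: apply the mapper to every record and flatten the results.
  pairs  <- unlist(lapply(lines, map_fn), recursive = FALSE)
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- lapply(pairs, function(p) p$value)
  # Shuffle phase: group the values by key.
  grouped <- split(values, keys)
  # Reduce phase: one reducer call per distinct key.
  mapply(reduce_fn, names(grouped), grouped)
}

input_lines <- c("the quick brown fox", "the lazy dog", "The fox")
word_counts <- run_local_mapreduce(input_lines, map_fn, reduce_fn)
print(word_counts)
```

Only map_fn and reduce_fn carry the logic of the analysis; everything in the driver is bookkeeping. On a real cluster that bookkeeping also spans many machines, which is the work the text above describes RHIPE and Hadoop handling for you.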

To give you a quick example, here’s a typical RHIPE call:

rhipe.job.def <- rhmr(
        map= ... block of R code for Mapper ... ,
        reduce= ... block of R code for Reducer ... ,
        ifolder="/path/to/input" ,
        ofolder="/path/to/output" ,
        ... a couple other RHIPE options
)

# run the job definition created above
rhex( rhipe.job.def )

That’s it! There’s no ...
