Chapter 8. Segue

Welcome to the last of the book’s recipes for R parallelism. This will be a short chapter, but don’t let that fool you: Segue’s scope is intentionally narrow, and that focus makes it a particularly powerful tool.

Segue’s mission is as simple as it gets: make it easy to use Elastic MapReduce as a parallel backend for lapply()-style operations. So easy, in fact, that it boasts of doing this in only two lines of R code.[59]

This narrow focus is no accident. Segue’s creator, JD Long, wanted occasional access to a Hadoop cluster to run his pleasantly parallel,[60] computationally expensive models. Elastic MapReduce was a great fit but still a bit cumbersome for his workflow. He created Segue to tackle the grunt work so he could focus on his higher-level modeling tasks.

Segue is a relatively young package. Nonetheless, since its creation in 2010, it has attracted a fair amount of attention.

Quick Look

Motivation: You want Hadoop power to drive some lapply() loops, perhaps for a parameter sweep, but you want minimal Hadoop contact. You consider MapReduce to be too much of a distraction from your work.

Solution: Use the segue package’s emrlapply() to send your calculations up to Elastic MapReduce, the Amazon Web Services cloud-based Hadoop product.

Good because: You get to focus on your modeling work, while segue takes care of transforming your lapply() work into a Hadoop job.
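
To make the lapply()-style pattern concrete, here is a minimal sketch of a parameter sweep run locally, followed by a commented-out outline of how the same call might look with segue. The model function run_model() and the parameter values are hypothetical stand-ins. Apart from emrlapply(), the segue function names shown (setCredentials(), createCluster(), stopCluster()) and their arguments are recalled from the package's documentation and should be treated as assumptions to verify against your installed version. The segue portion is left as comments because running it requires AWS credentials and incurs charges.

```r
# Hypothetical parameter sweep: one list element per parameter value.
set.seed(42)
params <- as.list(seq(0.5, 2.5, by = 0.5))

# Stand-in for a computationally expensive model (illustrative only):
# estimate the mean of an exponential distribution by simulation.
run_model <- function(rate, n = 1000) {
  mean(rexp(n, rate = rate))
}

# Local, serial baseline: each element is processed independently,
# which is what makes the work "pleasantly parallel".
local_results <- lapply(params, run_model)

# Sketch of the segue equivalent (not run; names other than emrlapply()
# are assumptions, and the cluster size is arbitrary):
#
#   library(segue)
#   setCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
#   cluster <- createCluster(numInstances = 5)
#   emr_results <- emrlapply(cluster, params, run_model)
#   stopCluster(cluster)
```

The point of the sketch is that the list and the function stay the same in both versions; only the apply call and the surrounding cluster setup and teardown differ.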

How It Works

Segue takes care of launching the Elastic MapReduce cluster, shipping data back and forth, and ...
