Chapter 8. Segue
Welcome to the last of the book’s recipes for R parallelism. This will be a short chapter, but don’t let that fool you: Segue’s scope is intentionally narrow, and that focus makes it a particularly powerful tool.
Segue’s mission is as simple as it gets: make it easy to use Elastic MapReduce as a parallel backend for lapply()-style operations. So easy, in fact, that it boasts of doing this in only two lines of R code.[59]
This narrow focus is no accident. Segue’s creator, JD Long, wanted occasional access to a Hadoop cluster to run his pleasantly parallel,[60] computationally expensive models. Elastic MapReduce was a great fit but still a bit cumbersome for his workflow. He created Segue to tackle the grunt work so he could focus on his higher-level modeling tasks.
Segue is a relatively young package. Nonetheless, since its creation in 2010, it has attracted a fair amount of attention.
Quick Look
Motivation: You want Hadoop power to drive some lapply() loops, perhaps for a parameter sweep, but you want minimal Hadoop contact. You consider MapReduce to be too much of a distraction from your work.
Solution: Use the segue package’s emrlapply() to send your calculations up to Elastic MapReduce, the Amazon Web Services cloud-based Hadoop product.
Good because: You get to focus on your modeling work, while segue takes care of transforming your lapply() work into a Hadoop job.
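To make the Quick Look concrete, here is a minimal sketch of the kind of lapply() parameter sweep it describes, followed by a commented-out outline of how the same sweep might be handed to Elastic MapReduce. The function simulate_mean() and the list sample_sizes are hypothetical stand-ins invented for illustration. The segue calls are shown as commonly documented for the package, but they are left as comments because they need the segue package, AWS credentials, and a billable cluster; the credential strings and the instance count are placeholders.

```r
# Local baseline: a small parameter sweep with base R's lapply().
# simulate_mean() is a hypothetical stand-in for an expensive model.
simulate_mean <- function(n) {
  set.seed(n)        # seed each run so the local results are reproducible
  mean(rnorm(n))     # average of n standard-normal draws
}

sample_sizes <- list(10, 100, 1000)   # hypothetical parameter values
local_results <- lapply(sample_sizes, simulate_mean)

# The same sweep sent to Elastic MapReduce. Not run here: it assumes the
# segue package is installed and that you supply your own AWS credentials.
# library(segue)
# setCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")   # placeholders
# my_cluster  <- createCluster(numInstances = 3)         # assumed size
# emr_results <- emrlapply(my_cluster, sample_sizes, simulate_mean)
# stopCluster(my_cluster)   # shut the cluster down when finished
```

The point of the sketch is the shape of the change: the list and the function stay the same, and only the call that applies one to the other is swapped.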
How It Works
Segue takes care of launching the Elastic MapReduce cluster, shipping data back and forth, and ...