Welcome to the last of the book’s recipes for R parallelism. This will be a short chapter, but don’t let that fool you: Segue’s scope is intentionally narrow. This focus makes it a particularly powerful tool.
Segue’s mission is as simple as it gets: make it easy to use Elastic MapReduce as a parallel backend for lapply()-style operations. So easy, in fact, that it boasts of doing this in only two lines of R code.
This narrow focus is no accident. Segue’s creator, JD Long, wanted occasional access to a Hadoop cluster to run his pleasantly parallel, computationally expensive models. Elastic MapReduce was a great fit but still a bit cumbersome for his workflow. He created Segue to tackle the grunt work so he could focus on his higher-level modeling tasks.
Segue is a relatively young package. Nonetheless, since its creation in 2010, it has attracted a fair amount of attention.
Motivation: You want Hadoop power to drive some lapply() loops, perhaps for a parameter sweep, but you want minimal Hadoop contact. You consider MapReduce to be too much of a distraction from your work.
Solution: Use Segue’s emrlapply() to send your calculations up to Elastic MapReduce, the Amazon Web Services cloud-based Hadoop service.
Good because: You get to focus on your modeling work, while Segue takes care of transforming your lapply() calls into a Hadoop job.
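To make the lapply()-style pattern concrete, here is a minimal local sketch of a parameter sweep in base R. The model function `run_model` and the `sweep` list are hypothetical, invented for illustration. The commented-out lines show roughly how the same call shape could be handed to Segue, assuming its `createCluster()`, `emrlapply()`, and `stopCluster()` functions; they are not run here because they need AWS credentials and a live cluster.

```r
# Hypothetical model: estimate the mean of n normal draws for a given sd.
run_model <- function(params) {
  set.seed(params$seed)  # make each task reproducible on its own
  draws <- rnorm(params$n, mean = 0, sd = params$sd)
  list(sd = params$sd, estimate = mean(draws))
}

# Build the sweep: one list element per parameter combination.
sweep <- lapply(1:4, function(i) list(seed = i, n = 1000, sd = i * 0.5))

# Serial version: each element is processed independently of the others,
# which is what makes the job a candidate for a parallel backend.
results <- lapply(sweep, run_model)

# Assumed Segue equivalent (not executed in this sketch):
#   myCluster <- createCluster(numInstances = 2)
#   results   <- emrlapply(myCluster, sweep, run_model)
#   stopCluster(myCluster)
```

Because every list element carries everything its task needs, no task depends on another's result. That independence is the property that lets an lapply() call be farmed out to remote workers.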
Segue takes care of launching the Elastic MapReduce cluster, shipping data back and forth, and ...