With the advent of cheaper and cheaper storage, we’re inclined to store more and more data. As this data grows larger and larger, it becomes increasingly difficult to utilize to its full potential. In response, numerous new techniques have emerged in the last decade or so for dealing with such quantities of data.
The primary focus of this chapter is one such technique, MapReduce,
developed at Google in the early 2000s. Functional even in name, this
technique uses map and reduce in parallel across multiple machines at
tremendous scale to process data at phenomenal speeds. In this chapter,
we’ll be covering Cascalog, a data-processing
library built on top of Hadoop, which is an open source MapReduce
implementation.
We’ll also briefly cover Storm, a real-time stream-processing library in use at several tech giants such as Twitter, Groupon, and Yahoo!.
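To make the map-and-reduce idea concrete, here is a minimal single-machine sketch in plain Clojure. It is only an illustration of the shape of the computation, not how Hadoop or Cascalog implement it; the names `documents`, `count-words`, and `word-counts` are made up for this example. The map step counts words in each document independently, and the reduce step merges the partial counts.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input: each string stands in for one chunk of a large dataset.
(def documents
  ["the quick brown fox"
   "the lazy dog"
   "the fox"])

;; Map step: turn one document into a map of word -> count.
;; Each call is independent of the others, which is what lets a
;; cluster run this step on many machines at once.
(defn count-words [doc]
  (frequencies (str/split doc #"\s+")))

;; Reduce step: merge the per-document counts by summing them.
(def word-counts
  (reduce (partial merge-with +) {} (map count-words documents)))
```

On a real cluster, the map step would run where the data lives and the framework would shuffle intermediate results to the reducers; here both steps simply run in one process.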
Cascalog defines a DSL based on Datalog, the same query language that backs Datomic. It might seem strange at first, but you will be thinking in Datalog in no time. Once you’ve gotten your feet wet with these recipes, visit the Cascalog wiki for more information on writing your own queries.
Cascalog provides a concise syntax for describing data-processing jobs. Transformations and aggregates are easy to express in Cascalog. Joins are particularly simple. You might like the Cascalog syntax so much that you use it even for local jobs.
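As a rough taste of that syntax, the following sketch runs a small query over in-memory data. It assumes the Cascalog library is already on the classpath, and the datasets `age` and `city` are invented for illustration. Using the same logic variable (`?person`) in two predicates expresses a join, and an ordinary Clojure function such as `<` acts as a filter.

```clojure
;; Assumption: Cascalog is available as a project dependency.
(require '[cascalog.api :refer [??<-]])

;; Hypothetical in-memory datasets; vectors of tuples can act as generators.
(def age  [["alice" 28] ["bob" 33] ["chris" 40]])
(def city [["alice" "Austin"] ["bob" "Boston"]])

;; ??<- runs the query and returns the resulting tuples as a sequence.
(def young-with-city
  (??<- [?person ?city]
        (age ?person ?age)     ; bind ?person and ?age from the age data
        (city ?person ?city)   ; sharing ?person joins the two datasets
        (< ?age 35)))          ; keep only people younger than 35
```

The order of the returned tuples is not guaranteed, so code that consumes the result should not depend on it.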
You can run your Cascalog jobs in a number of different ...