Chapter 9. Distributed Computation

9.0. Introduction

With the advent of cheaper and cheaper storage, we’re inclined to store more and more data. As this data grows larger and larger, it becomes increasingly difficult to utilize to its full potential. In response, numerous new techniques have emerged in the last decade or so for dealing with such quantities of data.

The primary focus of this chapter is one such technique, MapReduce, developed at Google in the early 2000s. Functional even in name, this technique uses map and reduce in parallel across multiple machines at tremendous scale to process data at phenomenal speeds. In this chapter, we’ll be covering Cascalog, a data-processing library built on top of Hadoop, which is an open source MapReduce implementation.
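The name is apt: strip away the distribution and fault tolerance, and a MapReduce job is a map over the input followed by a reduce of the grouped results. The following word count is only a plain-Clojure illustration of that shape on a single machine, not of Hadoop's API; the data and names are invented for the example.

(require '[clojure.string :as str])

;; Toy input standing in for the lines of a large file
(def lines ["the quick brown fox" "the lazy brown dog"])

(->> lines
     (mapcat #(str/split % #"\s+"))   ; "map" phase: emit one word per token
     (map (fn [word] [word 1]))       ; ...as [key value] pairs
     (group-by first)                 ; "shuffle": group pairs by key
     (map (fn [[word pairs]]          ; "reduce" phase: sum the counts per key
            [word (reduce + (map second pairs))])))
;; => a sequence of [word count] pairs, e.g. ["brown" 2] and ["fox" 1] (order may vary)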

We’ll also briefly cover Storm, a real-time stream-processing library in use at several tech giants such as Twitter, Groupon, and Yahoo!.

Cascalog

Cascalog defines a DSL based on Datalog, the same query language that backs Datomic. It might seem strange at first, but you will be thinking in Datalog in no time. Once you've gotten your feet wet with these recipes, visit the Cascalog wiki for more information on writing your own queries.
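To give a feel for the Datalog flavor, here is a minimal query sketch; the in-memory people dataset and field names are invented for illustration. A query binds output variables, draws tuples from a generator, and filters with ordinary predicates.

(require '[cascalog.api :refer :all])

;; A tiny in-memory dataset; Cascalog accepts Clojure sequences of tuples as generators
(def people [["alice" 28] ["bob" 33] ["carol" 24]])

;; Execute the query immediately and print results to standard output
(?<- (stdout)              ; sink for the results
     [?person ?age]        ; output variables
     (people ?person ?age) ; generator: binds each tuple to the variables
     (> ?age 25))          ; filter: keep only people older than 25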

Cascalog provides a concise syntax for describing data-processing jobs. Transformations and aggregates are easy to express in Cascalog. Joins are particularly simple. You might like the Cascalog syntax so much that you use it even for local jobs.
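For instance, a join is nothing more than using the same variable in two generators. The sketch below, again with made-up in-memory data, joins ages and favorite colors on ?person.

(def ages   [["alice" 28] ["bob" 33]])
(def colors [["alice" "blue"] ["bob" "green"]])

;; Reusing ?person in both generators performs an implicit inner join
(?<- (stdout)
     [?person ?age ?color]
     (ages ?person ?age)
     (colors ?person ?color))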

You can run your Cascalog jobs in a number of different ways, from quick, in-memory runs at the REPL to full jobs submitted to a Hadoop cluster.
