Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library for building pipelines of MapReduce jobs. It is based on Google's FlumeJava paper and library (http://dl.acm.org/citation.cfm?id=1806638). The project's goal is to make writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language, by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.

As with tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running all processing operations on those structures; the operations themselves are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG ...
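To make the idea of composing immutable collections through user-defined functions more concrete, here is a minimal sketch in plain Java. It is a toy, in-memory model and not the Crunch API: the names ToyCollection, parallelDo, count, and ToyCrunchSketch are hypothetical and chosen only for illustration, and nothing here is distributed or compiled into MapReduce jobs. It shows only the programming style, in which each operation takes a user-defined function and returns a new collection rather than modifying the existing one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical toy model (NOT the Crunch API): an immutable, in-memory collection.
final class ToyCollection<T> {
    private final List<T> items;

    ToyCollection(List<T> items) {
        // Defensive copy, so later changes to the caller's list are not visible here.
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }

    // Applies a user-defined function that may emit zero or more outputs per input,
    // and returns a new collection; this collection is left untouched.
    <R> ToyCollection<R> parallelDo(BiConsumer<T, Consumer<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            fn.accept(item, out::add);
        }
        return new ToyCollection<>(out);
    }

    // Aggregation pattern: counts the occurrences of each distinct element.
    Map<T, Long> count() {
        Map<T, Long> counts = new LinkedHashMap<>();
        for (T item : items) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }

    List<T> asList() {
        return items;
    }
}

final class ToyCrunchSketch {
    // Word count expressed as a composition of the two operations above.
    static Map<String, Long> wordCount(List<String> lines) {
        ToyCollection<String> input = new ToyCollection<>(lines);
        ToyCollection<String> words = input.<String>parallelDo((line, emit) -> {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        });
        return words.count();
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

In this sketch every step runs eagerly and in memory. A library in the FlumeJava style would instead record such operations and plan their execution later, which is what allows a pipeline to be compiled into a graph of jobs.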
