Apache Crunch is a
higher-level API for writing MapReduce pipelines. The main advantages it offers
over plain MapReduce are its focus on programmer-friendly Java types like
String and plain old Java objects, a richer set of
data transformation operations, and multistage pipelines (no need to
explicitly manage individual MapReduce jobs in a workflow).
In these respects, Crunch looks a lot like a Java version of Pig. One day-to-day source of friction in using Pig, which Crunch avoids, is that the language used to write user-defined functions (Java or Python) is different from the language used to write Pig scripts (Pig Latin), which makes for a disjointed development experience as one switches between the two languages. By contrast, Crunch programs and UDFs are written in a single language (Java or Scala), and UDFs can be embedded right in the programs. The overall experience feels very much like writing a non-distributed program.
Although it has many parallels with Pig, Crunch was inspired by FlumeJava, the Java library developed at Google for building MapReduce pipelines.
FlumeJava is not to be confused with Apache Flume, covered in Chapter 14, which is a system for collecting streaming event data. You can read more about FlumeJava in “FlumeJava: Easy, Efficient Data-Parallel Pipelines” by Craig Chambers et al.
Because they are high level, Crunch pipelines are highly composable, and common functions can be extracted into libraries and reused ...