Chapter 6. Grouping Operations

Some content contributed by Q. Ethan McCallum (@qethanm)

In this chapter, we will introduce grouping operations in Pig and MapReduce. We’ll teach you the schemas behind grouped data, how to inspect and sample grouped data relations, how to count records in groups, and how to use aggregate functions to calculate arbitrary statistics about groups. We’ll teach you to describe and summarize individual records, fields, or entire data tables. In so doing, we’ll explore questions such as, “Does God hate Cleveland?” and “Who are the best players for each phase of their career?”

The GROUP BY operation is fundamental to data processing, both in MapReduce and in the world of SQL. In this chapter, we will cover grouping operations in Pig, each of which takes a single line of Pig code to perform. This is part of Pig’s power. We’ll learn how grouping operations relate to the reduce phase of MapReduce and how to combine map-only operations with GROUP BY operations to perform arbitrary operations on data relations.

Grouping operations are at the heart of MapReduce: they both rely on and define its reduce operation, in which records with the same reduce key are grouped on a single reducer in sorted order. Thus it is possible to define a single MapReduce job that performs any number of map-only operations, followed by a grouping operation, followed by more map-only operations after the reduce. This simple pattern enables MapReduce to ...
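
To make that pattern concrete, here is a minimal single-process sketch in Python rather than Pig or Hadoop. It only imitates the shape of a job: a map-only step, a sort that brings equal reduce keys together, a reduce over each group, and a map-only step after the reduce. The records, the field layout, and the function names (map_phase, reduce_phase, post_reduce, run_job) are all made up for illustration and are not taken from the book's dataset or code.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy records: (player_id, year, home_runs). Invented for illustration.
records = [
    ("player_b", 2001, 30),
    ("player_a", 2001, 10),
    ("player_c", 2002, 5),
    ("player_a", 2002, 20),
    ("player_b", 2002, 40),
]

def map_phase(record):
    """Map-only step: project each record to a (reduce_key, value) pair."""
    player_id, _year, home_runs = record
    return (player_id, home_runs)

def reduce_phase(key, values):
    """Reduce step: sees every value for one key; counts and sums them."""
    values = list(values)
    return (key, len(values), sum(values))

def post_reduce(row):
    """Map-only step after the reduce: derive an average from count and sum."""
    key, count, total = row
    return (key, count, total, total / count)

def run_job(records):
    mapped = [map_phase(r) for r in records]
    # Stand-in for the shuffle/sort: ordering by key makes equal keys adjacent,
    # which is what itertools.groupby needs in order to form one group per key.
    mapped.sort(key=itemgetter(0))
    reduced = [
        reduce_phase(key, (value for _key, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]
    return [post_reduce(row) for row in reduced]

results = run_job(records)
for row in results:
    print(row)
```

In a real cluster the sort and grouping happen across machines during the shuffle, and each reducer sees only its share of the keys; the sketch collapses all of that into one in-memory sort.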
