Chapter 8. Ordering Operations
In this chapter, we will cover ordering operations, or operations that sort data according to some criteria. Pig has two concepts of order: entire datasets can be sorted, as can the contents of a bag. We’ll learn how to sort relations and bags, and also how to calculate the top records of a relation by combining ORDER with LIMIT. With these skills in hand, we’ll be one step closer to being able to solve any arbitrary data-processing task using the set of patterns we’ve learned.
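As a rough preview of the three operations just named, the sketch below sorts a relation, takes its top records with ORDER followed by LIMIT, and sorts the contents of a bag inside a nested FOREACH. The input path, the relation names, and the schema (player_id, year_id, HR) are assumptions made for illustration, not the dataset prepared later in this chapter.

```pig
-- Hypothetical input: one row per player-season (path and schema are assumed)
bat_seasons = LOAD 'bat_seasons.tsv'
    AS (player_id:chararray, year_id:int, HR:int);

-- Sort the entire relation: most home runs first
sorted = ORDER bat_seasons BY HR DESC;

-- Top records: LIMIT applied directly to the ordered relation
top_10 = LIMIT sorted 10;

-- Sort the contents of a bag: each player's seasons in chronological order
by_player = GROUP bat_seasons BY player_id;
careers = FOREACH by_player {
    chrono = ORDER bat_seasons BY year_id ASC;
    GENERATE group AS player_id, chrono AS seasons;
};

DUMP top_10;
```

Note that the nested ORDER only arranges tuples within each player's bag; it says nothing about the order of the players themselves.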
Ordering operations are a fundamental part of storytelling. A big part of telling stories with data is coming up with examples that prove a point. This means diving into the data to produce the most exceptional records. When data is big, this invariably means you need to sort the data to pick up the highest or lowest value(s) of some metric.
So far we’ve mostly limited ourselves to the ordering inherently provided by the shuffle/sort phase of MapReduce, which does provide a sorted list on the reduce key for each file. If we’re running a small job with a single reducer, that does provide a total sort. However, if we want an overall sort using multiple reducers (as we must, if we’re working with big data), we must employ Pig’s ORDER command. Let’s begin!
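To make the multiple-reducer case concrete, here is a minimal sketch of an ORDER statement that asks for several reducers with the PARALLEL clause. The paths, relation names, schema, and the choice of four reducers are all assumptions for illustration.

```pig
-- Hypothetical input (path and schema are assumed)
seasons = LOAD 'bat_seasons.tsv'
    AS (player_id:chararray, year_id:int, HR:int);

-- Request 4 reducers. ORDER range-partitions the data, so the output
-- part files, read in sequence, form one total ordering.
sorted = ORDER seasons BY HR DESC, player_id ASC PARALLEL 4;

STORE sorted INTO 'seasons_by_hr';
```

The secondary key (player_id) is included here only to give ties on HR a consistent order.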
Preparing Career Epochs
In order to demonstrate ordering records, we’re going to prepare a dataset detailing the performance of players at three phases of their career: young, prime, and older. To do so, we’ll ...