Secondary sorting – sorting Reduce input values

MapReduce frameworks sort the Reduce input data based on the key of the key-value pairs and also group the data based on the key. Hadoop invokes the reduce function for each unique key in the sorted order of keys with the list of values belonging to that key as the second parameter. However, the list of values for each key is not sorted in any particular order. There are many scenarios where it would be useful to have the list of Reduce input values for each key sorted based on some criteria as well. The examples include finding the maximum or minimum value for a given key without iterating the whole list, to optimize Reduce-side joins, to identify duplicate data products, and so on.

We can use the ...

Get Hadoop MapReduce v2 Cookbook - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.