This chapter looks at some of the more advanced features of MapReduce, including counters, sorting, and joining datasets.
There are often things you would like to know about the data you are analyzing but which are peripheral to the analysis you are performing. For example, if you were counting invalid records, and discovered that the proportion of invalid records in the whole dataset was very high, you might be prompted to check why so many records were being marked as invalid—perhaps there is a bug in the part of the program that detects invalid records? Or if the data were of poor quality and genuinely did have very many invalid records, after discovering this, you might decide to increase the size of the dataset so that the number of good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control, or for application-level statistics. They are also useful for problem diagnosis. If you are tempted to put a log message into your map or reduce task, then it is often better to see whether you can use a counter instead to record that a particular condition occurred. In addition to counter values being much easier to retrieve than log output for large distributed jobs, you get a record of the number of times that condition occurred, which is more work to obtain from a set of logfiles.
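To make the idea concrete, here is a minimal plain-Java sketch of the counting pattern: each record is classified, and a named counter is incremented instead of writing a log line. It deliberately uses no Hadoop classes so that it runs on its own. The class `CounterSketch`, the enum `RecordQuality`, and the tab-separated record format are all assumptions made for illustration. In a real map or reduce task the equivalent step would typically be a call such as `context.getCounter(RecordQuality.MALFORMED).increment(1)`, with the framework aggregating the values across tasks.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the counter pattern; no Hadoop classes are involved.
final class CounterSketch {

    // Hypothetical counter names: one enum constant per condition of interest.
    enum RecordQuality { VALID, MALFORMED, MISSING_VALUE }

    // Classifies one record. Assumed format: "key<TAB>integer".
    static RecordQuality classify(String record) {
        // Limit -1 keeps a trailing empty field, so "key<TAB>" yields two fields.
        String[] fields = record.split("\t", -1);
        if (fields.length != 2) {
            return RecordQuality.MALFORMED;
        }
        if (fields[1].isEmpty()) {
            return RecordQuality.MISSING_VALUE;
        }
        try {
            Integer.parseInt(fields[1]);
            return RecordQuality.VALID;
        } catch (NumberFormatException e) {
            return RecordQuality.MALFORMED;
        }
    }

    // Increments one counter per record rather than logging each condition.
    static Map<RecordQuality, Long> countRecords(List<String> records) {
        Map<RecordQuality, Long> counters = new EnumMap<>(RecordQuality.class);
        for (RecordQuality quality : RecordQuality.values()) {
            counters.put(quality, 0L); // start every counter at zero
        }
        for (String record : records) {
            counters.merge(classify(record), 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("a\t1", "b\t", "c", "d\tx", "e\t5");
        System.out.println(countRecords(sample));
    }
}
```

Because the totals are kept per condition, the proportion of bad records can be read directly from the counter values, whereas the same figure would have to be reconstructed by searching and tallying log output.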
Hadoop maintains some built-in counters for every job (Table 8-1 ...