MapReduce output

The number and size of the output files depend on the number of Reduce tasks in the job. Some guidelines for optimizing outputs are as follows:

  • Compress outputs to save storage. Compression also increases effective HDFS write throughput, since fewer bytes are written to disk and replicated over the network.
  • Avoid writing out-of-band side files as outputs in the Reduce task. If statistical data needs to be collected, Counters are a better choice; statistics collected in side files require an additional aggregation step, whereas the framework aggregates Counters automatically.
  • Depending on the consumer of the job's output files, a splittable compression codec may be appropriate, so that downstream jobs can still process each file with multiple Map tasks.
  • Writing large HDFS files with larger block sizes helps subsequent consumers of the data reduce the number of Map tasks they launch. This is particularly useful when we cascade MapReduce jobs. In ...
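The compression and block-size guidelines above can be sketched in a job driver. This is a minimal sketch, assuming the new (`org.apache.hadoop.mapreduce`) API; the Snappy codec and the 256 MB block size are illustrative choices, not prescriptions, and the class name is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDriver {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "compressed-output");

        // Compress the final job output to save storage and write bandwidth.
        FileOutputFormat.setCompressOutput(job, true);
        // Snappy is fast but not splittable on plain text; prefer a splittable
        // codec, or a container format such as SequenceFile, if downstream
        // jobs must split individual output files across Map tasks.
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // Write output files with a larger HDFS block size (here 256 MB,
        // purely illustrative) so downstream jobs launch fewer Map tasks.
        job.getConfiguration().setLong("dfs.blocksize", 256L * 1024 * 1024);
        return job;
    }
}
```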
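The Counters recommendation can likewise be sketched inside a Reducer. This is a sketch under assumed names: the counter group `JobStats` and counter names are hypothetical. Because the framework aggregates Counters across all tasks, no follow-up aggregation step over side files is needed.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StatsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Record statistics via Counters instead of a side file; the group
        // and counter names here are illustrative. The framework aggregates
        // these values across all Reduce tasks in the job.
        context.getCounter("JobStats", "KEYS_WRITTEN").increment(1);
        context.getCounter("JobStats", "VALUES_SUMMED").increment(sum);
        context.write(key, new IntWritable(sum));
    }
}
```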
