In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the practical aspects of developing a MapReduce application in Hadoop.
Writing a program in MapReduce has a certain flow to it. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which you can run from within your IDE on a small subset of the data to check that it is working. If it fails, you can use your IDE's debugger to find the source of the problem. With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly.
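One way to make map logic easy to unit test is to keep the record-parsing code in a plain static method that doesn't depend on any Hadoop classes, so it can be exercised without a cluster or a MapReduce runtime. The sketch below is illustrative, not from the book: the class name, tab-separated record format, and field positions are all assumptions.

```java
// Hypothetical sketch: the core parsing logic a Mapper.map() body might
// delegate to, extracted so it can be unit tested in isolation.
public class RecordParser {

    // Parse one input line into a (key, value) pair, or return null for
    // malformed records. Here we assume tab-separated "year<TAB>temperature"
    // records, purely for illustration.
    public static String[] parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            return null;                      // skip records with wrong arity
        }
        try {
            Integer.parseInt(fields[1]);      // validate the numeric field
        } catch (NumberFormatException e) {
            return null;                      // skip non-numeric temperatures
        }
        return new String[] { fields[0], fields[1] };
    }

    public static void main(String[] args) {
        String[] kv = parse("1950\t22");
        System.out.println(kv[0] + " -> " + kv[1]);
        System.out.println(parse("malformed") == null);
    }
}
```

With the logic factored out like this, a unit test only needs to call `parse` with well-formed and malformed lines; the Mapper class itself becomes a thin wrapper that forwards each input line to this method and emits the result.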
When the program runs as expected against the small dataset, you are
ready to unleash it on a cluster. Running against the full dataset is
likely to expose some more issues, which you can fix as before, by
expanding your tests and mapper or reducer to handle the new cases.
Debugging failing programs in the cluster is a challenge, but Hadoop
provides some tools to help, such as an
IsolationRunner, which allows you to run a task
over the same input on which it failed, with a debugger attached, if necessary.
After the program is working, you may wish to do some tuning, first by running through some standard checks for making MapReduce programs faster and then by doing task profiling. Profiling distributed programs is not trivial, but Hadoop has hooks to aid the process.
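As a rough sketch of what those hooks look like, older Hadoop releases exposed task profiling through job configuration properties; the property names below follow the old `mapred.*` naming and the exact names and defaults vary by Hadoop version, so treat this as an illustrative config fragment rather than a definitive reference.

```xml
<!-- Illustrative fragment: enable HPROF-style profiling for a few tasks.
     Property names follow the old mapred.* convention and may differ in
     your Hadoop version. -->
<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <!-- Profile only a small range of map tasks to limit overhead -->
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
```

Profiling every task would be prohibitively expensive, which is why restricting profiling to a small range of tasks is the usual approach.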