Who says Pigs can’t fly? Knowing how to optimize your Pig Latin scripts can make a significant difference in how they perform. Pig is still a young project and does not have a sophisticated optimizer that can make the right choices. Instead, consistent with Pig’s philosophy of user choice, it relies on you to make these decisions. Beyond just optimizing your scripts, Pig and MapReduce can be tuned to perform better based on your workload. And there are ways to optimize your data layout as well. This chapter covers a number of features you can use to help Pig fly.
Before diving into the details of how to optimize your Pig Latin, it is worth understanding what factors tend to create bottlenecks in Pig jobs:
Input size
It might not seem that a massively parallel system should be I/O bound. Hadoop's parallelism reduces I/O, but it does not remove it entirely. You can always add more map tasks, but the law of diminishing returns comes into effect. Additional maps take more time to start up, and MapReduce has to find more slots in which to run them. If you have twice as many maps as your cluster has capacity to run, it will take twice your average map time to run all of your maps. Adding one more map in that case will actually make things worse, because the map time will increase to three times the average. Also, every record that is read might need to be decompressed and will need to be deserialized.
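The wave arithmetic above can be made concrete with a back-of-the-envelope sketch. The `job_time` helper and the slot counts below are illustrative, not part of Pig or Hadoop; the point is only that maps run in waves of at most one per available slot, so one extra map beyond a full wave adds a whole new wave:

```python
import math

def job_time(num_maps, slots, avg_map_time):
    """Rough map-phase duration: maps run in waves of at most `slots` at a time."""
    waves = math.ceil(num_maps / slots)
    return waves * avg_map_time

# 100 slots, 200 maps: two full waves, so twice the average map time.
print(job_time(200, 100, 1.0))  # 2.0
# One more map (201) spills into a third wave: three times the average.
print(job_time(201, 100, 1.0))  # 3.0
```

This ignores startup overhead and skewed map times, both of which make the real penalty for the extra wave even larger.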
Shuffle size
By shuffle size we mean the data that is moved from your ...