WHAT’S IN THIS CHAPTER?
- Understanding the factors that affect parallel scalable applications
- Optimizing scalable processing, especially processing that leverages the MapReduce model
- Presenting a set of best practices for parallel processing
- Illustrating a few Hadoop performance tuning tips
Today, much of the big data analysis in the world of NoSQL rests on the shoulders of the MapReduce model of processing. Hadoop is built on it, and NoSQL products that support huge data sizes leverage it. This chapter is a first look into optimizing scalable applications and tuning the way MapReduce-style processing works on large data sets. By no means does the chapter provide a prescriptive solution. Instead, it presents a few important concepts and good practices to bear in mind when optimizing a scalable parallel application. Each optimization problem is unique to its requirements and context, so a single, universally applicable solution is probably not feasible.
GOALS OF PARALLEL ALGORITHMS
MapReduce makes scalable parallel processing easier than it was in the past. By adhering to a model in which data is not shared between parallel threads or processes, MapReduce provides a bottleneck-free way of scaling out as workloads increase. The underlying goal at all times is to reduce latency and increase throughput.
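The shared-nothing idea described above can be sketched in a few lines. The following is a minimal illustration (not Hadoop code): each map task counts words in its own chunk with no shared state, and a single reduce step merges the independent partial results. The chunk contents and worker count are arbitrary choices for the example.

```python
from collections import Counter
from multiprocessing import Pool


def map_phase(chunk):
    """Map task: count words in one chunk; touches no shared state."""
    return Counter(chunk.split())


def reduce_phase(partials):
    """Reduce step: merge the independent partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Each chunk is processed by a separate worker process,
    # so no locks or shared memory are needed.
    chunks = ["big data big", "data scales out", "big scales"]
    with Pool(processes=3) as pool:
        partials = pool.map(map_phase, chunks)
    print(reduce_phase(partials))
```

Because the map tasks never communicate, adding workers (and splitting the input into more chunks) scales the map phase out without introducing contention, which is the property the MapReduce model relies on.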
The Implications of Reducing Latency
Reducing latency simply means reducing the execution time of a program. The faster a program ...