Scalable Parallel Processing with MapReduce
WHAT’S IN THIS CHAPTER?
- Understanding the challenges of scalable parallel processing
- Leveraging MapReduce for large scale parallel processing
- Exploring the concepts and nuances of the MapReduce computational model
- Getting hands-on MapReduce experience using MongoDB, CouchDB, and HBase
- Introducing Mahout, a MapReduce-based machine learning infrastructure
Manipulating large amounts of data requires tools and methods that can run operations in parallel with as few points of intersection among them as possible. Fewer points of intersection lead to fewer potential conflicts and less coordination overhead. Such parallel processing tools also need to keep data transfer to a minimum. I/O and bandwidth can often become bottlenecks that impede fast and efficient processing, and with large amounts of data these bottlenecks are amplified, potentially slowing a system to the point where it becomes impractical to use. Therefore, for large-scale computations, keeping data local to the computation is of immense importance. Given these considerations, manipulating large data sets spread across multiple machines is neither trivial nor easy.
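The MapReduce model addresses exactly these constraints: the map phase runs independently over each chunk of input with no shared state (few points of intersection), and only the compact intermediate key-value pairs are shuffled to the reduce phase (minimal data transfer). The following is a minimal, single-machine sketch of that flow using a word-count example; the function names and the three-chunk input are illustrative, not part of any particular framework's API:

```python
from collections import defaultdict

def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk.
    Each call depends only on its own chunk, so calls can run in parallel."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Shuffle step: group intermediate values by key across all map outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce step: aggregate the grouped values for each key."""
    return {word: sum(values) for word, values in grouped.items()}

# Illustrative input, split into chunks as a distributed file system might.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [map_words(c) for c in chunks]   # embarrassingly parallel
counts = reduce_counts(shuffle(mapped))
print(counts["the"], counts["fox"])       # 3 2
```

In a real framework such as Hadoop, the map tasks are scheduled on the machines that already hold the data chunks, which is how the model keeps data local to the computation.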
Over the years, many methods have been developed to process large data sets. Initially, innovation focused on building supercomputers: machines with far greater-than-normal processing capabilities. These machines work well for specific and complicated ...