The Hadoop Performance Myth

Hadoop is a popular (if not the de facto) framework for processing large data sets through distributed computing. YARN allowed Hadoop to evolve from a MapReduce engine into a big data ecosystem that can run heterogeneous (MapReduce and non-MapReduce) applications simultaneously. The result is larger clusters with more users and workloads than ever before. Traditional recommendations encourage provisioning, isolation, and tuning to increase performance and avoid resource contention, but these practices leave clusters highly underutilized. Herein we’ll review the challenges of improving performance and utilization in today’s dynamic, multitenant clusters, and how emerging tools help when best practices fall short.

The Challenge of Predictable Performance

Hadoop breaks a large computational problem into small, modular pieces spread across a cluster of commodity hardware. Each piece can run almost anywhere in the cluster and, as a result, may finish a little faster or slower depending on that machine’s specifications. Hadoop was designed with redundancy to keep this variability from hurting performance: if a particular task is running slower than expected, Hadoop may launch the same computation against another copy of the target data, and whichever attempt completes first wins.
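
This duplicate-task mechanism is Hadoop’s speculative execution, and it can be turned on or off per job. The sketch below is illustrative only; it assumes a Hadoop 2.x/3.x MapReduce client, and the property names shown (mapreduce.map.speculative, mapreduce.reduce.speculative) replaced different names used in the 1.x line, so verify them against your distribution.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Speculative execution relaunches a slow-running task attempt on
            // another node; the first attempt to finish wins and the loser is
            // killed. Both settings are typically enabled by default.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "speculation-demo");

            // The Job API exposes the same switches as per-job setters.
            job.setMapSpeculativeExecution(true);
            job.setReduceSpeculativeExecution(false);
        }
    }

The trade-off is extra cluster capacity spent on duplicate work in exchange for lower tail latency, which is why jobs on busy clusters sometimes disable speculation for one or both phases.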

Optimizing performance on Hadoop 1.0 wasn’t necessarily easy, but it involved fewer variables than later versions. Version 1.0 ran only a specific workload (MapReduce) with known ...
