Optimizing Hadoop for MapReduce

Book Description

This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it’s an example-driven tutorial for the real world.

In Detail

MapReduce is the programming model that the Hadoop MapReduce engine uses to distribute work across a cluster by processing smaller data sets in parallel. It is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.
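As a rough illustration of the idea of splitting work into smaller data sets, the following minimal sketch simulates the map, shuffle, and reduce steps of a word count in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions made for this example only.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(mapped_outputs):
    """Group all emitted values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical input splits; on a cluster each would go to its own map task.
splits = [
    ["hadoop runs mapreduce jobs"],
    ["mapreduce jobs run on hadoop"],
]

mapped = [map_phase(split) for split in splits]  # run sequentially here
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Each split is mapped independently of the others, which is what allows a real cluster to run the map tasks in parallel on different nodes.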

This book introduces you to advanced MapReduce concepts and teaches you everything from identifying the factors that affect MapReduce job performance to tuning the MapReduce configuration. Based on real-world experience, this book will help you to fully utilize your cluster’s node resources to run MapReduce jobs optimally.

This book details the process of optimizing Hadoop MapReduce job performance through a number of clear and practical steps.

Starting with how MapReduce works and the factors that affect MapReduce performance, you will be given an overview of Hadoop metrics and several performance monitoring tools. Further on, you will explore performance counters that help you identify resource bottlenecks, check cluster health, and size your Hadoop cluster. You will also learn about optimizing map and reduce tasks by using Combiners and compression.
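To give a feel for why a Combiner can help, here is a small sketch in plain Python (not Hadoop code) that pre-aggregates map output locally before it would be sent to the reducers. The helper names `map_words` and `combine` and the sample log lines are hypothetical and exist only for this illustration.

```python
from collections import Counter

def map_words(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Pre-aggregate pairs on the map side, as a Combiner would."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

# Hypothetical input split handled by a single map task.
split = ["error error warn", "error info error"]

raw = map_words(split)    # records that would be shuffled without a Combiner
combined = combine(raw)   # records that would be shuffled with a Combiner

print(len(raw), len(combined))
```

In this toy input the combined output carries fewer records than the raw map output while preserving the same totals, which is the effect that reduces shuffle traffic between map and reduce tasks.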

The book ends with best practices and recommendations on how to use your Hadoop cluster optimally.

What You Will Learn

  • Learn about the factors that affect MapReduce performance
  • Utilize the Hadoop MapReduce performance counters to identify resource bottlenecks
  • Size your Hadoop cluster's nodes
  • Set the number of mappers and reducers correctly
  • Optimize mapper and reducer task throughput and code size using compression and Combiners
  • Understand the various tuning properties and best practices to optimize clusters

    Table of Contents

    1. Optimizing Hadoop for MapReduce
      1. Table of Contents
      2. Optimizing Hadoop for MapReduce
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Errata
          2. Piracy
          3. Questions
      9. 1. Understanding Hadoop MapReduce
        1. The MapReduce model
        2. An overview of Hadoop MapReduce
        3. Hadoop MapReduce internals
        4. Factors affecting the performance of MapReduce
        5. Summary
      10. 2. An Overview of the Hadoop Parameters
        1. Investigating the Hadoop parameters
          1. The mapred-site.xml configuration file
            1. The CPU-related parameters
            2. The disk I/O related parameters
            3. The memory-related parameters
            4. The network-related parameters
          2. The hdfs-site.xml configuration file
          3. The core-site.xml configuration file
        2. Hadoop MapReduce metrics
        3. Performance monitoring tools
          1. Using Chukwa to monitor Hadoop
          2. Using Ganglia to monitor Hadoop
          3. Using Nagios to monitor Hadoop
          4. Using Apache Ambari to monitor Hadoop
        4. Summary
      11. 3. Detecting System Bottlenecks
        1. Performance tuning
        2. Creating a performance baseline
        3. Identifying resource bottlenecks
          1. Identifying RAM bottlenecks
          2. Identifying CPU bottlenecks
          3. Identifying storage bottlenecks
          4. Identifying network bandwidth bottlenecks
        4. Summary
      12. 4. Identifying Resource Weaknesses
        1. Identifying cluster weakness
          1. Checking the Hadoop cluster node's health
          2. Checking the input data size
          3. Checking massive I/O and network traffic
          4. Checking for insufficient concurrent tasks
          5. Checking for CPU contention
        2. Sizing your Hadoop cluster
        3. Configuring your cluster correctly
        4. Summary
      13. 5. Enhancing Map and Reduce Tasks
        1. Enhancing map tasks
          1. Input data and block size impact
          2. Dealing with small and unsplittable files
          3. Reducing spilled records during the Map phase
          4. Calculating map tasks' throughput
        2. Enhancing reduce tasks
          1. Calculating reduce tasks' throughput
          2. Improving Reduce execution phase
        3. Tuning map and reduce parameters
        4. Summary
      14. 6. Optimizing MapReduce Tasks
        1. Using Combiners
        2. Using compression
        3. Using appropriate Writable types
        4. Reusing types smartly
        5. Optimizing mappers and reducers code
        6. Summary
      15. 7. Best Practices and Recommendations
        1. Hardware tuning and OS recommendations
          1. The Hadoop cluster checklist
          2. The BIOS tuning checklist
          3. OS configuration recommendations
        2. Hadoop best practices and recommendations
          1. Deploying Hadoop
          2. Hadoop tuning recommendations
          3. Using a MapReduce template class code
        3. Summary
      16. Index