Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Book Description

Get Started Fast with Apache Hadoop® 2, YARN, and Today’s Hadoop Ecosystem

With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.

Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.

Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more.

This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.

Coverage Includes

  • Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1's MapReduce

  • Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses

  • Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters

  • Exploring the Hadoop Distributed File System (HDFS)

  • Understanding the essentials of MapReduce and YARN application programming

  • Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase

  • Observing application progress, controlling jobs, and managing workflows

  • Managing Hadoop efficiently with Apache Ambari, including recipes for the HDFS-to-NFSv3 gateway, HDFS snapshots, and YARN configuration

  • Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark

Table of Contents

    1. About This E-Book
    2. Title Page
    3. Copyright Page
    4. Contents
    5. Foreword
    6. Preface
      1. Focus of the Book
      2. Who Should Read This Book
      3. Book Structure
        1. Book Conventions
        2. Accompanying Code
    7. Acknowledgments
    8. About the Author
    9. 1. Background and Concepts
      1. Defining Apache Hadoop
      2. A Brief History of Apache Hadoop
      3. Defining Big Data
      4. Hadoop as a Data Lake
      5. Using Hadoop: Administrator, User, or Both
      6. First There Was MapReduce
        1. Apache Hadoop Design Principles
        2. Apache Hadoop MapReduce Example
        3. MapReduce Advantages
        4. Apache Hadoop V1 MapReduce Operation
      7. Moving Beyond MapReduce with Hadoop V2
        1. Hadoop V2 YARN Operation Design
      8. The Apache Hadoop Project Ecosystem
      9. Summary and Additional Resources
    10. 2. Installation Recipes
      1. Core Hadoop Services
        1. Hadoop Configuration Files
      2. Planning Your Resources
        1. Hardware Choices
        2. Software Choices
      3. Installing on a Desktop or Laptop
        1. Installing Hortonworks HDP 2.2 Sandbox
        2. Installing Hadoop from Apache Sources
      4. Installing Hadoop with Ambari
        1. Performing an Ambari Installation
        2. Undoing the Ambari Install
      5. Installing Hadoop in the Cloud Using Apache Whirr
        1. Step 1: Install Whirr
        2. Step 2: Configure Whirr
        3. Step 3: Launch the Cluster
        4. Step 4: Take Down Your Cluster
      6. Summary and Additional Resources
    11. 3. Hadoop Distributed File System Basics
      1. Hadoop Distributed File System Design Features
      2. HDFS Components
        1. HDFS Block Replication
        2. HDFS Safe Mode
        3. Rack Awareness
        4. NameNode High Availability
        5. HDFS NameNode Federation
        6. HDFS Checkpoints and Backups
        7. HDFS Snapshots
        8. HDFS NFS Gateway
      3. HDFS User Commands
        1. Brief HDFS Command Reference
        2. General HDFS Commands
        3. List Files in HDFS
        4. Make a Directory in HDFS
        5. Copy Files to HDFS
        6. Copy Files from HDFS
        7. Copy Files within HDFS
        8. Delete a File within HDFS
        9. Delete a Directory in HDFS
        10. Get an HDFS Status Report
      4. HDFS Web GUI
      5. Using HDFS in Programs
        1. HDFS Java Application Example
        2. HDFS C Application Example
      6. Summary and Additional Resources
    12. 4. Running Example Programs and Benchmarks
      1. Running MapReduce Examples
        1. Listing Available Examples
        2. Running the Pi Example
        3. Using the Web GUI to Monitor Examples
      2. Running Basic Hadoop Benchmarks
        1. Running the Terasort Test
        2. Running the TestDFSIO Benchmark
        3. Managing Hadoop MapReduce Jobs
      3. Summary and Additional Resources
    13. 5. Hadoop MapReduce Framework
      1. The MapReduce Model
      2. MapReduce Parallel Data Flow
      3. Fault Tolerance and Speculative Execution
        1. Speculative Execution
        2. Hadoop MapReduce Hardware
      4. Summary and Additional Resources
    14. 6. MapReduce Programming
      1. Compiling and Running the Hadoop WordCount Example
      2. Using the Streaming Interface
      3. Using the Pipes Interface
      4. Compiling and Running the Hadoop Grep Chaining Example
      5. Debugging MapReduce
        1. Listing, Killing, and Job Status
        2. Hadoop Log Management
      6. Summary and Additional Resources
    15. 7. Essential Hadoop Tools
      1. Using Apache Pig
        1. Pig Example Walk-Through
      2. Using Apache Hive
        1. Hive Example Walk-Through
        2. A More Advanced Hive Example
      3. Using Apache Sqoop to Acquire Relational Data
        1. Apache Sqoop Import and Export Methods
        2. Apache Sqoop Version Changes
        3. Sqoop Example Walk-Through
      4. Using Apache Flume to Acquire Data Streams
        1. Flume Example Walk-Through
      5. Manage Hadoop Workflows with Apache Oozie
        1. Oozie Example Walk-Through
      6. Using Apache HBase
        1. HBase Data Model Overview
        2. HBase Example Walk-Through
      7. Summary and Additional Resources
    16. 8. Hadoop YARN Applications
      1. YARN Distributed-Shell
      2. Using the YARN Distributed-Shell
        1. A Simple Example
        2. Using More Containers
        3. Distributed-Shell Examples with Shell Arguments
      3. Structure of YARN Applications
      4. YARN Application Frameworks
        1. Distributed-Shell
        2. Hadoop MapReduce
        3. Apache Tez
        4. Apache Giraph
        5. Hoya: HBase on YARN
        6. Dryad on YARN
        7. Apache Spark
        8. Apache Storm
        9. Apache REEF: Retainable Evaluator Execution Framework
        10. Hamster: Hadoop and MPI on the Same Cluster
        11. Apache Flink: Scalable Batch and Stream Data Processing
        12. Apache Slider: Dynamic Application Management
      5. Summary and Additional Resources
    17. 9. Managing Hadoop with Apache Ambari
      1. Quick Tour of Apache Ambari
        1. Dashboard View
        2. Services View
        3. Hosts View
        4. Admin View
        5. Views View
        6. Admin Pull-Down Menu
      2. Managing Hadoop Services
      3. Changing Hadoop Properties
      4. Summary and Additional Resources
    18. 10. Basic Hadoop Administration Procedures
      1. Basic Hadoop YARN Administration
        1. Decommissioning YARN Nodes
        2. YARN WebProxy
        3. Using the JobHistoryServer
        4. Managing YARN Jobs
        5. Setting Container Memory
        6. Setting Container Cores
        7. Setting MapReduce Properties
      2. Basic HDFS Administration
        1. The NameNode User Interface
        2. Adding Users to HDFS
        3. Perform an FSCK on HDFS
        4. Balancing HDFS
        5. HDFS Safe Mode
        6. Decommissioning HDFS Nodes
        7. SecondaryNameNode
        8. HDFS Snapshots
        9. Configuring an NFSv3 Gateway to HDFS
      3. Capacity Scheduler Background
      4. Hadoop Version 2 MapReduce Compatibility
        1. Enabling ApplicationMaster Restarts
        2. Calculating the Capacity of a Node
        3. Running Hadoop Version 1 Applications
      5. Summary and Additional Resources
    19. A. Book Webpage and Code Download
    20. B. Getting Started Flowchart and Troubleshooting Guide
      1. Getting Started Flowchart
      2. General Hadoop Troubleshooting Guide
        1. Rule 1: Don’t Panic
        2. Rule 2: Install and Use Ambari
        3. Rule 3: Check the Logs
        4. Rule 4: Simplify the Situation
        5. Rule 5: Ask the Internet
        6. Other Helpful Tips
    21. C. Summary of Apache Hadoop Resources by Topic
      1. General Hadoop Information
      2. Hadoop Installation Recipes
      3. HDFS
      4. Examples
      5. MapReduce
      6. MapReduce Programming
      7. Essential Tools
      8. YARN Application Frameworks
      9. Ambari Administration
      10. Basic Hadoop Administration
    22. D. Installing the Hue Hadoop GUI
      1. Hue Installation
        1. Steps Performed with Ambari
        2. Install and Configure Hue
      2. Starting Hue
      3. Hue User Interface
    23. E. Installing Apache Spark
      1. Spark Installation on a Cluster
      2. Starting Spark across the Cluster
      3. Installing and Starting Spark on the Pseudo-distributed Single-Node Installation
      4. Run Spark Examples
    24. Index
    25. Code Snippets