Expert Hadoop 2 Administration: Managing Spark, YARN, and MapReduce

Book Description

This is the Rough Cut version of the printed book.

Stop searching the web for out-of-date, fragmentary, and unreliable information about running Hadoop! Now, there's a single source for all the authoritative knowledge and trustworthy procedures you need: Expert Hadoop® Administration: Managing Spark, YARN, and HDFS.

Pioneering Hadoop/Big Data administrator Sam R. Alapati shares step-by-step procedures for confidently performing every important task involved in creating, configuring, securing, managing, and optimizing production Hadoop clusters. The only Hadoop administration guide written by a working Hadoop administrator, Expert Hadoop® Administration covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati shares proven answers to complex configuration, management, and performance-tuning problems Hadoop administrators constantly encounter, and expert guidance for customizing Hadoop 2's intensely complex environment. Throughout, he integrates action-oriented advice with carefully researched explanations of both problems and solutions. Coverage includes

  • Indispensable Hadoop concepts, including architecture, clusters, and application frameworks

  • Configuring high-reliability, high-performance Hadoop environments

  • Managing and protecting Hadoop data, including HDFS management, compression, data formats, and NameNode high availability

  • Moving data, allocating resources, and scheduling jobs with YARN, and managing job workflows with Oozie and Hue

  • Hadoop security, monitoring, logging, and benchmarking

  • Troubleshooting root causes of severe performance slowdowns

  • Preventing trouble by proactively maintaining healthy Hadoop environments

  • Installing Hadoop virtual environments, and more

Table of Contents

    1. Contents
    2. 1. Introduction to Hadoop 2 and its Environment
      1. Hadoop 2—An Introduction
      2. Cluster Computing and Hadoop Clusters
      3. Hadoop Components and the Hadoop Ecosphere
      4. What Do Hadoop Administrators Do?
      5. Key Differences between Hadoop 1 and Hadoop 2
      6. Distributed Data Processing: MapReduce and Spark, Hive and Pig
      7. Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
      8. Key Areas of Hadoop Administration
      9. Summary
    3. 2. An Introduction to the Architecture of Hadoop 2
      1. Distributed Computing and Hadoop
      2. Hadoop 2 Architecture
      3. Data Storage – the Hadoop Distributed File System
      4. Data Processing with YARN, the Hadoop Operating System
      5. Summary
    4. 3. Creating and Configuring a Simple Hadoop 2 Cluster
      1. Hadoop Distributions and Installation Types
      2. Setting up a Pseudo-Distributed Hadoop 2 Cluster
      3. Performing the Initial Hadoop Configuration
      4. Operating the New Hadoop Cluster
      5. Summary
    5. 4. Planning for and Creating a Fully Distributed Cluster
      1. Planning your Hadoop Cluster
      2. Going from a Single Rack to Multiple Racks
      3. Creating a Multi-Node Cluster
      4. Modifying the Hadoop Configuration
      5. Starting up the Cluster
      6. Configuring Hadoop Services, Web Interfaces and Ports
      7. Summary
    6. 6. Running Applications in a Cluster – the Spark Framework
      1. What is Spark?
      2. Why Spark?
      3. The Spark Stack
      4. Installing Spark
      5. Spark Run Modes
      6. Understanding the Cluster Managers
      7. Spark and Data Access
      8. Summary
    7. 7. Running Spark Applications
      1. The Spark Programming Model
      2. Spark Applications
      3. Architecture of a Spark Application
      4. Running Spark Applications
      5. Creating and Running Spark Applications
      6. Configuring Spark Applications
      7. Monitoring Spark Applications
      8. Handling Streaming Data with Spark Streaming
      9. Using Spark SQL for Handling Structured Data
      10. Summary
    8. 8. The Role of the NameNode and How HDFS Works
      1. HDFS—The Interaction between the NameNode and the DataNodes
      2. Rack Awareness and Topology
      3. HDFS Data Replication
      4. How Clients Read and Write HDFS Data
      5. Understanding HDFS Recovery Processes
      6. Centralized Cache Management in HDFS
      7. Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
      8. Summary
    9. 9. HDFS Commands, HDFS Permissions, and HDFS Storage
      1. Managing HDFS through the HDFS Shell Commands
      2. Using the dfsadmin Utility to Perform HDFS Operations
      3. Managing HDFS Permissions and Users
      4. Managing HDFS Storage
      5. Rebalancing HDFS Data
      6. Reclaiming HDFS Space
      7. Summary
    10. 10. Data Protection, Compression and Hadoop Data Formats
      1. Safeguarding Data
      2. Data Compression
      3. Hadoop File Formats
      4. Using Hadoop WebHDFS and HttpFs
      5. Summary
    11. 11. NameNode Operations and High Availability
      1. Understanding NameNode Operations
      2. The Checkpointing Process
      3. Configuring HDFS High Availability
      4. HDFS Federation
      5. Summary
    12. 12. Moving Data into and out of Hadoop
      1. Introduction to Hadoop Data Transfer Tools
      2. Loading Data into HDFS from the Command Line
      3. Copying HDFS Data between Clusters with DistCp
      4. Ingesting Data from Relational Databases with Sqoop
      5. Ingesting Data from External Sources with Flume
      6. Summary
    13. 13. Resource Allocation in a Hadoop Cluster
      1. Resource Allocation in Hadoop 2
      2. The FIFO Scheduler
      3. The Capacity Scheduler
      4. The Fair Scheduler
      5. Comparing the Capacity Scheduler and the Fair Scheduler
      6. Summary
    14. 14. Working with Oozie to Manage Job Workflows
      1. Using Apache Oozie to Schedule Jobs
      2. Oozie Architecture
      3. Deploying Oozie in your Cluster
      4. Understanding Oozie Workflows
      5. How Oozie Runs an Action
      6. Creating an Oozie Workflow
      7. Running an Oozie Workflow Job
      8. Oozie Coordinators
      9. Managing and Administering Oozie
      10. Summary
    15. 15. Securing Hadoop
      1. Hadoop Security – an Overview
      2. Hadoop Authentication with Kerberos
      3. Hadoop Authorization
      4. Auditing Hadoop
      5. Securing Hadoop Data
      6. Other Hadoop Related Security Initiatives
      7. Summary