Learning YARN

Book Description

Moving beyond MapReduce - learn resource management and big data processing using YARN

About This Book

  • Deep dive into YARN components, schedulers, life cycle management and security architecture

  • Create your own Hadoop-YARN applications and integrate big data technologies with YARN

  • Step-by-step guide to provision, manage, and monitor Hadoop-YARN clusters with ease

Who This Book Is For

This book is intended for those who want to understand what YARN is and how to use it efficiently for the resource management of large clusters. For cluster administrators, this book gives a detailed explanation of provisioning and managing YARN clusters. If you are a Java developer or an open source contributor, this book will help you drill down into the YARN architecture, write your own YARN applications, and understand the application execution phases. This book will also help big data engineers explore YARN integration with real-time analytics technologies such as Spark and Storm.

What You Will Learn

  • Explore YARN features and offerings

  • Manage big data clusters efficiently using the YARN framework

  • Create single-node as well as multi-node Hadoop-YARN clusters on Linux machines

  • Understand YARN components and their administration

  • Gain insights into application execution flow over a YARN cluster

  • Write your own distributed application and execute it over a YARN cluster

  • Work with schedulers and queues for efficient scheduling of applications

  • Integrate big data projects like Spark and Storm with YARN

In Detail

Today, enterprises generate huge volumes of data. To provide effective services and make better-informed decisions from that data, they use big data analytics. In recent years, Hadoop has been used for massive data storage and efficient distributed processing of data. The Yet Another Resource Negotiator (YARN) framework solves the resource management design problems of the Hadoop 1.x framework by providing a more scalable, efficient, flexible, and highly available resource management framework for distributed data processing.

This book starts with an overview of YARN features and explains how YARN provides a business solution for growing big data needs. You will learn to provision and manage single-node as well as multi-node Hadoop-YARN clusters. You will walk through YARN administration, life cycle management, application execution, REST APIs, schedulers, and the security framework. You will gain insights into YARN components and features such as ResourceManager, NodeManager, ApplicationMaster, Container, Timeline Server, High Availability, and Resource Localization.

The book explains Hadoop-YARN commands and component configurations, and explores topics such as High Availability, Resource Localization, and Log Aggregation. You will then be ready to develop your own ApplicationMaster and execute it over a Hadoop-YARN cluster.

Towards the end of the book, you will learn about the security architecture and the integration of YARN with big data technologies like Spark and Storm. This book promises conceptual as well as practical knowledge of resource management using YARN.

Style and approach

Starting with the basics and covering the core concepts along with their practical usage, this tutorial is a complete guide to learning and exploring YARN offerings.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Learning YARN
      1. Table of Contents
      2. Learning YARN
      3. Credits
      4. About the Authors
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Starting with YARN Basics
        1. Introduction to MapReduce v1
        2. Shortcomings of MapReduce v1
        3. An overview of YARN components
          1. ResourceManager
          2. NodeManager
          3. ApplicationMaster
          4. Container
        4. The YARN architecture
        5. How YARN satisfies big data needs
        6. Projects powered by YARN
        7. Summary
      10. 2. Setting up a Hadoop-YARN Cluster
        1. Starting with the basics
          1. Supported platforms
          2. Hardware requirements
          3. Software requirements
          4. Basic Linux commands / utilities
            1. Sudo
            2. Nano editor
            3. Source
            4. Jps
            5. Netstat
            6. Man
          5. Preparing a node for a Hadoop-YARN cluster
            1. Install Java
            2. Create a Hadoop dedicated user and group
            3. Disable firewall or open Hadoop ports
            4. Configure the domain name resolution
            5. Install SSH and configure passwordless SSH from the master to all slaves
        2. The Hadoop-YARN single node installation
          1. Prerequisites
          2. Installation steps
            1. Step 1 – Download and extract the Hadoop bundle
            2. Step 2 – Configure the environment variables
            3. Step 3 – Configure the Hadoop configuration files
              1. The core-site.xml file
              2. The hdfs-site.xml file
              3. The mapred-site.xml file
              4. The yarn-site.xml file
              5. The hadoop-env.sh and yarn-env.sh files
              6. The slaves file
            4. Step 4 – Format NameNode
            5. Step 5 – Start Hadoop daemons
        3. An overview of web user interfaces
          1. Run a sample application
        4. The Hadoop-YARN multi-node installation
          1. Prerequisites
          2. Installation steps
            1. Step 1 – Configure the master node as a single-node Hadoop-YARN installation
            2. Step 2 – Copy the Hadoop folder to all the slave nodes
            3. Step 3 – Configure environment variables on slave nodes
            4. Step 4 – Format NameNode
            5. Step 5 – Start Hadoop daemons
        5. An overview of the Hortonworks and Cloudera installations
        6. Summary
      11. 3. Administering a Hadoop-YARN Cluster
        1. Using the Hadoop-YARN commands
          1. The user commands
            1. Jar
            2. Application
              1. Command options
              2. Sample output
            3. Node
              1. Command options
              2. Sample output
            4. Logs
              1. Command options
            5. Classpath
            6. Version
          2. Administration commands
            1. ResourceManager / NodeManager / ProxyServer
            2. RMAdmin
              1. Command options
            3. DaemonLog
              1. Command options
        2. Configuring the Hadoop-YARN services
          1. The ResourceManager service
          2. The NodeManager service
          3. The Timeline server
          4. The web application proxy server
          5. Ports summary
        3. Managing the Hadoop-YARN services
          1. Managing service logs
          2. Managing pid files
        4. Monitoring the YARN services
          1. JMX monitoring
            1. The ResourceManager JMX beans
            2. The NodeManager JMX beans
          2. Ganglia monitoring
            1. Ganglia daemons
            2. Integrating Ganglia with Hadoop
        5. Understanding ResourceManager's High Availability
          1. Architecture
          2. Failover mechanisms
          3. Configuring ResourceManager's High Availability
            1. Define nodes
            2. The RM state store mechanism
            3. The failover proxy provider
            4. Automatic failover
          4. High Availability admin commands
        6. Monitoring NodeManager's health
          1. The health checker script
        7. Summary
      12. 4. Executing Applications Using YARN
        1. Understanding application execution flow
          1. Phase 1 – Application initialization and submission
          2. Phase 2 – Allocate memory and start ApplicationMaster
          3. Phase 3 – ApplicationMaster registration and resource allocation
          4. Phase 4 – Launch and monitor containers
          5. Phase 5 – Application progress report
          6. Phase 6 – Application completion
        2. Submitting a sample MapReduce application
          1. Submitting an application to the cluster
          2. Updates in the ResourceManager web UI
          3. Understanding the application process
          4. Tracking application details
          5. The ApplicationMaster process
          6. Cluster nodes information
          7. Node's container list
          8. YARN child processes
          9. Application details after completion
        3. Handling failures in YARN
          1. The container failure
          2. The NodeManager failure
          3. The ResourceManager failure
        4. YARN application logging
          1. Services logs
          2. Application logs
        5. Summary
      13. 5. Understanding YARN Life Cycle Management
        1. An introduction to state management analogy
        2. The ResourceManager's view
          1. View 1 – Node
          2. View 2 – Application
          3. View 3 – An application attempt
          4. View 4 – Container
        3. The NodeManager's view
          1. View 1 – Application
          2. View 2 – Container
          3. View 3 – A localized resource
        4. Analyzing transitions through logs
          1. NodeManager registration with ResourceManager
          2. Application submission
          3. Container resource allocation
          4. Resource localization
        5. Summary
      14. 6. Migrating from MRv1 to MRv2
        1. Introducing MRv1 and MRv2
        2. High-level changes from MRv1 to MRv2
          1. The evolution of the MRApplicationMaster service
          2. Resource capability
          3. Pluggable shuffle
          4. Hierarchical queues and fair scheduler
          5. Task execution as containers
        3. The migration steps from MRv1 to MRv2
          1. Configuration changes
          2. The binary / source compatibility
        4. Running and monitoring MRv1 apps on YARN
        5. Summary
      15. 7. Writing Your Own YARN Applications
        1. An introduction to the YARN API
          1. YARNConfiguration
            1. Load resources
            2. Final properties
            3. Variable expansion
          2. ApplicationSubmissionContext
          3. ContainerLaunchContext
          4. Communication protocols
            1. ApplicationClientProtocol
            2. ApplicationMasterProtocol
            3. ContainerManagementProtocol
            4. ApplicationHistoryProtocol
          5. YARN client API
        2. Writing your own application
          1. Step 1 – Create a new project and add Hadoop-YARN JAR files
          2. Step 2 – Define the ApplicationMaster and client classes
            1. Define an ApplicationMaster
            2. Define a YARN client
          3. Step 3 – Export the project and copy resources
          4. Step 4 – Run the application using bin or the YARN command
        3. Summary
      16. 8. Dive Deep into YARN Components
        1. Understanding ResourceManager
          1. The client and admin interfaces
          2. The core interfaces
          3. The NodeManager interfaces
          4. The security and token managers
        2. Understanding NodeManager
          1. Status updates
          2. State and health management
          3. Container management
          4. The security and token managers
        3. The YARN Timeline server
        4. The web application proxy server
        5. YARN Scheduler Load Simulator (SLS)
        6. Handling resource localization in YARN
          1. Resource localization terminologies
          2. The resource localization directory structure
        7. Summary
      17. 9. Exploring YARN REST Services
        1. Introduction to YARN REST services
          1. HTTP request and response
            1. Successful response
            2. Response with an error
        2. ResourceManager REST APIs
          1. The cluster summary
          2. Scheduler details
          3. Nodes
          4. Applications
        3. NodeManager REST APIs
          1. The node summary
          2. Applications
          3. Containers
        4. MapReduce ApplicationMaster REST APIs
          1. ApplicationMaster summary
          2. Jobs
          3. Tasks
        5. MapReduce HistoryServer REST APIs
        6. How to access REST services
          1. RESTClient plugins
          2. Curl command
          3. Java API
        7. Summary
      18. 10. Scheduling YARN Applications
        1. An introduction to scheduling in YARN
        2. An overview of queues
        3. Types of queues
          1. CapacityScheduler Queue (CSQueue)
          2. FairScheduler Queue (FSQueue)
        4. An introduction to schedulers
          1. Fair scheduler
            1. Hierarchical queues
            2. Schedulable
            3. Scheduling policy
            4. Configuring a fair scheduler
          2. CapacityScheduler
            1. Configuring CapacityScheduler
        5. Summary
      19. 11. Enabling Security in YARN
        1. Adding security to a YARN cluster
          1. Using a dedicated user group for Hadoop-YARN daemons
          2. Validating permissions to YARN directories
          3. Enabling the HTTPS protocol
          4. Enabling authorization using Access Control Lists
          5. Enabling authentication using Kerberos
        2. Working with ACLs
          1. Defining an ACL value
          2. Type of ACLs
            1. The administration ACL
            2. The service-level ACL
            3. The queue ACL
            4. The application ACL
        3. Other security frameworks
          1. Apache Ranger
          2. Apache Knox
        4. Summary
      20. 12. Real-time Data Analytics Using YARN
        1. The integration of Spark with YARN
          1. Running Spark on YARN
        2. The integration of Storm with YARN
          1. Running Storm on YARN
            1. Create a Zookeeper quorum
            2. Download, extract, and prepare the Storm bundle
            3. Copy Storm ZIP to HDFS
            4. Configuring the storm.yaml file
            5. Launching the Storm-YARN cluster
          2. Managing Storm on YARN
        3. The integration of HAMA and Giraph with YARN
        4. Summary
      21. Index