
A guest post by Martin ‘MC’ Brown, the author of and contributor to over 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms, including Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS and more. Martin currently works as the Director of Documentation for Continuent.

Within Hadoop, the key processing environment is the MapReduce system, which distills large volumes of data into a simplified, summarized analysis. MapReduce requires significant resources to operate, and those resources have historically been difficult to manage, especially in multi-tenant environments where multiple jobs compete for them.

YARN, alongside an updated MapReduce 2 (MR2) environment, introduces significant differences and advantages compared to the original MapReduce implementation. The primary difference is that the resource management and the application execution systems have been separated.

YARN provides the primary resource management architecture, while the core operation of executing the MapReduce tasks is part of the larger application execution. The result is that the execution of the individual components can be controlled much more effectively, and the resources across your Hadoop cluster are managed more effectively, particularly when operating with multiple jobs and applications. This makes multi-tenant environments, where multiple applications are executed simultaneously, much easier and more straightforward to manage.

YARN’s ResourceRequest Structure

The key element to this is the new ResourceRequest structure, which is used by the ResourceManager to control jobs as they are executed within the cluster. The ResourceManager schedules all jobs and enables YARN to more effectively manage jobs by knowing the structure and capabilities of the cluster. The ResourceRequest is created by each application and provides the ResourceManager with the information needed to make decisions about executing the job. The ResourceRequest is composed of four different components:

  • The resource-name is the name of a host or rack, or the wildcard (*), which means the job can run anywhere. The specification here is key, because in the future it will enable finer selection of requirements, such as requesting specific machines, or specifying whether a job is safe to be executed on nodes running within virtual machines.
  • The priority sets the priority of this request relative to other requests within the overall application.
  • The resource-requirement specifies any special RAM or CPU requirements. This can be useful in systems where hardware FPU speeds, RAM availability, or physical/virtual core differences may affect execution.
  • The number-of-containers is the number of this type of resource required for the overall job.

Containers are used to specify the executable component and capabilities, such as CPU or RAM, and each node may define multiple containers. Thus, a ResourceRequest defines how many of these containers of specific types are required, and it is up to YARN to match up the requested containers against the available containers across the cluster.
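
To make the four components concrete, the sketch below assembles a single ResourceRequest using the Hadoop 2 YARN records API (org.apache.hadoop.yarn.api.records). It is an illustrative sketch only: the class name, memory, core, priority and container values are arbitrary choices rather than recommendations.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceRequest;

    public class ResourceRequestSketch {
        public static ResourceRequest buildRequest() {
            // resource-requirement: 2048 MB of RAM and 2 virtual cores per container
            Resource capability = Resource.newInstance(2048, 2);

            // priority of this request within the overall application
            Priority priority = Priority.newInstance(1);

            // resource-name: ResourceRequest.ANY ("*") means the containers can run
            // anywhere; a host or rack name can be supplied for finer placement.
            // number-of-containers: four containers of this type are requested.
            return ResourceRequest.newInstance(priority, ResourceRequest.ANY, capability, 4);
        }
    }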

A typical application will therefore be made up of multiple jobs, each having their own ResourceRequest. YARN takes in this information and translates that into a list of valid hosts on which the job can be executed.

An additional level of control is provided in that each ResourceRequest can be either a strict requirement or negotiable. For example, if your application has strict RAM requirements, YARN can take the request and execute the job only on suitable hosts. Conversely, the RAM can be requested but treated as non-critical; if higher-RAM nodes are not available, the job will be executed on nodes with less RAM.

YARN Schedulers and Queues

YARN manages these requirements and expectations through a further series of schedulers, selected through the YARN configuration (the default is defined in yarn-default.xml and is typically overridden in yarn-site.xml). The FIFO (First In, First Out) scheduler simply runs jobs in the order they were received. If your Hadoop cluster is used by a number of different parties, the capacity scheduler may make more sense, as it distributes work according to the resource requirements and the overall capacity. It also provides an access control list implementation that controls which users and groups can submit work to the different queues of jobs within the Hadoop cluster.

The final scheduler, the ‘Fair’ scheduler, attempts to combine these models to provide multi-tenant scheduling across differing priorities and resource requirements through an entirely queue-based system. The Fair scheduler can also be configured with a number of hierarchical queues, which means that jobs can be executed through a series of queues in parallel at different stages of the overall job execution.

Schedulers within YARN are pluggable, allowing you to go beyond the three standard implementations, which offer different levels of control and multi-tenant capability.
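
For reference, the scheduler implementation is selected with the yarn.resourcemanager.scheduler.class property; a minimal fragment, shown here with the capacity scheduler class as an illustrative choice, looks like this:

    <!-- yarn-site.xml (illustrative): selecting the capacity scheduler -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>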

For example, with the capacity scheduler, overall capacity is controlled through the capacity-scheduler.xml file within the configuration, and in many cases it can be reconfigured dynamically, with the reconfiguration pushed out from the command line:
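
    yarn rmadmin -refreshQueues

The refreshQueues admin command asks a running ResourceManager to reload the queue definitions from capacity-scheduler.xml without a restart.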

Overall control over the queues and capacities is driven by a simple structure that can be expanded into a hierarchical model. For example, you can configure a ‘root’ base queue; the capacity-scheduler.xml fragment below is a representative sketch, with illustrative queue names and values:
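
    <!-- capacity-scheduler.xml (illustrative values): two queues under root -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>fast,slow</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.fast.capacity</name>
      <value>80</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.slow.capacity</name>
      <value>20</value>
    </property>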

This defines two queues: fast, with 80% of the overall capacity, and slow, with 20%.

Queues can be further subdivided; for example, the fast queue could itself be split into two child queues (again, the fragment below is illustrative):
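
    <!-- capacity-scheduler.xml (illustrative values): child queues of root.fast -->
    <property>
      <name>yarn.scheduler.capacity.root.fast.queues</name>
      <value>sub,quick</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.fast.sub.capacity</name>
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.fast.quick.capacity</name>
      <value>40</value>
    </property>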

This results in the fast queue keeping 80% of the overall capacity, with its sub queue receiving 60% of that 80% (48% of the entire cluster capacity) and its quick queue receiving the remaining 40% (32% of the cluster).

This level of fine-grained control over capacity extends across the cluster to CPU and RAM capacity as well.

It should be noted that the capacity scheduler is similar to that provided by the previous release of Hadoop, albeit with significantly more control through the ResourceRequest environment. Equally, most Hadoop 1 MapReduce jobs can be executed on Hadoop 2/YARN without any modification. What YARN offers is the ability to control that execution much more effectively.

A major advantage of Hadoop 2/YARN is the management of the actual application execution. Instead of the JobTracker being responsible for execution management, an ApplicationMaster process is created to manage each application’s run. Unlike the JobTracker, ApplicationMasters are executed out on the cluster’s nodes and can be restarted more easily. They can also fail, however, and because they are distributed, they are subject to proportionately more failures.

Conclusion

Overall, YARN provides a much more powerful and richer environment for executing jobs within Hadoop, which means jobs can be executed more effectively, whether you are running one job or a hundred at the same time. Although YARN requires more work to set up effectively, the longer-lasting effects on the capability of your cluster are worth the additional effort.

Look below for some great Big Data books from Safari Books Online.


Safari Books Online has the content you need

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. The book covers (un)loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.
Getting Started with CouchDB covers how CouchDB’s simple model for storing, processing, and accessing data makes it ideal for the type of data and rapid response users now demand from your applications—and how easy CouchDB is to set up, deploy, maintain, and scale.
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 is written by YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli and demonstrates how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.
Professional Hadoop Solutions is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth.
Data Science for Business introduces you to the fundamental principles of data science and walks you through the “data-analytic thinking” necessary for extracting useful knowledge and business value from the data you collect. By learning data science principles, you will understand the many data-mining techniques in use today. More importantly, these principles underpin the processes and strategies necessary to solve business problems through data mining techniques.

