Apache Hadoop is an open source software framework that allows large sets of data to be processed using commodity hardware. Hadoop is designed to run on top of a large cluster of nodes that are connected to form a large distributed system. Hadoop implements a computational paradigm known as MapReduce, which was inspired by an architecture developed by Google to implement its search technology. The MapReduce model runs over a distributed filesystem and the combination allows Hadoop to process a huge amount of data, while at the same time being fault tolerant.
Hadoop is a complex piece of software, which can be a stumbling block for newcomers. In this article we will briefly cover the basics of Hadoop and explain how the various parts and components of Hadoop fit together to provide the functionality that Hadoop offers. We will also take a look at the Map/Reduce model, which is a central piece of Hadoop, and explore how it is being used within Hadoop to break complex data processing tasks into simpler ones.
The Hadoop architecture is based on a master/slave model. The master node runs the JobTracker, TaskTracker, NameNode and DataNode, whereas a slave node can run the TaskTracker and DataNode. The JobTracker is responsible for handing out MapReduce tasks to specific nodes and keeping track of them. The selection of a node is dictated by availability as well as the “proximity” of the node to the set of data on which the task is to be performed. The TaskTracker is responsible for accepting jobs from the JobTracker and running them on the node. Each TaskTracker has a number of slots, which limits the number of tasks that the node can run. The TaskTracker spawns a new JVM for each task received so that a crash while performing the task does not cause the TaskTracker to fail. It then monitors the progress of this spawned process, capturing its output and exit codes.
All the processing is performed on top of a distributed filesystem. By default, Hadoop comes with the Hadoop Distributed Filesystem (HDFS), which is a distributed, scalable filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS is location-aware, meaning it keeps track of where the data resides in a network by associating with the dataset the name of its rack (or network switch). This allows Hadoop to efficiently schedule tasks to those nodes that contain data (or which are nearest to the data) in order to optimize bandwidth utilization.
NameNode and DataNode are part of HDFS. The NameNode is responsible for maintaining the directory structure of the filesystem and to track where the file data is kept in the Hadoop cluster. The NameNode is a single point of failure in HDFS; if it fails, the whole filesystem will come down (support for a secondary NameNode is present). It does not, however, store any data itself. The DataNode is where data is actually stored. An optimal cluster has multiple DataNodes that store data across multiple locations for increased reliability. The recommended node configuration is to have one TaskTracker and DataNode per server so that MapReduce operations can be run on the server for which data is available locally.
MapReduce is a computational paradigm designed to process very large sets of data in a distributed fashion. The MapReduce model was developed by Google to implement their search technology, specifically the indexing of web pages. The model is based on the concept of breaking the data processing task into two smaller tasks of mapping and reduction. During the map process, a key-value pair in one domain is mapped to a key-value pair in another pair, where the ‘value’ can be a single or a list of multiple values. The keys from the mapping process are then aggregated and the values for the same key combined together. This aggregated data is then fed to the reducer (one call per key) and the reducer then processes this data to produce a final value. The list of all final values for all the keys is the result set.
The key issue in breaking a problem into the MapReduce model is that the map and reduce operations can be performed in parallel on different keys, without the results of one operation affecting the other. This independence of results allows the map/reduce tasks to be distributed in parallel to multiple nodes, which can then perform the respective operations independent of each other. The final results are then aggregated together to produce the final result list.
A classic example used to explain the MapReduce model is the “word counting” example. The problem to be solved is to count the occurrence of each word in a set of documents. The following algorithm illustrates the map and reduce functions, respectively:
function map(name, document):
// name is document name (key)
// document is the contents of the document (value)
foreach word in document:
function reduce(word, count):
// word (key)
// count is the “partial” count of word (value)
sum = 0
foreach c in count:
sum = sum + c
The map function takes a document and maps it to a set of words and partial counts of those words. The underlying framework then combines all the sets of values from the mapping function for each word and passes them to the reduce function. The reduce function adds all the partial counts together to output the total count of each word across all documents.
Apache Hadoop is being successfully used in domains where the MapReduce model can be used to break large processing tasks into simpler, smaller ones. These include pattern-based searching, sorting, reverse indexing, machine learning, statistical machine translation, image processing and analytics. A recent trend involves using Hadoop for Big Data, or very large sets of data, to extract unique trends, insights, and patterns that cannot be determined by looking at smaller data sets only. The Big Data industry is currently on the rise with an increasing adoption by large enterprises, and Apache Hadoop is at the center of this revolution.
In this article we had a bird’s eye view of what Apache Hadoop is, including its architecture, components, and MapReduce framework. We explored how the various parts of Hadoop together make the distributed processing of data possible and also covered the MapReduce computational paradigm from a conceptual point of view. For more in depth coverage of Hadoop, you can visit its official wiki at http://wiki.apache.org/hadoop/.
Safari Books Online has the content you need
Below are some Hadoop books to help you develop applications, or you can check out all of the Hadoop books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.
|Ready to unlock the power of your data? With Hadoop: The Definitive Guide, 3rd Edition, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You’ll also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.|
|Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. This book will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs. This book also takes you beyond the mechanics of running Hadoop, teaching you to write meaningful programs in a MapReduce framework.|
|In Pro Hadoop, you will learn the ins and outs of MapReduce: how to structure a cluster, design and implement the Hadoop file system, and how to structure your first cloud—computing tasks using Hadoop. You will also learn how to let Hadoop take care of distributing and parallelizing your software—you just focus on the code, Hadoop takes care of the rest.|
|If you’ve been tasked with the job of maintaining large and complex Hadoop clusters, or are about to be, Hadoop Operations is a must. You’ll learn the particulars of Hadoop operations, from planning, installing, and configuring the system to providing ongoing maintenance.|
About the authors
|Salman Ul Haq is a techpreneur, co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree is Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at firstname.lastname@example.org.|
|Shaneeb Kamran is a Computer Engineer from one of the leading universities of Pakistan. His programming journey started at the age of 12 and ever since he has dabbled himself in every new and shiny software technology he could get his hands on. He is currently involved in a startup that is working on cloud computing products.|