
Apache Hadoop is a Java framework for performing distributed computation on a large cluster of commodity hardware. It is inspired by Google’s MapReduce and BigTable implementations and is designed for processing very large datasets. Hadoop is at the center of the “Big Data” movement that is revolutionizing the way organizations manage and process their data.

The goal of this article is to outline the steps necessary for a simple installation of Hadoop. The cluster in our installation will consist only of the user’s machine, but it will be sufficient to demonstrate the steps required. The operating system used for demonstration is Ubuntu 10.04 (or later), and it is assumed that the reader is familiar with basic command-line usage. Essentially the same steps can be used to carry out an installation on other operating systems as well.

Installing the prerequisite software

Since Hadoop is Java-based, a Java runtime is the main software requirement. At the time of writing, Hadoop requires Java 1.6 or later. To install Java on Ubuntu, we will use the package manager provided by the OS:
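(A command along the following lines should do; the openjdk-6-jdk package name matches the Java 6 requirement of the time, and newer Ubuntu releases ship later OpenJDK packages.)

    sudo apt-get update
    sudo apt-get install openjdk-6-jdk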

Here we have chosen to install the complete Java Development Kit (JDK) provided by the OpenJDK project. You are free to use any other distribution, for example the one provided by Oracle (formerly known as the Sun JDK). The installation of these alternatives is beyond the scope of this article, but there are many good guides available online.

To check that Java is installed properly, try running:
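    java -version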

If everything went smoothly this will output the version of Java that has been installed.

The next step is to create a dedicated user and group for our Hadoop installation. This insulates the installation from the rest of the environment and enables tighter security measures to be enforced (for example, in a production environment). We will create a user hduser and a group hadoop, and add the user to the group. This can be done using the following commands:
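    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hduser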

The nodes within the Hadoop cluster communicate via SSH, so we need to first configure the SSH server and then set up an appropriate configuration to allow the nodes in the cluster to communicate securely with each other (in our case, localhost to localhost communication). To do this, we first need to install the OpenSSH server and clients on our machine. Once again, we use Ubuntu’s package manager to install the required packages:
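    sudo apt-get install openssh-server openssh-client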

Once the installation is complete, we need to generate a pair of authentication keys and add them to the list of authorized keys of our local server. We need to perform these actions as the newly created hduser, so we first “change” to that user:
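    su - hduser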

Next, we generate the authentication keys using the key generation tool of OpenSSH:
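    ssh-keygen -t rsa -P ""    # press Enter to accept the default key location (~/.ssh/id_rsa)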

We specify the key to use the RSA public key encryption and a blank (“”) password (this is necessary in order to enable the nodes to communicate without any manual intervention for typing in passwords). Once the key is generated, we add it to the list of authorized keys by doing the following:
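(Assuming the key was saved to its default location, ~/.ssh/id_rsa:)

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys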

Finally, we need to test the connection to our localhost over SSH to see if everything is configured properly. Type in:
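    ssh localhost

The very first connection will ask you to confirm the host’s fingerprint; answer yes so that localhost is added to the list of known hosts.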

If everything is configured properly, the SSH connection will complete successfully and we will be logged in to our account (over SSH).

Downloading Hadoop

Next, we need to download Hadoop from one of the download mirrors. We then extract the contents of the Hadoop package to a location of our choice; in this article, we will use /usr/local/hadoop. Finally, we need to change the owner and group of all the extracted files to our hduser user and hadoop group:
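(Something along these lines; the archive name depends on the exact release you downloaded, so hadoop-1.0.4 here is purely an example.)

    tar -xzf hadoop-1.0.4.tar.gz                   # substitute the version you downloaded
    sudo mv hadoop-1.0.4 /usr/local/hadoop
    sudo chown -R hduser:hadoop /usr/local/hadoop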

Configuring Hadoop

Add the following lines at the end of the .bashrc file in your home directory (if you are using another shell, add corresponding lines to its configuration file):
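(A minimal set of additions; HADOOP_HOME and the PATH entry are conveniences so that the Hadoop scripts can be run without typing their full path, rather than requirements of Hadoop itself.)

    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
    export PATH=$PATH:$HADOOP_HOME/bin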

In the JAVA_HOME environment variable, the java-6-openjdk-i386 directory points to the directory of our installed JVM. Check the /usr/lib/jvm/ directory on your system to determine the exact name of this directory (and then substitute the proper name in the configuration snippet above).

Next, we need to modify the configuration files in the /usr/local/hadoop/conf directory (provided Hadoop is installed in /usr/local/hadoop/, as in our case).

We start off with hadoop-env.sh. Open this file in your favorite editor and navigate to the commented line containing the ‘export JAVA_HOME…’ statement. Uncomment this line (by removing the hash character) and update it to:
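    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386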

Again, substitute the directory java-6-openjdk-i386 with the appropriate directory on your system.

Hadoop uses the Hadoop Distributed Filesystem (HDFS) to store all of the data in its cluster. Read Chapter 3, The Hadoop Distributed Filesystem, in Hadoop: The Definitive Guide, 3rd Edition for more on HDFS. Hadoop also needs a directory in which to store temporary files, so before configuring it we create a directory to act as this temporary location:
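(The path itself is arbitrary; /app/hadoop/tmp is a common choice and is the one referenced in the configuration below.)

    sudo mkdir -p /app/hadoop/tmp
    sudo chown hduser:hadoop /app/hadoop/tmp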

With the directory in place, we need to update Hadoop’s default configuration. For this we open up the core-site.xml file and between the <configuration>…</configuration> tags add the following lines:
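(The property names are the standard ones for this generation of Hadoop; the port number 54310 is only a commonly used convention, and the temporary directory should match the one created above.)

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>Base directory for Hadoop's temporary files.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>URI of the default filesystem (our single-node HDFS).</description>
    </property>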

Similarly we update the mapred-site.xml file by adding:
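(Again, the port number is only a convention.)

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>Host and port at which the MapReduce JobTracker runs.</description>
    </property>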

And the hdfs-site.xml file by adding:
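(A replication factor of 1 is appropriate here because our cluster has only a single node.)

    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication; 1 is sufficient for a single-node cluster.</description>
    </property>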

The property elements need to be added between the <configuration>…</configuration> tags in both of these files as well. For more information about the configuration options available, consult the documentation at Hadoop’s API Overview.

Running Hadoop

Before we can run Hadoop for the first time, we need to format the HDFS filesystem. To do this, we run:
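    /usr/local/hadoop/bin/hadoop namenode -format

Run this as hduser. Formatting is only needed once, before the first start; reformatting an existing installation erases all data stored in HDFS.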

To start Hadoop, along with all its associated tasks, we execute:
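    /usr/local/hadoop/bin/start-all.sh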

To check that Hadoop has actually started, we can use the JVM Process Status (jps) tool:
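    jps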

If Hadoop started up successfully, this should output something like the following:
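(The process IDs will differ on your machine; the important part is that a JVM is running for each of the five Hadoop daemons.)

    2287 NameNode
    2349 DataNode
    2411 SecondaryNameNode
    2473 JobTracker
    2535 TaskTracker
    2601 Jps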

You can then load and run any MapReduce tasks that you want on the Hadoop system.

To close down Hadoop, and all the associated tasks, we can run:
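    /usr/local/hadoop/bin/stop-all.sh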

And that covers it.

Conclusion

Apache Hadoop, while powerful, is a complex tool that requires extensive configuration and tweaking to run and manage properly. This article covered a basic installation scenario. There are numerous books and resources available that cover more advanced usages, and these need to be consulted if you are planning to use Hadoop in a real-world environment.

Safari Books Online has the content you need

Check out these Hadoop books available from Safari Books Online:

Ready to unlock the power of your data? With Hadoop: The Definitive Guide, 3rd Edition, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You’ll also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
In Pro Hadoop, you will learn the ins and outs of MapReduce: how to structure a cluster, design and implement the Hadoop file system, and how to structure your first cloud-computing tasks using Hadoop. You will also learn how to let Hadoop take care of distributing and parallelizing your software: you just focus on the code, and Hadoop takes care of the rest.
If you’ve been tasked with the job of maintaining large and complex Hadoop clusters, or are about to be, Hadoop Operations is a must. You’ll learn the particulars of Hadoop operations, from planning, installing, and configuring the system to providing ongoing maintenance.

About the author

Shaneeb Kamran is a Computer Engineer from one of the leading universities of Pakistan. His programming journey started at the age of 12, and ever since he has dabbled in every new and shiny software technology he could get his hands on. He is currently involved in a startup that is working on cloud computing products.

Tags: Apache Hadoop, Hadoop Cluster, Hadoop Distributed Filesystem, HDFS, java, OpenJDK, OpenSSH
