
With Apache Hadoop MapReduce, users who previously had to store data in a relational database and process it with SQL queries can now work with mammoth volumes of unstructured data. MapReduce has made data analytics and processing much simpler and, more importantly, scalable, and Hadoop is by far the most widely used implementation of the MapReduce paradigm.

You can get started using the MapReduce framework on your own machine with a local mode installation of Hadoop: simply unzip, configure, and you're ready to write your first MapReduce example. Local mode is meant for learning and development rather than real workloads. You can find more on getting started with Hadoop on your own machine on the official Hadoop page.

Since MapReduce is ideal for processing huge amounts of unstructured data, any serious Hadoop deployment naturally requires a sizeable compute and storage infrastructure. This is where Amazon EMR comes in. EMR, short for Elastic MapReduce, is a ready-made web service that runs MapReduce on Amazon EC2. If you're familiar with Amazon Web Services (AWS), it takes only a couple of minutes to get an EMR job flow up and running, and this article will walk you through that process. The best thing about EMR is that you do not have to handle EC2 instance provisioning yourself, since EMR fires up and shuts down instances on demand. With Amazon EMR, you don't have to worry about configuring a Hadoop cluster, so you can focus on crunching your big data.

This blog post uses the age-old WordCount example – the "Hello World" of MapReduce and the very first example you code when you get started. We'll use Amazon S3 to store the input dataset (text files) and the output from the reduce stage. Since this post is about running a MapReduce job on Amazon EMR, we'll assume that you have the WordCount source built and the JAR file ready. If not, you can find it on the official Hadoop page. Let's get started with the steps involved in deploying and executing WordCount on Amazon EMR.
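If you want to build the JAR yourself, the listing below is a minimal sketch of the canonical WordCount program (mapper, reducer, and driver), close to the version shown in the Hadoop documentation; the class name and paths are our own and purely illustrative.

// WordCount.java – a sketch of the canonical example.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum up the 1s emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  // On EMR these will be the s3n:// paths described later in this post;
  // in a local mode installation they can simply be local directories.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Package this class into a JAR (for example, wordcount.jar) and you have everything the EMR job flow below needs.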

Step 1 – Setting up Amazon AWS

If you do not have an Amazon AWS account, signing up is a two-minute process. You can sign up at http://aws.amazon.com. Once you have the account, open the Amazon S3 management console. As mentioned before, we will be using S3 to store our input dataset as well as our output. For those who are new to S3, it's a cloud storage service that lets you create infinite – well, not literally infinite – "buckets". Buckets are like directories in which you store your "objects" (data files).

Step 2 – Creating S3 Buckets

We will create separate buckets for the input (the plain text files whose words we want to count), the output (the word counts), the WordCount JAR file, and the logs from the job flow. S3 bucket names are global, which means each bucket name has to be unique across all of S3.
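If you prefer to script this step rather than click through the S3 console, the snippet below is a rough sketch using the AWS SDK for Java (v1). The bucket and file names are placeholders you must replace with your own globally unique names, and your AWS credentials need to be available through the SDK's default credential chain.

// CreateBuckets.java – sketch: create the buckets and upload the input data and JAR.
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class CreateBuckets {
  public static void main(String[] args) {
    AmazonS3 s3 = new AmazonS3Client(); // picks up credentials from the default chain

    // One bucket each for input, output, the JAR, and the logs (names are placeholders).
    s3.createBucket("my-wordcount-input");
    s3.createBucket("my-wordcount-output");
    s3.createBucket("my-wordcount-jar");
    s3.createBucket("my-wordcount-logs");

    // Upload the dataset and the job JAR.
    s3.putObject("my-wordcount-input", "input.txt", new File("input.txt"));
    s3.putObject("my-wordcount-jar", "wordcount.jar", new File("wordcount.jar"));
  }
}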

Step 3 – Setting up Amazon EMR

Now that we have the buckets to hold the data, let's configure and launch our MapReduce job flow. Log into your AWS account and go to the EMR console (https://console.aws.amazon.com/elasticmapreduce). Click the "Create New Job Flow" button. Enter a descriptive name for your job and choose the Hadoop version you want to use; Amazon currently offers its own distribution as well as MapR distributions, so select the Amazon Distribution. Then choose the "Run your own application" option and select "Custom JAR", since we want to execute our own WordCount example.

On the next tab, you have to specify the location of the JAR file and its arguments. The format for the JAR location is <bucket_name>/<jar_name>. The arguments for our WordCount JAR are its main class, followed by the input path (the bucket where you uploaded the input data) and the output path. Note that you only have to provide the paths, not individual file names, and make sure the output path does not already exist. The format for the input and output paths is s3n://<bucket_name>/path. Press Continue.

Now, since we're not going to count words in terabytes of text, all we need is one small instance for the Hadoop master node and two small instances for the slave nodes. For serious work, however, keep in mind that the master node has no failover recovery; it's recommended to save intermediate state by logging data that has been mapped but not yet reduced. Also note that by default Amazon EMR has a limit of 20 instances; you have to request a limit increase if you need more. You can choose between core and task instances: core instances persist data, while task instances don't. In our case, we'll spin up two core instances.

We’re almost done. On the “Advanced Options” tab, leave the default settings and specify the bucket and folder name where the MapReduce log will be saved.

On the “Bootstrap Actions” tab, keep the default settings and click Continue. Review your job and then launch it.
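For readers who would rather script all of Step 3 than click through the console, the sketch below shows roughly how the same job flow could be created with the AWS SDK for Java (v1): the custom JAR step with its input and output arguments, one small master and two small core instances, and the log location. The bucket names, main class, AMI version, and instance types are assumptions carried over from the earlier sketches; adjust them to match your own setup.

// LaunchWordCountJobFlow.java – sketch: create and start the job flow programmatically.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchWordCountJobFlow {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(); // default credential chain

    // The custom JAR step: main class, input path, output path (the output path must not exist yet).
    // If your JAR's manifest already names a main class, the first argument can be dropped.
    HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
        .withJar("s3n://my-wordcount-jar/wordcount.jar")
        .withArgs("WordCount",
                  "s3n://my-wordcount-input/",
                  "s3n://my-wordcount-output/run1/");

    StepConfig step = new StepConfig()
        .withName("WordCount step")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(jarStep);

    // One small master node plus two small core (data-persisting) instances.
    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small")
        .withKeepJobFlowAliveWhenNoSteps(false);

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("WordCount job flow")
        .withAmiVersion("latest")                // Amazon's own Hadoop distribution
        .withLogUri("s3n://my-wordcount-logs/")  // where EMR writes the job logs
        .withSteps(step)
        .withInstances(instances);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}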

To monitor the status of your MapReduce job, go back to the EMR console and refresh the page to see the job flow's current state.
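If you would rather poll the state from code than refresh the console, here is a rough sketch using the same SDK; the job flow ID is a placeholder for the one returned by runJobFlow (or shown in the console).

// JobFlowStatus.java – sketch: check a job flow's state programmatically.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;

public class JobFlowStatus {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();
    DescribeJobFlowsRequest request =
        new DescribeJobFlowsRequest().withJobFlowIds("j-XXXXXXXXXXXXX"); // placeholder ID
    for (JobFlowDetail flow : emr.describeJobFlows(request).getJobFlows()) {
      // States typically progress through STARTING, RUNNING, SHUTTING_DOWN, and COMPLETED or FAILED.
      System.out.println(flow.getName() + ": "
          + flow.getExecutionStatusDetail().getState());
    }
  }
}

You should now be able to run your own Hadoop MapReduce jobs on Amazon EMR!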

You can find a wealth of information on Hadoop and MapReduce in the eBooks referenced below.

Safari Books Online has the content you need

Hadoop MapReduce Cookbook deals with many exciting topics, such as setting up Hadoop security and using MapReduce for analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components from the Hadoop ecosystem, including HBase, Hadoop, Pig, and Mahout, and then how to set up cloud environments to perform Hadoop MapReduce computations.
Hadoop Beginner’s Guide removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.
Hadoop Real-World Solutions Cookbook covers loading and unloading data to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce jobs, and columnar storage and retrieval of structured data using Apache Accumulo.
Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you'll face, like querying big data using Pig or writing a log file loader. You'll explore each problem step by step, learning both how to build and deploy that specific solution along with the thinking that went into its design. As you work through the tasks, you'll find yourself growing more comfortable with Hadoop and at home in the world of big data.

About the author

Salman Ul Haq is a techpreneur, and co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree in Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at salman@tunacode.com.

Tags: Amazon AWS, Amazon EMR, Apache Hadoop, MapReduce
