O'Reilly logo

Programming Elastic MapReduce by Christopher Phillips, Kevin Schmidt


Chapter 2. Data Collection and Data Analysis with AWS

Now that we’ve covered the basics of AWS and Amazon EMR, you can get to work using Amazon’s tools in the cloud. To get started, you’ll create some sample data for your first Amazon EMR job to parse. This exercise requires a number of AWS tools and techniques to move the data to a location that Amazon EMR can access and work on. It should give you a solid background on what is available, how to begin thinking about your data, and how to overcome the challenges of moving your data into AWS.

Amazon EMR is built with many of the core components and frameworks of Apache Hadoop. Apache Hadoop allows organizations to build data-intensive distributed applications across a cluster of low-cost hardware. Amazon EMR simply takes this technology and moves it to the Amazon cloud to run at web scale on AWS hardware.

The key to all of this is the MapReduce framework. MapReduce is a powerful framework used to break down large data sets into smaller sets that can be processed in Amazon EMR across the multiple EC2 instances that compose a cluster. To demonstrate the power of this concept, in this chapter you’ll create an Amazon EMR Cluster, also known as a Job Flow, in Java. The Job Flow will determine message frequency for the sample data set. Of course, as with learning anything new, you are bound to make mistakes in the development of an Amazon EMR Job Flow. Toward the end of the chapter, we will intentionally ...
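
To make the map and reduce split concrete before any cluster is involved, here is a minimal plain-Java sketch of a message-frequency count. It is not the chapter's Job Flow and uses no Hadoop or Amazon EMR classes. The class name `MessageFrequencySketch`, its methods, and the input line format (`<timestamp> <message text>`) are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates the map, shuffle, and reduce phases in one JVM.
// All names here are hypothetical; no Hadoop or EMR API is used.
final class MessageFrequencySketch {

    // Map phase: turn one input line into a (message, 1) pair.
    // Assumed line format: "<timestamp> <message text>".
    static Map.Entry<String, Integer> map(String line) {
        int firstSpace = line.indexOf(' ');
        String message = firstSpace < 0 ? line : line.substring(firstSpace + 1);
        return new SimpleEntry<>(message, 1);
    }

    // Reduce phase: sum all counts that were emitted for a single message.
    static int reduce(List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    // Runs map on every line, groups the pairs by key (the "shuffle"),
    // then reduces each group to a single frequency.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Integer> frequencies = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            frequencies.put(group.getKey(), reduce(group.getValue()));
        }
        return frequencies;
    }
}
```

In a real cluster the framework performs the grouping step and spreads the map and reduce calls across instances; the sketch keeps everything in one process only so that the data flow is easy to follow.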
