Chapter 2. Data Collection and Data Analysis with AWS

Now that we’ve covered the basics of AWS and Amazon EMR, you can get to work using Amazon’s tools in the cloud. To get started, you’ll create some sample data for your first Amazon EMR job to parse. A number of AWS tools and techniques will be required as part of this exercise to move the data to a location that Amazon EMR can access and work on. This should give you a solid background on what is available, how to begin thinking about your data, and how to overcome the challenges of moving your data into AWS.
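The location Amazon EMR reads its input from is typically an Amazon S3 bucket. As a preview of that data-movement step, here is a minimal sketch using the AWS SDK for Java to upload a local sample data file to S3; the bucket name, key, and file name are hypothetical placeholders, not values the chapter prescribes.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class SampleDataUploader {
    public static void main(String[] args) {
        // Credentials come from the default provider chain
        // (environment variables, ~/.aws/credentials, or an IAM role).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical bucket and key; substitute your own bucket name.
        String bucket = "my-emr-sample-data";
        String key = "input/sample-syslog.log";

        // Upload the locally generated sample data so an EMR cluster can
        // later read it as s3://my-emr-sample-data/input/sample-syslog.log.
        s3.putObject(bucket, key, new File("sample-syslog.log"));
    }
}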

Amazon EMR is built with many of the core components and frameworks of Apache Hadoop. Apache Hadoop allows organizations to build data-intensive distributed applications across a cluster of low-cost hardware. Amazon EMR simply takes this technology and moves it to the Amazon cloud to run at web scale on Amazon’s AWS hardware.

The key to all of this is the MapReduce framework. MapReduce is a powerful framework used to break down large data sets into smaller sets that can be processed in Amazon EMR across the multiple EC2 instances that compose a cluster. To demonstrate the power of this concept, in this chapter you’ll create an Amazon EMR cluster, also known as a Job Flow, in Java. The Job Flow will determine message frequency for the test sample data set; a sketch of what its mapper and reducer might look like follows below. Of course, as with learning anything new, you are bound to make mistakes and errors in the development of an Amazon EMR Job Flow. Toward the end of the chapter, we will intentionally introduce errors into the Job Flow so you can walk through finding and fixing them.
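To make the map and reduce phases concrete before we build the real Job Flow, here is a minimal sketch of a Hadoop mapper and reducer that count message frequency. It assumes log lines where the message severity is the third whitespace-delimited field; the class names and field position are illustrative, not the chapter’s final code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: emits (severity, 1) for each log line, assuming
// the severity appears as the third whitespace-delimited field.
public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text severity = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        if (fields.length > 2) {
            severity.set(fields[2]);
            context.write(severity, ONE);
        }
    }
}

// Reducer: sums the per-severity counts emitted by the mappers to
// produce the overall frequency of each message type.
class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Each EC2 instance in the cluster runs mappers over its slice of the input, and Hadoop routes all counts for the same severity to a single reducer, which is what lets the work scale out across the cluster.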
