In programming, as in many fields, the hard part isn’t solving problems, but deciding what problems to solve.
— Paul Graham Great Hackers
On August 6, 2012, the Mars rover Curiosity landed on the red planet millions of miles from Earth. A great deal of engineering and technical expertise went into this mission. Just as exciting was the information technology behind this mission and the use of AWS services by the NASA’s Jet Propulsion Laboratory (JPL). Shortly before the landing, NASA was able to provision stacks of AWS infrastructure to support 25 Gbps of throughput to provide NASA’s many fans and scientists up-to-the-minute information about the rover and its landing.Today, NASA continues to use AWS to analyze data and give scientists quick access to scientific data from the mission.
Why is this an important event in a book about Amazon Elastic MapReduce? Access to these types of resources used to be available only to governments or very large multi-national corporations. Now this power to analyze volumes of data and support high volumes of traffic in an instant is available to anyone with a laptop and a credit card. What used to take months—with the buildout of large data centers, computing hardware, and networking—can now be done in an instant and for short-term projects in AWS.
Today, businesses need to understand their customers and identify trends to stay ahead of their competition. In finance and corporate security, businesses are being inundated with terabytes and petabytes of information. IT departments with tight budgets are being asked to make sense of the ever-growing amount of data and help businesses stay ahead of the game. Hadoop and the MapReduce framework have been powerful tools to help in this fight. However, this has not eliminated the cost and time needed to build out and maintain vast IT infrastructure to do this work in the traditional data center.
EMR is an in-the-cloud solution hosted in Amazon’s data center that supplies both the computing horsepower and the on-demand infrastructure needed to solve these complex issues of finding trends and understanding vast volumes of data.
Throughout this book, we will explore Amazon EMR and how you can use it to solve data analysis problems in your organization. In many of the examples, we will focus on a common problem many organizations face: analyzing computer log information across multiple disparate systems. Many businesses are required by compliance regulations that exist, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS), to analyze and review log information on a regular, if not daily, basis. Log information from a large enterprise can easily grow into terabytes or petabytes of data. We will build a number of building blocks of an application that takes in computer log information and analyzes it for trends utilizing EMR. We will show you how to utilize Amazon EMR services to perform this analysis and discuss the economics and costs of doing so.
AWS has grown greatly over the years from its origins as a provider of remotely hosted infrastructure with virtualized computer instances called Amazon Elastic Compute Cloud (EC2). Today, AWS provides many, if not all, of the building blocks used in many applications today. Throughout this book, we will focus on a number of the key services Amazon provides.
Amazon S3 is the persistent storage for AWS. It provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the Web. There are some restrictions, though; data in S3 must be stored in named buckets, and any single object can be no more than 5 terabytes in size. The data stored in S3 is highly durable and is stored in multiple facilities and multiple devices within a facility. Throughout this book, we will use S3 storage to store many of the Amazon EMR scripts, source data, and the results of our analysis.
As with almost all AWS services, there are standard REST- and SOAP-based web service APIs to interact with files stored on S3. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize benefits of scale and to pass those benefits on to developers. To read Amazon’s overview of S3, visit the Amazon S3 web page. Amazon S3’s permanent storage will be used to store data sets and computed result sets generated by Amazon EMR Job Flows. Applications built with Amazon EMR will need to use some S3 services for data storage.
Amazon EC2 makes it possible to run multiple instances of virtual machines on demand inside any one of the AWS regions. The beauty of this service is that you can start as many or as few instances as you need without having to buy or rent physical hardware like in traditional hosting services. In the case of Amazon EMR, this means we can scale the size of our Hadoop cluster to any size we need without thinking about new hardware purchases and capacity planning. Individual EC2 instances come in a variety of sizes and specifications to meet the needs of different types of applications. There are instances tailored for high CPU load, high memory, high I/O, and more. Throughout this book, we will use native EC2 instances for a lot of the scheduling of Amazon EMR Job Flows and to run many of the mundane administrative and data manipulation tasks associated with our application building blocks. We will, of course, be using the Amazon EMR EC2 instances to do the heavy data crunching and analysis.
To read Amazon’s overview of EC2, visit the Amazon EC2 web page. Amazon EC2 instances are used as part of an Amazon EMR cluster throughout the book. We also utilize EC2 instances for administrative functions and to simulate live traffic and data sets. In building your own application, you can run the administrative and live data on your own internal hosts, and these separate EC2 instances are not a required service in building an application with Amazon EMR.
Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon’s infrastructure. Amazon EMR, like Hadoop, can be used to analyze large data sets. It greatly simplifies the setup and management of the cluster of Hadoop and MapReduce components. EMR instances use Amazon’s prebuilt and customized EC2 instances, which can take full advantage of Amazon’s infrastructure and other AWS services. These EC2 instances are invoked when we start a new Job Flow to form an EMR cluster. A Job Flow is Amazon’s term for the complete data processing that occurs through a number of compute steps in Amazon EMR. A Job Flow is specified by the MapReduce application and its input and output parameters.
Figure 1-1 shows an architectural view of the EMR cluster.
Amazon EMR performs the computational analysis using the MapReduce framework. The MapReduce framework splits the input data into smaller fragments, or shards, that are distributed to the nodes that compose the cluster. From Figure 1-1, we note that a Job Flow is executed on a series of EC2 instances running the Hadoop components that are broken up into master, core, and task clusters. These individual data fragments are then processed by the MapReduce application running on each of the core and task nodes in the cluster. Based on Amazon EMR terminology, we commonly call the MapReduce application a Job Flow throughout this book.
The master, core, and task cluster groups perform the following key functions in the Amazon EMR cluster:
reduceportions of our Job Flow, and store intermediate data to the Hadoop Distributed File System (HDFS) storage in our Amazon EMR cluster. The master node manages the tasks and data delegated to the core and task nodes. Due to the HDFS storage aspects of core nodes, a loss of a core node will result in data loss and possible failure of the complete Job Flow.
reducejobs, but does not have HDFS storage of the data and intermediate results. The lack of HDFS storage on these instances means the data needs to be transferred to these nodes by the master for the task group to do the work in the Job Flow.
The master and core group instances are critical components in the Amazon EMR cluster. A loss of a node in the master or core group instance can cause an application to fail and need to be restarted. Task groups are optional because they do not control a critical function of the Amazon EMR cluster. In terms of jobs and responsibilities, the master group must maintain the status of tasks. A loss of a node in the master group may make it so the status of a running task cannot be determined or retrieved and lead to Job Flow failure.
The core group runs tasks and maintains the data retained in the Amazon EMR cluster. A loss of a core group node may cause data loss and Job Flow failure.
A task node is only responsible for running tasks delegated to it from the master group and utilizes data maintained by the core group. A failure of a task node will lose any interim calculations. The master node will retry the task node when it detects failure in the running job. Because task group nodes do not control the state of jobs or maintain data in the Amazon EMR cluster, task nodes are optional, but they are one of the key areas where capacity of the Amazon EMR cluster can be expanded or shrunk without affecting the stability of the cluster.
As we’ve already seen, Amazon EMR uses Hadoop and its MapReduce framework at its core. Accordingly, many of the other core Apache Software Foundation projects that work with Hadoop also work with Amazon EMR. There are also many other AWS services that may be useful when you’re running and monitoring Amazon EMR applications. Some of these will be covered briefly in this book:
So how does using Amazon EMR compare to building out Hadoop in the traditional data center? Many of the AWS cloud considerations we discuss in Appendix B are also relevant to Amazon EMR. Compared to allocating resources and buying hardware in a traditional data center, Amazon EMR can be a great place to start a project because the infrastructure is already available at Amazon. Let’s look at a number of key areas that you should consider before embarking on a new Amazon EMR project.
Amazon EMR uses S3 storage for the input and output of data sets to be processed and analyzed. In order to process data, you need to transport it from the many sources where it currently lives up to Amazon’s cloud into S3 buckets. This is not a major issue for projects transitioning from other AWS services, but may be a barrier to projects that need to transport terabytes or petabytes of data from another cloud provider or hosted in a private data center to Amazon’s S3 storage.
In the traditional Hadoop install, data transport between the current source locations and the Hadoop cluster may be colocated in the same data center on high-speed internal networks. This lowers the data transport barriers and the amount of time to get data into Hadoop for analysis. Figure 1-2 shows the data locations and network topology differences between an Amazon EMR and traditional Hadoop installation.
If this will be a large factor in your project, you should review Amazon’s S3 Import and Export service option. The Import and Export service for S3 allows you to prepare portable storage devices that you can ship to Amazon to import your data into S3. This can greatly decrease the time and costs associated with getting large data sets into S3 for analysis. This approach can also be used in transitioning a project to AWS and EMR to seed the existing data into S3 and add data updates as they occur.
Many people point to Hadoop’s use of low-cost hardware to achieve enormous compute capacity as one of the great benefits of using Hadoop compared to purchasing large, specialized hardware configurations. We couldn’t agree more when comparing what Hadoop achieves in terms of cost and compute capacity in this model. However, there are still large upfront costs in building out a modest Hadoop cluster. There are also the ongoing operational costs of electricity, cooling, IT personnel, hardware retirement, capacity planning and buildout, and vendor maintenance contracts on the operating system and hardware.
With Amazon EMR, you only pay for the services you use. You can quickly scale capacity up and down, and if you need more memory or CPU for your application, this is a simple change in your EC2 instance types when you’re creating a new Job Flow. We’ll explore the costs of Amazon EMR in Chapter 6 and help you understand how to estimate costs to determine the best solution for your organization.
With the low-cost hardware of Hadoop clusters, many organizations start proof-of-concept data analysis projects with a small Hadoop cluster. The success of these projects leads many organizations to start building out their clusters and meet production-level data needs. These projects eventually reach a tipping point of complexity where much of the cost savings gained from the low-cost hardware is lost to the administrative, labor, and data center cost burdens. The time and labor commitments of keeping thousands of Hadoop nodes updated with OS security patches and replacing failing systems can require a great deal of time and IT resources. Estimating them and being able to compare these costs to EMR will be covered in detail in Chapter 6.
With Amazon EMR, the EMR cluster nodes exist and are maintained by Amazon. Amazon regularly updates its EC2 Amazon Machine Images (AMI) with newer releases of Hadoop, security patches, and more. By default, a Job Flow will start an EMR cluster with the latest and greatest EC2 AMIs. This removes much of the administrative burden in running and maintaining large Hadoop clusters for data analysis.
In order to show the power of using AWS for building applications, we will build a number of building blocks for a MapReduce log analysis application. In many of our examples throughout this book, we will use these building blocks to perform analysis of common computer logfiles and demonstrate how these same building blocks can be used to attack other common data analysis problems. We will discuss how AWS and Amazon EMR can be utilized to solve different aspects of these analysis problems. Figure 1-3 shows the high-level functional diagram of the AWS components we will use in the upcoming chapters. Figure 1-3 also highlights the workflow and inter-relationships between these components and how they share data and communicate in the AWS infrastructure.
Using our building blocks, we will explore how these can be used to ingest large volumes of log data, perform real-time and batch analysis, and ultimately produce results that can be shared with end users. We will derive meaning and understanding from data and produce actionable results. There are three component areas for the application: collection stage, analysis stage, and the nuts and bolts of how we coordinate and schedule work through the many services we use. It might seem like a complex set of systems, interconnections, storage, and so on, but it’s really quite simple, and Amazon EMR and AWS provide us a number of great tools, services, and utilities to solve complex data analysis problems.
In the next set of chapters, we will dive into each component area of the application and highlight key portions of solving data analysis problems:
reducemethods that will be run as Job Flows in Amazon EMR. In Chapter 4, we will show you that you don’t have to be a NASA rocket scientist or a Java programmer to use Amazon EMR. We will revisit the same analysis issues covered in earlier chapters, and using more high-level scripting tools like Pig and Hive, solve the same problems. Hadoop and Amazon EMR allow us to bring to bear a significant number of tools to mine critical information out of our data.
By now, you hopefully have an understanding of how AWS and Amazon EMR could provide value to your organization. In the next chapter, you will start getting your hands dirty. You’ll generate some simple log data to analyze and create your first Amazon EMR Job Flow, and then do some simple data frequency analysis on those sample log messages.