Preface

Many organizations have a treasure trove of data stored away in the many silos of information within them. To unlock this information and use it to compete in the marketplace, organizations have begun looking to Hadoop and “Big Data” as the key to gaining an advantage over their competition. Many organizations, however, lack the knowledgeable resources and data center space to launch large-scale Hadoop solutions for their data analysis projects.

Amazon Elastic MapReduce (EMR) is Amazon’s Hadoop solution, running in Amazon’s data centers. It allows organizations to focus on the data analysis problems they want to solve without the need to plan data center buildouts and maintain large clusters of machines. Amazon’s pay-as-you-go model is another benefit: it allows organizations to start these projects with no upfront costs and scale quickly as the project grows. We hope this book inspires you to explore Amazon Web Services (AWS) and Amazon EMR, and helps you launch your next great project with the power of Amazon’s cloud to solve your biggest data analysis problems.

This book focuses on the core Amazon technologies needed to build an application using AWS and EMR. We chose an application that analyzes log data as our case study throughout this book to demonstrate the power of EMR. Log analysis is a good case study for many of the data analysis problems that organizations face. Computer logfiles contain large amounts of diverse data from different sources and can be mined to gain valuable intelligence. More importantly, logfiles are ubiquitous across computer systems and provide a readily available data set with which you can start solving data analysis problems.

Here is an outline of what this book provides:

  • Sample configurations for third-party software
  • Step-by-step configurations for AWS
  • Sample code
  • Best practices
  • Gotchas

The intent is not to provide a book that has all the code, configuration, and so on, to be able to plop this application on AWS and start going. Instead, we will provide guidance to help you see how to put together a system or application in a cloud environment and describe core issues you may face in working within AWS in building your own project.

You will get the most out of this book if you have some experience developing or managing applications built for the traditional data center, and now want to learn how you can move your applications and data into a cloud environment. You should be comfortable using development toolsets and reviewing code samples, architecture diagrams, and configuration examples to understand the basic concepts covered in this book. We will use the command line and command-line tools in Unix in a number of the examples we present, so it would not hurt to be familiar with navigating the command line and using basic Unix command-line utilities. The examples in this book can be used on Windows systems too, but you may need to load third-party utilities like Cygwin to follow along.

This book will challenge you with new ways of looking at your applications outside of your traditional data center walls, but hopefully it will open your eyes to the possibilities of what you can accomplish when you focus on the problems you are trying to solve rather than the many administrative issues of building out new servers in a private data center.

What Is AWS?

Amazon Web Services is the name of the computing platform started by Amazon in 2006. AWS offers a suite of services to companies and third-party developers to build solutions using the computing and software resources hosted in Amazon’s data centers around the globe. Amazon Elastic MapReduce is one of many available AWS services. With the pay-as-you-go model in AWS, developers and companies pay only for the resources they use. This model is changing how many businesses approach new projects and initiatives. New initiatives can get started and scale within AWS as they build a customer base and grow, without much of the usual upfront cost of buying new servers and infrastructure. Using AWS, companies can spend less effort building and maintaining data centers and physical infrastructure, and focus instead on innovation and on building great solutions.

What’s in This Book?

This book is organized as follows. Chapter 1 introduces cloud computing and helps you understand Amazon Web Services and Amazon Elastic MapReduce. Chapter 2 gets us started exploring the Amazon tools we will be using to examine log data and execute our first Job Flow inside of Amazon EMR. In Chapter 3, we get down to the business of exploring the types of analyses that can be done with Amazon EMR using a number of MapReduce design patterns, and review the results we can get out of log data. In Chapter 5, we delve into machine learning techniques and how these can be implemented and utilized in our application to build intelligent systems that can take action or recommend a solution to a problem. Finally, in Chapter 6, we review project cost estimation for AWS and EMR applications and how to perform cost analysis of a project.

Sign Up for AWS

To get started, you need to sign up for AWS. If you are already an AWS user, you can skip this section because you already have access to each of the AWS services used throughout this book. If you are a new user, we will get you started in this section.

To sign up for AWS, go to the AWS website, as shown in Figure 1.

Figure 1. Amazon Web Services home page

You will need to provide a phone number to verify that you are setting up a valid account and you will also need to provide a credit card number to allow Amazon to bill you for the usage of AWS services. We will cover how to estimate, review, and set up billing alerts within AWS in Chapter 6.

After signing up for an AWS account, go to your My Account page to review the services to which you now have access. Figure 2 shows the available services under our account, but your results will likely look somewhat different.

Tip

Remember, there are charges associated with the use of AWS, and a number of the examples and exercises in this book will incur charges to your account. With a new AWS account, there is a free tier. To minimize the costs while learning about Amazon Elastic MapReduce, review the free-tier limitations, turn off instances after running through your exercises, and learn how to estimate costs in Chapter 6.

Figure 2. AWS services available after signup

Code Samples in This Book

There are numerous code samples and examples throughout this book. Many of the examples are built using the Java programming language or Hadoop Java libraries. To get the most out of this book and follow along, you need a system set up for Java development, along with the Hadoop Java JAR files, so that you can build an application that Amazon EMR can consume and execute. To get ready to develop and build your next application, review Appendix C to set up your development environment. This is not a requirement, but it will help you get the most value out of the material presented in the chapters.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip, suggestion, or general note.

Warning

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Programming Elastic MapReduce by Kevin J. Schmidt and Christopher Phillips (O’Reilly). Copyright 2014 Kevin Schmidt and Christopher Phillips, 978-1-449-36362-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

Safari® Books Online

Note

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/Prog-Elastic-MapReduce.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

My wife Michelle gave me the encouragement to complete this book. Of course my employer, Dell, deserves an acknowledgment. They provided me with support to do this project. I next need to thank my co-workers who provided me with valuable input: Rob Scudiere, Wayne Haber, and Marco Arguedas. Finally, the tech reviewers provided fantastic guidance on how to make the book better: Jennifer Davis, Michael Ducy, Kirk Kimmel, Ari Hershowitz, Chris Corriere, Matthew Gast, and Russell Jurney.

—Kevin

I would like to thank my beautiful wife, Inna, and my lovely children Jacqueline and Josephine. Their kindness, humor, and love gave me inspiration and support while writing this book and through all of life’s adventures. I would also like to thank the tech reviewers for their insightful feedback that greatly improved many of the learning examples in the book. Matthew Gast, in particular, provided great feedback throughout all sections of the book, and his insights into the business and technical merits of the technologies and examples were invaluable. Wayne Haber, Rob Scudiere, Jim Birmingham, and my employer Dell deserve acknowledgment for their valuable input and regular reviews throughout the development of the book. I would finally like to thank my co-author Kevin Schmidt and my editor Courtney Nash for giving me the opportunity to be part of this great book and for their hard work and efforts in its development.

—Chris
