Learning Big Data with Amazon Elastic MapReduce

Book Description

Easily learn, build, and execute real-world Big Data solutions using Hadoop and AWS EMR

In Detail

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. As businesses generate and collect ever more data, and as cost-effective cloud-based solutions for distributed computing have arrived, it has become far more feasible to crunch large amounts of data and gain deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. This book covers the architectural details of the MapReduce framework, Apache Hadoop, various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter where you will learn about solving a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with major Big Data technologies quickly and efficiently.

What You Will Learn

  • Create and access your account on AWS and learn about its various services
  • Launch a machine on the cloud infrastructure of AWS, get login credentials, and communicate with that machine
  • Learn about the logical dataflow of MapReduce and how it uses distributed computing effectively
  • Understand the benefits of EMR over a local Hadoop cluster
  • Discover best practices to keep in mind while planning and running a cluster or job on EMR
  • Launch a cluster on Amazon EMR, submit the Hello World wordcount job for processing, and download and view the results
  • Execute jobs on EMR using the two primary methods provided by EMR

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Learning Big Data with Amazon Elastic MapReduce
      1. Table of Contents
      2. Learning Big Data with Amazon Elastic MapReduce
      3. Credits
      4. About the Authors
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
          3. Instant updates on new Packt books
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Amazon Web Services
        1. What is Amazon Web Services?
        2. Structure and Design
          1. Regions
          2. Availability Zones
        3. Services provided by AWS
          1. Compute
            1. Amazon EC2
            2. Auto Scaling
            3. Elastic Load Balancing
            4. Amazon Workspaces
          2. Storage
            1. Amazon S3
            2. Amazon EBS
            3. Amazon Glacier
            4. AWS Storage Gateway
            5. AWS Import/Export
          3. Databases
            1. Amazon RDS
            2. Amazon DynamoDB
            3. Amazon Redshift
            4. Amazon ElastiCache
          4. Networking and CDN
            1. Amazon VPC
            2. Amazon Route 53
            3. Amazon CloudFront
            4. AWS Direct Connect
          5. Analytics
            1. Amazon EMR
            2. Amazon Kinesis
            3. AWS Data Pipeline
          6. Application services
            1. Amazon CloudSearch (Beta)
            2. Amazon SQS
            3. Amazon SNS
            4. Amazon SES
            5. Amazon AppStream
            6. Amazon Elastic Transcoder
            7. Amazon SWF
          7. Deployment and Management
            1. AWS Identity and Access Management
            2. Amazon CloudWatch
            3. AWS Elastic Beanstalk
            4. AWS CloudFormation
            5. AWS OpsWorks
            6. AWS CloudHSM
            7. AWS CloudTrail
          8. AWS Pricing
        4. Creating an account on AWS
          1. Step 1 – Creating an Amazon.com account
          2. Step 2 – Providing a payment method
          3. Step 3 – Identity verification by telephone
          4. Step 4 – Selecting the AWS support plan
        5. Launching the AWS management console
        6. Getting started with Amazon EC2
          1. How to start a machine on AWS?
            1. Step 1 – Choosing an Amazon Machine Image
            2. Step 2 – Choosing an instance type
            3. Step 3 – Configuring instance details
            4. Step 4 – Adding storage
            5. Step 5 – Tagging your instance
            6. Step 6 – Configuring a security group
          2. Communicating with the launched instance
          3. EC2 instance types
            1. General purpose
              1. M3 instance sizes
            2. Memory optimized
              1. R3 instance sizes
            3. Compute optimized
              1. C3 instance sizes
        7. Getting started with Amazon S3
          1. Creating a S3 bucket
            1. Bucket naming
            2. S3cmd
        8. Summary
      10. 2. MapReduce
        1. The map function
        2. The reduce function
          1. Divide and conquer
        3. What is MapReduce?
          1. The map reduce function models
            1. The map function model
            2. The reduce function model
        4. Data life cycle in the MapReduce framework
          1. Creation of input data splits
            1. Record reader
          2. Mapper
          3. Combiner
          4. Partitioner
          5. Shuffle and sort
          6. Reducer
        5. Real-world examples and use cases of MapReduce
          1. Social networks
          2. Media and entertainment
          3. E-commerce and websites
          4. Fraud detection and financial analytics
          5. Search engines and ad networks
          6. ETL and data analytics
        6. Software distributions built on the MapReduce framework
          1. Apache Hadoop
          2. MapR
          3. Cloudera distribution
        7. Summary
      11. 3. Apache Hadoop
        1. What is Apache Hadoop?
        2. Hadoop modules
        3. Hadoop Distributed File System
          1. Major architectural goals of HDFS
          2. Block replication and rack awareness
          3. The HDFS architecture
            1. NameNode
            2. DataNode
        4. Apache Hadoop MapReduce
          1. Hadoop MapReduce 1.x
            1. JobTracker
            2. TaskTracker
          2. Hadoop MapReduce 2.0
            1. Hadoop YARN
              1. How does YARN work?
                1. ResourceManager
                2. NodeManager
                3. ApplicationMaster
                4. Container
        5. Apache Hadoop as a platform
          1. Apache Pig
          2. Apache Hive
        6. Summary
      12. 4. Amazon EMR – Hadoop on Amazon Web Services
        1. What is AWS EMR?
          1. Features of EMR
          2. Accessing Amazon EMR features
          3. Programming on AWS EMR
        2. The EMR architecture
          1. Types of nodes
          2. EMR Job Flow and Steps
            1. Job Steps
              1. What if the Job Step fails?
            2. An EMR cluster
              1. Keep alive
              2. Termination protection
          3. Hadoop filesystem on EMR – S3 and HDFS
        3. EMR use cases
          1. Web log processing
          2. Clickstream analysis
          3. Product recommendation engine
          4. Scientific simulations
          5. Data transformations
        4. Summary
      13. 5. Programming Hadoop on Amazon EMR
        1. Hello World in Hadoop
          1. Development Environment Setup
            1. Step 1 – Installing the Eclipse IDE
            2. Step 2 – Downloading Hadoop 2.2.0
            3. Step 3 – Unzipping Hadoop Distribution
            4. Step 4 – Creating a new Java project in Eclipse
            5. Step 5 – Adding dependencies to the project
        2. Mapper implementation
          1. Setup
          2. Map
          3. Cleanup
          4. Run
        3. Reducer implementation
          1. Reduce
          2. Run
        4. Driver implementation
          1. Building a JAR
          2. Executing the solution locally
            1. Verifying the output
        5. Summary
      14. 6. Executing Hadoop Jobs on an Amazon EMR Cluster
        1. Creating an EC2 key pair
        2. Creating a S3 bucket for input data and JAR
        3. How to launch an EMR cluster
          1. Step 1 – Opening the Elastic MapReduce dashboard
          2. Step 2 – Creating an EMR cluster
          3. Step 3 – The cluster configuration
          4. Step 4 – Tagging an EMR cluster
          5. Step 5 – The software configuration
          6. Step 6 – The hardware configuration
            1. Network
            2. EC2 availability zone
            3. EC2 instance(s) configurations
          7. Step 7 – Security and access
          8. Step 8 – Adding Job Steps
        4. Viewing results
        5. Summary
      15. 7. Amazon EMR – Cluster Management
        1. EMR cluster management – different methods
        2. EMR bootstrap actions
          1. Configuring Hadoop
          2. Configuring daemons
          3. Run if
          4. Memory-intensive configuration
          5. Custom action
        3. EMR cluster monitoring and troubleshooting
          1. EMR cluster logging
            1. Hadoop logs
            2. Bootstrap action logs
            3. Job Step logs
            4. Cluster instance state logs
          2. Connecting to the master node
          3. Websites hosted on the master node
            1. Creating an SSH tunnel to the master node
            2. Configuring FoxyProxy
              1. Installing FoxyProxy in Google Chrome
              2. Creating a proxy setting
          4. EMR cluster performance monitoring
            1. Adding Ganglia to a cluster
            2. EMR cluster debugging – console
        4. EMR best practices
          1. Data transfer
          2. Data compression
          3. Cluster size and instance type
          4. Hadoop configuration and MapReduce tuning
          5. Cost optimization
        5. Summary
      16. 8. Amazon EMR – Command-line Interface Client
        1. EMR – CLI client installation
          1. Step 1 – Installing Ruby
          2. Step 2 – Installing and verifying RubyGems framework
          3. Step 3 – Installing an EMR CLI client
          4. Step 4 – Configuring AWS EMR credentials
          5. Step 5 – SSH access setup and configuration
          6. Step 6 – Verifying the EMR CLI installation
        2. Launching and monitoring an EMR cluster using CLI
          1. Launching an EMR cluster from command line
            1. Adding Job Steps to the cluster
            2. Listing and getting details of EMR clusters
            3. Terminating an EMR cluster
          2. Using spot instances with EMR
        3. Summary
      17. 9. Hadoop Streaming and Advanced Hadoop Customizations
        1. Hadoop streaming
          1. How streaming works
          2. Wordcount example with streaming
            1. Mapper
            2. Reducer
          3. Streaming command options
            1. Mandatory parameters
            2. Optional parameters
          4. Using a Java class name as mapper/reducer
          5. Using generic command options with streaming
          6. Customizing key-value splitting
          7. Using Hadoop partitioner class
          8. Using Hadoop comparator class
        2. Adding streaming Job Step on EMR
          1. Using the AWS management console
          2. Using the CLI client
            1. Launching a streaming cluster using the CLI client
        3. Advanced Hadoop customizations
          1. Custom partitioner
            1. Using a custom partitioner
          2. Custom sort comparator
            1. Using custom sort comparator
        4. Emitting results to multiple outputs
          1. Using MultipleOutputs
            1. Usage in the Driver class
            2. Usage in the Reducer class
            3. Emitting outputs in different directories based on key and value
        5. Summary
      18. 10. Use Case – Analyzing CloudFront Logs Using Amazon EMR
        1. Use case definition
        2. The solution architecture
        3. Creating the Hadoop Job Step
          1. Inputs and required libraries
            1. Input – CloudFront access logs
            2. Input – IP to city/country mapping database
            3. Required libraries
          2. Driver class implementation
          3. Mapper class implementation
          4. Reducer class implementation
          5. Testing the solution locally
          6. Executing the solution on EMR
        4. Output ingestion to a data store
        5. Using a visualization tool – Tableau Desktop
          1. Setting up Tableau Desktop
          2. Creating a new worksheet and connecting to the data store
          3. Creating a request count per country graph
          4. Other possible graphs
            1. Request count per HTTP status code
            2. Request count per edge location
            3. Bytes transferred per country
        6. Summary
      19. Index