Moving Hadoop to the Cloud

Book Description

Until recently, Hadoop deployments existed on hardware owned and run by organizations. Now, of course, you can acquire the computing resources and network connectivity to run Hadoop clusters in the cloud. But there’s a lot more to deploying Hadoop to the public cloud than simply renting machines.

This hands-on guide shows developers and systems administrators familiar with Hadoop how to install, use, and manage cloud-born clusters efficiently. You’ll learn how to architect clusters that work with cloud-provider features—not just to avoid pitfalls, but also to take full advantage of these services. You’ll also compare the Amazon, Google, and Microsoft clouds, and learn how to set up clusters in each of them.

  • Learn how Hadoop clusters run in the cloud, the problems they can help you solve, and their potential drawbacks
  • Examine the common concepts of cloud providers, including compute capabilities, networking and security, and storage
  • Build a functional Hadoop cluster on cloud infrastructure, and learn what the major providers require
  • Explore use cases for high availability, relational data with Hive, and complex analytics with Spark
  • Get patterns and practices for running cloud clusters, from designing for price and security to dealing with maintenance

Table of Contents

  1. Foreword
  2. Preface
    1. Who This Book Is For
    2. What You Should Already Know
    3. What This Book Leaves Out
    4. How This Book Works
    5. Which Software Versions This Book Uses
    6. Conventions Used in This Book
      1. IP Addresses
    7. Using Code Examples
    8. O’Reilly Safari
    9. How to Contact Us
    10. Acknowledgments
  3. I. Introduction to the Cloud
  4. 1. Why Hadoop in the Cloud?
    1. What Is the Cloud?
    2. What Does Hadoop in the Cloud Mean?
    3. Reasons to Run Hadoop in the Cloud
    4. Reasons to Not Run Hadoop in the Cloud
      1. What About Security?
    5. Hybrid Clouds
    6. Hadoop Solutions from Cloud Providers
      1. Elastic MapReduce
      2. Google Cloud Dataproc
      3. HDInsight
      4. Hadoop-Like Services
      5. A Spectrum of Choices
    7. Getting Started
  5. 2. Overview and Comparison of Cloud Providers
    1. Amazon Web Services
      1. References
    2. Google Cloud Platform
      1. References
    3. Microsoft Azure
      1. References
    4. Which One Should You Use?
  6. II. Cloud Primer
  7. 3. Instances
    1. Instance Types
    2. Regions and Availability Zones
    3. Instance Control
    4. Temporary Instances
      1. Spot Instances
      2. Preemptible Instances
    5. Images
    6. No Instance Is an Island
  8. 4. Networking and Security
    1. A Drink of CIDR
    2. Virtual Networks
      1. Private DNS
      2. Public IP Addresses and DNS
    3. Virtual Networks and Regions
    4. Routing
      1. Routing in AWS
      2. Routing in Google Cloud Platform
      3. Routing in Azure
    5. Network Security Rules
      1. Inbound Versus Outbound
      2. Allow Versus Deny
      3. Network Security Rules in AWS
      4. Network Security Rules in Google Cloud Platform
      5. Network Security Rules in Azure
    6. Putting Networking and Security Together
    7. What About the Data?
  9. 5. Storage
    1. Block Storage
      1. Block Storage in AWS
      2. Block Storage in Google Cloud Platform
      3. Block Storage in Azure
    2. Object Storage
      1. Buckets
      2. Data Objects
      3. Object Access
      4. Object Storage in AWS
      5. Object Storage in Google Cloud Platform
      6. Object Storage in Azure
    3. Cloud Relational Databases
      1. Cloud Relational Databases in AWS
      2. Cloud Relational Databases in Google Cloud Platform
      3. Cloud Relational Databases in Azure
    4. Cloud NoSQL Databases
    5. Where to Start?
  10. III. A Simple Cluster in the Cloud
  11. 6. Setting Up in AWS
    1. Prerequisites
    2. Allocating Instances
      1. Generating a Key Pair
      2. Launching Instances
    3. Securing the Instances
    4. Next Steps
  12. 7. Setting Up in Google Cloud Platform
    1. Prerequisites
    2. Creating a Project
    3. Allocating Instances
      1. SSH Keys
      2. Creating Instances
    4. Securing the Instances
    5. Next Steps
  13. 8. Setting Up in Azure
    1. Prerequisites
    2. Creating a Resource Group
    3. Creating Resources
    4. SSH Keys
    5. Creating Virtual Machines
      1. The Manager Instance
      2. The Worker Instances
    6. Next Steps
  14. 9. Standing Up a Cluster
    1. The JDK
    2. Hadoop Accounts
    3. Passwordless SSH
    4. Hadoop Installation
    5. HDFS and YARN Configuration
      1. The Environment
      2. XML Configuration Files
      3. Finishing Up Configuration
    6. Startup
    7. SSH Tunneling
    8. Running a Test Job
      1. What If the Job Hangs?
    9. Running Basic Data Loading and Analysis
      1. Wikipedia Exports
      2. Analyzing a Small Export
    10. Go Bigger
  15. IV. Enhancing Your Cluster
  16. 10. High Availability
    1. Planning HA in the Cloud
      1. HDFS HA
      2. YARN HA
    2. Installing and Configuring ZooKeeper
    3. Adding New HDFS and YARN Daemons
      1. The Second Manager
      2. HDFS HA Configuration
      3. YARN HA Configuration
    4. Testing HA
    5. Improving the HA Configuration
      1. A Bigger Cluster
      2. Complete HA
      3. A Third Availability Zone?
    6. Benchmarking HA
      1. MRBench
      2. Terasort
      3. Grains of Salt
  17. 11. Relational Data with Apache Hive
    1. Planning for Hive in the Cloud
    2. Installing and Configuring Hive
    3. Startup
    4. Running Some Test Hive Queries
    5. Switching to a Remote Metastore
      1. The Remote Metastore and Stopped Clusters
    6. Hive Control Scripts
    7. Hive on S3
      1. Configuring the S3 Filesystem
      2. Adding Data to S3
      3. Configuring S3 Authentication
      4. Configuring the S3 Endpoint
      5. External Table in S3
    8. What About Google Cloud Platform and Azure?
    9. A Step Toward Transient Clusters
    10. A Different Means of Computation
  18. 12. Streaming in the Cloud with Apache Spark
    1. Planning for Spark in the Cloud
    2. Installing and Configuring Spark
    3. Startup
    4. Running Some Test Jobs
    5. Configuring Hive on Spark
      1. Add Spark Libraries to Hive
      2. Configure Hive for Spark
      3. Switch YARN to the Fair Scheduler
      4. Try Out Hive on Spark on YARN
    6. Spark Streaming from AWS Kinesis
      1. Creating a Kinesis Stream
      2. Populating the Stream with Data
      3. Streaming Kinesis Data into Spark
    7. What About Google Cloud Platform and Azure?
    8. Building Clusters Versus Building Clusters Well
  19. V. Care and Feeding of Hadoop in the Cloud
  20. 13. Pricing and Performance
    1. Picking Instance Types
      1. The Criteria
      2. General Cluster Instance Roles
    2. Persistent Versus Ephemeral Block Storage
    3. Stopping and Starting Entire Clusters
    4. Using Temporary Instances
    5. Geographic Considerations
      1. Regions
      2. Availability Zones
    6. Performance and Networking
  21. 14. Network Topologies
    1. Public and Private Subnets
      1. SSH Tunneling
      2. SOCKS Proxy
      3. VPN Access
      4. Access from Other Subnets
    2. Cluster Topologies
      1. The Public Cluster
      2. The Secured Public Cluster
      3. Gateway Instances
      4. The Private Cluster
      5. Cluster Access to the Internet and Cloud Provider Services
    3. Geographic Considerations
      1. Regions
      2. Availability Zones
    4. Starting Topologies
    5. Higher-Level Planning
  22. 15. Patterns for Cluster Usage
    1. Long-Running or Transient?
    2. Single-User or Multitenant?
    3. Self-Service or Managed?
    4. Cloud-Only or Hybrid?
    5. Watching Cost
    6. The Rising Need for Automation
  23. 16. Using Images for Cluster Management
    1. The Structure of an Image
      1. EC2 Images
      2. GCE Images
      3. Azure Images
    2. Image Preparation
      1. Wait, I’m Using That!
    3. Image Creation
      1. Image Creation in AWS
      2. Image Creation in Google Cloud Platform
      3. Image Creation in Azure
    4. Image Use
      1. Scripting Hadoop Configuration
    5. Image Maintenance
    6. Image Deletion
      1. Image Deletion in AWS
      2. Image Deletion in Google Cloud Platform
      3. Image Deletion in Azure
    7. Automated Image Creation with Packer
    8. Automated Cloud Cluster Creation
      1. Cloudera Director
      2. Hortonworks Data Cloud
      3. Qubole Data Service
      4. General System Management Tools
    9. Images or Tools?
    10. More Tooling
  24. 17. Monitoring and Automation
    1. Monitoring Choices
      1. Cloud Provider Monitoring Services
      2. Rolling Your Own
    2. Cloud Provider Command-Line Interfaces
      1. AWS CLI
      2. Google Cloud Platform CLI
      3. Azure CLI
      4. Data Formatting for CLI Results
    3. What to Monitor
      1. Instance Existence
      2. Instance Reachability
      3. Hadoop Daemon Status
      4. System Load
      5. Putting Scripting to Use
    4. Custom Metrics in CloudWatch
      1. Basic Metrics
      2. Defining a Custom Metric
      3. Feeding Custom Metric Data to CloudWatch
      4. Setting an Alarm on a Custom Metric
    5. Elastic Compute Using a Custom Metric
      1. A Custom Metric for Compute Capacity
      2. Prerequisites for Autoscaling Compute
      3. Triggering Autoscaling with an Alarm Action
      4. What About Shrinking?
      5. Other Things to Watch
    6. Ingesting Logs into CloudWatch
      1. Creating an IAM User for Log Streaming
      2. Installing the CloudWatch Agent
      3. Creating a Metric Filter
      4. Creating an Alarm from a Metric Filter
    7. So Much More to See and Do
  25. 18. Backup and Restoration
    1. Patterns to Supplement Backups
    2. Backup via Imaging
    3. HDFS Replication
      1. Cloud Storage Filesystems
      2. HDFS Snapshots
    4. Hive Metastore Replication
    5. Logs
    6. A General Cloud Hadoop Backup Strategy
    7. Not So Different, But Better
    8. To the Cloud
  26. A. Hadoop Component Start and Stop Scripts
    1. Apache ZooKeeper
    2. Apache Hive
  27. B. Hadoop Cluster Configuration Scripts
    1. SSH Key Creation and Distribution
    2. Configuration Update Script
      1. New Worker Configuration Update Script
  28. C. Monitoring Cloud Clusters with Nagios
    1. Where Nagios Should Run
    2. Instance Existence Through Ping
    3. Hosts and Host Groups
    4. Services and Service Groups
    5. Provider CLI Integration
  29. Index