Hadoop in the Enterprise: Architecture

Book Description

With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write), so you can take advantage of these technologies long before the official release of these titles. You'll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

This practical book is a comprehensive guide to integrating Hadoop successfully in the enterprise. You'll learn how to build a Hadoop infrastructure, architect an enterprise Hadoop platform, and even take Hadoop to the cloud.

Ideal for engineers, technical and enterprise architects, and technical leads, this guide includes many practical examples that architects can readily follow and that engineers, developers, and cluster operators can put into practice.

Table of Contents

  1. I. Infrastructure
  2. 1. Clusters
    1. Building Solutions
    2. Single vs. Many Clusters
    3. Multitenancy
    4. Backup & Disaster Recovery
    5. Cloud Services
    6. Provisioning
    7. Summary
  3. 2. Compute & Storage
    1. Computer Architecture for Hadoop
      1. Commodity Servers
      2. Non-Uniform Memory Access
      3. Server CPUs & RAM
      4. The Linux Storage Stack
    2. Server Form Factors
      1. Form Factor Price Comparison
    3. Workload Profiles
      1. Other Form Factors
    4. Cluster Configurations and Node Types
      1. Master Nodes
      2. Worker Nodes
      3. Utility Nodes
      4. Edge Nodes
      5. Small Cluster Configurations
      6. Medium Cluster Configurations
      7. Large Cluster Configurations
  4. 3. Organizational Challenges
    1. Who runs it?
    2. Is it infrastructure, middleware or an application?
    3. Case study: A typical Business Intelligence Project
      1. Solution Overview
      2. Typical Team Setup
      3. Compartmentalization of IT
      4. Revised Team Setup for Hadoop in the Enterprise
    4. Commercial Issues
      1. Service Offering, Multi-Tenancy and Chargeback
      2. Capital Invest and Service Vacancy
    5. Summary
  5. II. Platform
  6. 4. Platform Validation
    1. What is Platform Validation?
    2. Testing Methodology
    3. Useful Tools
    4. Hardware Validation
      1. CPU
      2. Disks
      3. Network
    5. Hadoop Validation
      1. HDFS Validation
      2. General Validation
    6. Validating Other Components
      1. YCSB
      2. TPC-DS and TPC-H
      3. Load Testing
      4. Specific Tools
    7. Summary
  7. 5. High Availability
    1. Planning for Failure
    2. What do we mean by High Availability?
      1. Lateral or Service HA
      2. Vertical or Systemic HA
      3. Automatic or Manual Failover
    3. How available does it need to be?
      1. Service Level Objectives
      2. Percentages
      3. Percentiles
    4. Operating for High Availability
      1. Monitoring
      2. Playbooks
    5. High Availability Building Blocks
      1. Quorums
      2. Load Balancing
      3. Database HA
      4. Ancillary Services
    6. High Availability of Hadoop Services
      1. General considerations
      2. ZooKeeper
      3. HDFS
      4. YARN
      5. HBase High Availability
      6. KMS
      7. Hive
      8. Impala
      9. Solr
      10. Oozie
      11. Flume
      12. Hue
      13. Laying out the Services
  8. III. Taking Hadoop to the Cloud
  9. 6. Automated provisioning
    1. Long-lived clusters
      1. Configuration and templating
      2. Phase 0—Environment configuration
      3. Phase 1—Instance provisioning
      4. Phase 2—Instance configuration
      5. Phase 3—Cluster installation and configuration
      6. Phase 4—Post-install tasks
      7. Vendor solutions
      8. One-Click Deployments
      9. Home-grown automation
      10. Hooking into a provisioning lifecycle
      11. Scaling up and down
      12. Deploying with security
    2. Transient Clusters
    3. Sharing metadata services
    4. Summary
  10. 7. Security in the Cloud
    1. Risk
    2. Threat Model
      1. Environmental Risks
      2. Deployment Risks
      3. Application Risks
      4. Mitigations
    3. Authentication and Authorization
      1. Where to run the identity service?
      2. Cloud Service Security
    4. Auditing
    5. Encryption for data in flight
    6. Encryption for data at Rest
      1. Requirements for encryption
      2. Options for encryption in the cloud
      3. On-Premise Key Persistence
      4. Encryption via the Cloud Provider
      5. Server Side and Client Side Encryption
      6. Bring Your Own Key
      7. Encryption in Amazon Web Services
      8. Encryption in Microsoft Azure
      9. Encryption in Google Cloud Platform
      10. Recommendations and Summary for Cloud Encryption
    7. Perimeter Controls
      1. General Concepts
      2. Google Cloud Platform
      3. AWS
      4. Azure
      5. Summary
    8. Summary
  11. 8. High Availability in the Cloud
    1. Why do I need HA in the cloud?
    2. Availability of Compute
      1. Cluster Availability
      2. Node Availability
    3. Data Availability
      1. Block Storage
      2. Object Storage
      3. Storage Summary
    4. Network Availability
    5. Service Availability
      1. Databases
      2. Load Balancers
    6. Summary