Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

Book Description

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility

Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.

First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.

Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.

Coverage includes the following:

• Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop
• Understanding YARN resource management, HDFS storage, and I/O
• Designing data ingestion, movement, and organization for modern enterprise data platforms
• Defining SQL engine strategies to meet strict SLAs
• Considering security, data isolation, and scheduling for multitenant environments
• Deploying Hadoop as a service in the cloud
• Reviewing the essential concepts, capabilities, and terminology of virtualization
• Applying current best practices, guidelines, and key metrics for Hadoop virtualization
• Managing multiple Hadoop frameworks and products as one unified system
• Virtualizing master and worker nodes to maximize availability and performance
• Installing and configuring Linux for a Hadoop environment

Table of Contents

  1. About This eBook
  2. Title Page
  3. Copyright Page
  4. We Want to Hear from You!
  5. Reader Services
  6. Dedication Page
  7. About the Authors
  8. Contributor
  9. About the Technical Editor
  10. Acknowledgments
  11. Contents at a Glance
  12. Contents
  13. Foreword
  14. Preface
    1. Motivation for Writing This Book
    2. Prerequisites
    3. Who Should Read This Book
    4. How to Use This Book
  15. Part I: Introduction to Hadoop
    1. Chapter 1. Understanding the Big Data World
      1. The Data Revolution
      2. Traditional Data Systems
        1. Semi-Structured and Unstructured Data
        2. Causation and Correlation
        3. Data Challenges
      3. The Modern Data Architecture
        1. Organizational Transformation
      4. Industry Transformation
      5. Summary
    2. Chapter 2. Hadoop Fundamental Concepts
      1. Types of Data in Hadoop
      2. Use Cases
      3. What Is Hadoop?
      4. Hadoop Distributions
      5. Hadoop Frameworks
      6. NoSQL Databases
        1. What Is NoSQL?
      7. A Hadoop Cluster
      8. Hadoop Software Processes
        1. Hadoop Hardware Profiles
      9. Roles in the Hadoop Environment
      10. Summary
    3. Chapter 3. YARN and HDFS
      1. A Hadoop Cluster Is Distributed
      2. Hadoop Directory Layouts
        1. Hadoop Operating System Users
      3. The Hadoop Distributed File System
        1. YARN Logging
        2. The NameNode
        3. The DataNode
        4. Block Placement
        5. NameNode Configurations and Managing Metadata
      4. Rack Awareness
        1. Block Management
        2. The Balancer
        3. Maintaining Data Integrity in the Cluster
        4. Quotas and Trash
      5. YARN and the YARN Processing Model
        1. Running Applications on YARN
        2. Resource Schedulers
        3. Benchmarking
        4. TeraSort Benchmarking Suite
      6. Summary
    4. Chapter 4. The Modern Data Platform
      1. Designing a Hadoop Cluster
        1. Enterprise Data Movement
      2. Summary
    5. Chapter 5. Data Ingestion
      1. Extraction, Loading, and Transformation (ELT)
        1. Sqoop: Data Movement with SQL Sources
        2. Flume: Streaming Data
        3. Oozie: Scheduling and Workflow
        4. Falcon: Data Lifecycle Management
        5. Kafka: Real-time Data Streaming
      2. Summary
    6. Chapter 6. Hadoop SQL Engines
      1. Where SQL Was Born
      2. SQL in Hadoop
      3. Hadoop SQL Engines
        1. Selecting the SQL Tool For Hadoop
      4. Now Getting Groovy with Hive and Pig
        1. Hive
        2. HCatalog
        3. Pig
      5. Summary
    7. Chapter 7. Multitenancy in Hadoop
      1. Securing the Access
        1. Authentication
        2. Auditing
        3. Authorization
        4. Data Protection
        5. Isolating the Data
        6. Isolating the Process
      2. Summary
  16. Part II: Introduction to Virtualization
    1. Chapter 8. Virtualization Fundamentals
      1. Why Virtualize Hadoop?
        1. Introduction to Virtualization
      2. Summary
      3. References
    2. Chapter 9. Best Practices for Virtualizing Hadoop
      1. Running Virtualized Hadoop with Purpose and Discipline
        1. The Discipline of Purpose Starts with a Clear Target
        2. Virtualizing Different Tiers of Hadoop
        3. Industry Best Practices
      2. Summary
  17. Part III: Virtualizing Hadoop
    1. Chapter 10. Virtualizing Hadoop
      1. How Are Hadoop Ecosystems Going to Be Managed?
        1. Building an Enterprise Hadoop Platform That Is Agile and Flexible
        2. Clarification of Terms
        3. The Journey from Bare-Metal to Virtualization
      2. Why Consider Virtualizing Hadoop?
        1. Benefits of Virtualizing Hadoop
        2. Virtualized Hadoop Can Run as Fast or Faster Than Native
        3. Coordination and Cross-Purpose Specialization Is the Future
        4. Barriers Can Be Organizational
        5. Virtualization Is Not an All or Nothing Option
        6. Rapid Provisioning and Improving Quality of Development and Test Environments
        7. Improve High Availability with Virtualization
        8. Use Virtualization to Leverage Hadoop Workloads
        9. Hadoop in the Cloud
        10. Big Data Extensions
        11. The Path to Virtualization
        12. The Software-Defined Data Center
        13. Virtualizing the Network
        14. vRealize Suite
      3. Summary
      4. References
    2. Chapter 11. Virtualizing Hadoop Master Servers
      1. Virtualizing Servers in a Hadoop Cluster
        1. Virtualizing the Environment Around Hadoop
        2. Virtualizing the Master Hadoop Servers
        3. Virtualizing Without the SAN
      2. Summary
    3. Chapter 12. Virtualizing the Hadoop Worker Nodes
      1. A Brief Introduction to the Worker Nodes in Hadoop
      2. Deployment Models for Hadoop Clusters
        1. The Combined Model
        2. The Separated Model
        3. Network Effects of the Data-Compute Separation
        4. The Shared-Storage Approach to the Data-Compute Separated Model
        5. Local Disks for the Application’s Temporary Data
        6. The Shared Storage Architecture Model Using Network-Attached Storage (NAS)
        7. Deployment Model Summary
      3. Best Practices for Virtualizing Hadoop Workers
        1. Disk I/O
      4. The Hadoop Virtualization Extensions (HVE)
      5. Summary
      6. References
      7. Resources
    4. Chapter 13. Deploying Hadoop as a Service in the Private Cloud
      1. The Cloud Context
        1. Stakeholders for Hadoop
        2. Overview of the Solution Architecture
      2. Summary
      3. References
    5. Chapter 14. Understanding the Installation of Hadoop
      1. Map the Right Solutions to the Right Use Case
        1. Thoughts About Installing Hadoop
      2. Configuring Repositories
        1. Installing HDP 2.2
        2. Environment Preparation
      3. Setting Up the Hadoop Configuration
      4. Starting HDFS and YARN
        1. Start YARN
        2. Verifying MapReduce Functionality
      5. Installing and Configuring Hive
      6. Installing and Configuring MySQL Database
      7. Installing and Configuring Hive and HCatalog
      8. Summary
    6. Chapter 15. Configuring Linux for Hadoop
      1. Supported Linux Platforms
      2. Different Deployment Models
      3. Linux Golden Templates
        1. Building a Linux Enterprise Hadoop Platform
        2. Selecting the Linux Distribution
      4. Optimal Linux Kernel Parameters and System Settings
        1. epoll
        2. Disable Swap Space
        3. Disable Security During Install
        4. IO Scheduler Tuning
        5. Check Transparent Huge Pages Configuration
        6. Limits.conf
        7. Partition Alignment for RDMs
        8. File System Considerations
        9. Lazy Count Parameter for XFS
        10. Mount Options
        11. I/O Scheduler
        12. Disk Read and Write Options
        13. Storage Benchmarking
        14. Java Version
        15. Set Up NTP
        16. Enable Jumbo Frames
        17. Additional Network Considerations
      5. Summary
  18. Appendix A. Hadoop Cluster Creation: A Prerequisite Checklist
  19. Appendix B. Big Data/Hadoop on VMware vSphere Reference Materials
    1. Deployment Guides
    2. Reference Architectures
    3. Customer Case Studies
    4. Performance
    5. vSphere Big Data Extensions (BDE)
    6. Other vSphere Features and Big Data
  20. Index
  21. Code Snippets