Pro Apache Hadoop, Second Edition

Book Description

Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop, the framework of big data. Revised for Hadoop 2.0, the book covers the latest developments, such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federation. All the old content has been revised too, giving the latest on the ins and outs of MapReduce, cluster design, the Hadoop Distributed File System, and more.

This book covers everything you need to build your first Hadoop cluster and begin analyzing and deriving value from your business and scientific data. Learn to solve big-data problems the MapReduce way, by breaking a big problem into chunks and creating small-scale solutions that can be flung across thousands upon thousands of nodes to analyze large data volumes in a short amount of wall-clock time. Learn how to let Hadoop take care of distributing and parallelizing your software; you just focus on the code, and Hadoop takes care of the rest.

  • Covers all that is new in Hadoop 2.0
  • Written by a professional involved in Hadoop since day one
  • Takes you quickly to the seasoned pro level on the hottest cloud-computing framework

Table of Contents

    1. Cover
    2. Title
    3. Copyright
    4. Dedication
    5. Contents at a Glance
    6. Contents
    7. About the Authors
    8. About the Technical Reviewer
    9. Acknowledgments
    10. Introduction
    11. Chapter 1 : Motivation for Big Data
      1. What Is Big Data?
      2. Key Idea Behind Big Data Techniques
        1. Data Is Distributed Across Several Nodes
        2. Applications Are Moved to the Data
        3. Data Is Processed Local to a Node
        4. Sequential Reads Preferred Over Random Reads
        5. An Example
      3. Big Data Programming Models
        1. Massively Parallel Processing (MPP) Database Systems
        2. In-Memory Database Systems
        3. MapReduce Systems
        4. Bulk Synchronous Parallel (BSP) Systems
      4. Big Data and Transactional Systems
      5. How Much Can We Scale?
        1. A Compute-Intensive Example
        2. Amdahl’s Law
      6. Business Use-Cases for Big Data
      7. Summary
    12. Chapter 2 : Hadoop Concepts
      1. Introducing Hadoop
      2. Introducing the MapReduce Model
      3. Components of Hadoop
        1. Hadoop Distributed File System (HDFS)
        2. Secondary NameNode
        3. TaskTracker
        4. JobTracker
      4. Hadoop 2.0
        1. Components of YARN
      5. HDFS High Availability
      6. Summary
    13. Chapter 3 : Getting Started with the Hadoop Framework
      1. Types of Installation
        1. Stand-Alone Mode
        2. Pseudo-Distributed Cluster
        3. Multinode Cluster Installation
        4. Preinstalled Using Amazon Elastic MapReduce
      2. Setting up a Development Environment with a Cloudera Virtual Machine
      3. Components of a MapReduce Program
      4. Your First Hadoop Program
        1. Prerequisites to Run Programs in Local Mode
        2. WordCount Using the Old API
        3. Building the Application
        4. Running WordCount in Cluster Mode
        5. WordCount Using the New API
        6. Building the Application
        7. Running WordCount in Cluster Mode
      5. Third-Party Libraries in Hadoop Jobs
      6. Summary
    14. Chapter 4 : Hadoop Administration
      1. Hadoop Configuration Files
      2. Configuring Hadoop Daemons
      3. Precedence of Hadoop Configuration Files
      4. Diving into Hadoop Configuration Files
        1. core-site.xml
        2. hdfs-*.xml
        3. mapred-site.xml
        4. yarn-site.xml
        5. Memory Allocations in YARN
      5. Scheduler
        1. Capacity Scheduler
        2. Fair Scheduler
        3. Fair Scheduler Configuration
        4. yarn-site.xml Configurations
        5. Allocation File Format and Configurations
        6. Determine Dominant Resource Share in drf Policy
      6. Slaves File
      7. Rack Awareness
        1. Providing Hadoop with Network Topology
      8. Cluster Administration Utilities
        1. Check the HDFS
        2. Command-Line HDFS Administration
        3. Rebalancing HDFS Data
        4. Copying Large Amounts of Data from the HDFS
      9. Summary
    15. Chapter 5 : Basics of MapReduce Development
      1. Hadoop and Data Processing
      2. Reviewing the Airline Dataset
        1. Preparing the Development Environment
        2. Preparing the Hadoop System
      3. MapReduce Programming Patterns
        1. Map-Only Jobs (SELECT and WHERE Queries)
        2. Problem Definition: SELECT Clause
        3. Problem Definition: WHERE Clause
        4. Map and Reduce Jobs (Aggregation Queries)
        5. Problem Definition: GROUP BY and SUM Clauses
        6. Improving Aggregation Performance Using the Combiner
        7. Problem Definition: Optimized Aggregators
        8. Role of the Partitioner
        9. Problem Definition: Split Airline Data by Month
      4. Bringing It All Together
      5. Summary
    16. Chapter 6 : Advanced MapReduce Development
      1. MapReduce Programming Patterns
        1. Introduction to Hadoop I/O
        2. Problem Definition: Sorting
        3. Problem Definition: Analyzing Consecutive Records
        4. Problem Definition: Join Using MapReduce
        5. Problem Definition: Join Using Map-Only Jobs
        6. Writing to Multiple Output Files in a Single MR Job
        7. Collecting Statistics Using Counters
      2. Summary
    17. Chapter 7 : Hadoop Input/Output
      1. Compression Schemes
        1. What Can Be Compressed?
        2. Compression Schemes
        3. Enabling Compression
      2. Inside the Hadoop I/O Processes
        1. InputFormat
        2. OutputFormat
        3. Custom OutputFormat: Conversion from Text to XML
        4. Custom InputFormat: Consuming a Custom XML file
      3. Hadoop Files
        1. SequenceFile
        2. MapFiles
        3. Avro Files
      4. Summary
    18. Chapter 8 : Testing Hadoop Programs
      1. Revisiting the Word Counter
      2. Introducing MRUnit
        1. Installing MRUnit
        2. MRUnit Core Classes
        3. Writing an MRUnit Test Case
        4. Testing Counters
        5. Features of MRUnit
        6. Limitations of MRUnit
      3. Testing with LocalJobRunner
        1. Limitations of LocalJobRunner
      4. Testing with MiniMRCluster
        1. Setting up the Development Environment
        2. Example for MiniMRCluster
        3. Limitations of MiniMRCluster
      5. Testing MR Jobs with Access Network Resources
      6. Summary
    19. Chapter 9 : Monitoring Hadoop
      1. Writing Log Messages in Hadoop MapReduce Jobs
      2. Viewing Log Messages in Hadoop MapReduce Jobs
      3. User Log Management in Hadoop 2.x
        1. Log Storage in Hadoop 2.x
        2. Log Management Improvements
        3. Viewing Logs Using Web-Based UI
        4. Command-Line Interface
        5. Log Retention
      4. Hadoop Cluster Performance Monitoring
      5. Using YARN REST APIs
      6. Managing the Hadoop Cluster Using Vendor Tools
        1. Ambari Architecture
      7. Summary
    20. Chapter 10 : Data Warehousing Using Hadoop
      1. Apache Hive
        1. Installing Hive
        2. Hive Architecture
        3. Metastore
        4. Compiler Basics
        5. Hive Concepts
        6. HiveQL Compiler Details
        7. Data Definition Language
        8. Data Manipulation Language
        9. External Interfaces
        10. Hive Scripts
        11. Performance
        12. MapReduce Integration
        13. Creating Partitions
        14. User-Defined Functions
      2. Impala
        1. Impala Architecture
        2. Impala Features
        3. Impala Limitations
      3. Shark
        1. Shark/Spark Architecture
      4. Summary
    21. Chapter 11 : Data Processing Using Pig
      1. An Introduction to Pig
      2. Running Pig
        1. Executing in the Grunt Shell
        2. Executing a Pig Script
        3. Embedded Java Program
      3. Pig Latin
        1. Comments in a Pig Script
        2. Execution of Pig Statements
        3. Pig Commands
      4. User-Defined Functions
        1. Eval Functions Invoked in the Mapper
        2. Eval Functions Invoked in the Reducer
        3. Writing and Using a Custom FilterFunc
      5. Comparison of Pig versus Hive
      6. Crunch API
        1. How Crunch Differs from Pig
        2. Sample Crunch Pipeline
      7. Summary
    22. Chapter 12 : HCatalog and Hadoop in the Enterprise
      1. HCatalog and Enterprise Data Warehouse Users
      2. HCatalog: A Brief Technical Background
        1. HCatalog Command-Line Interface
        2. WebHCat
        3. HCatalog Interface for MapReduce
        4. HCatalog Interface for Pig
        5. HCatalog Notification Interface
      3. Security and Authorization in HCatalog
      4. Bringing It All Together
      5. Summary
    23. Chapter 13 : Log Analysis Using Hadoop
      1. Log File Analysis Applications
        1. Web Analytics
        2. Security Compliance and Forensics
        3. Monitoring and Alerts
        4. Internet of Things
      2. Analysis Steps
        1. Load
        2. Refine
        3. Visualize
      3. Apache Flume
        1. Core Concepts
      4. Netflix Suro
      5. Cloud Solutions
      6. Summary
    24. Chapter 14 : Building Real-Time Systems Using HBase
      1. What Is HBase?
      2. Typical HBase Use-Case Scenarios
      3. HBase Data Model
        1. HBase Logical or Client-Side View
        2. Differences Between HBase and RDBMSs
        3. HBase Tables
        4. HBase Cells
        5. HBase Column Family
      4. HBase Commands and APIs
        1. Getting a Command List: help Command
        2. Creating a Table: create Command
        3. Adding Rows to a Table: put Command
        4. Retrieving Rows from the Table: get Command
        5. Reading Multiple Rows: scan Command
        6. Counting the Rows in the Table: count Command
        7. Deleting Rows: delete Command
        8. Truncating a Table: truncate Command
        9. Dropping a Table: drop Command
        10. Altering a Table: alter Command
      5. HBase Architecture
        1. HBase Components
        2. Compaction and Splits in HBase
        3. Compaction
      6. HBase Configuration: An Overview
        1. hbase-default.xml and hbase-site.xml
      7. HBase Application Design
        1. Tall vs. Wide vs. Narrow Table Design
        2. Row Key Design
      8. HBase Operations Using Java API
        1. HBase Treats Everything as Bytes
        2. Create an HBase Table
        3. Administrative Functions Using HBaseAdmin
        4. Accessing Data Using the Java API
      9. HBase MapReduce Integration
      10. A MapReduce Job to Read an HBase Table
      11. HBase and MapReduce Clusters
        1. Scenario I: Frequent MapReduce Jobs Against HBase Tables
        2. Scenario II: HBase and MapReduce have Independent SLAs
      12. Summary
    25. Chapter 15 : Data Science with Hadoop
      1. Hadoop Data Science Methods
      2. Apache Hama
        1. Bulk Synchronous Parallel Model
        2. Hama Hello World!
        3. Monte Carlo Methods
        4. K-Means Clustering
      3. Apache Spark
        1. Resilient Distributed Datasets (RDDs)
        2. Monte Carlo with Spark
        3. KMeans with Spark
      4. RHadoop
      5. Summary
    26. Chapter 16 : Hadoop in the Cloud
      1. Economics
        1. Self-Hosted Cluster
        2. Cloud-Hosted Cluster
        3. Elasticity
        4. On Demand
        5. Bid Pricing
        6. Hybrid Cloud
      2. Logistics
        1. Ingress/Egress
        2. Data Retention
      3. Security
      4. Cloud Usage Models
      5. Cloud Providers
        1. Amazon Web Services
        2. Google Cloud Platform
        3. Microsoft Azure
        4. Choosing a Cloud Vendor
      6. Case Study: Amazon Web Services
        1. Elastic MapReduce
        2. Elastic Compute Cloud
      7. Summary
    27. Chapter 17 : Building a YARN Application
      1. YARN: A General-Purpose Distributed System
      2. YARN: A Quick Review
      3. Creating a YARN Application
        1. POM Configuration
      4. DownloadService.java Class
      5. Client.java
        1. Steps to Launch the Application Master from the Client
      6. ApplicationMaster.java
        1. Communication Protocol between Application Master and Resource Manager: Application Master Protocol
        2. Node Manager Communication Protocol: Container Management Protocol
        3. Steps to Launch the Worker Tasks
      7. Executing the Application Master
        1. Launch the Application in Un-Managed Mode
        2. Launch the Application in Managed Mode
      8. Summary
    28. Appendix A: Installing Hadoop
      1. Installing Hadoop 2.2.0 on Windows
        1. Preparing the Installation Environment
        2. Building Hadoop 2.2.0 for Windows
        3. Installing Hadoop 2.2.0 for Windows
        4. Configuring Hadoop 2.2.0
        5. Preparing the Hadoop Cluster
        6. Starting HDFS
        7. Starting MapReduce (YARN)
        8. Verifying that the Cluster Is Running
        9. Testing the Cluster
      2. Installing Hadoop 2.2.0 on Linux
    29. Appendix B: Using Maven with Eclipse
      1. A Quick Introduction to Maven
        1. Creating a Maven Project
      2. Using Maven with Eclipse
        1. Installing the m2e Maven Eclipse Plug-in
        2. Creating a Maven Project from Eclipse
        3. Building a Maven Project from Eclipse
    30. Appendix C: Apache Ambari
      1. Hadoop Components Supported by Apache Ambari
      2. Installing Apache Ambari
      3. Trying the Ambari Sandbox on Your OS
    31. Index