
Hadoop: The Definitive Guide, 3rd Edition

By Tom White. Published by O'Reilly Media, Inc.
  1. Hadoop: The Definitive Guide
  2. Dedication
  3. Foreword
  4. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. What’s New in the Second Edition?
    4. What’s New in the Third Edition?
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  5. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. Relational Database Management System
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. Apache Hadoop and the Hadoop Ecosystem
    6. Hadoop Releases
      1. What’s Covered in This Book
      2. Compatibility
  6. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  7. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
      3. HDFS Federation
      4. HDFS High-Availability
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
      3. Writing Data
      4. Directories
      5. Querying the Filesystem
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
    7. Data Ingest with Flume and Sqoop
    8. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    9. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  8. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
      2. Compression and Input Splits
      3. Using Compression in MapReduce
    3. Serialization
      1. The Writable Interface
      2. Writable Classes
      3. Implementing a Custom Writable
      4. Serialization Frameworks
    4. Avro
      1. Avro Data Types and Schemas
      2. In-Memory Serialization and Deserialization
      3. Avro Datafiles
      4. Interoperability
      5. Schema Resolution
      6. Sort Order
      7. Avro MapReduce
      8. Sorting Using Avro MapReduce
      9. Avro MapReduce in Other Languages
    5. File-Based Data Structures
      1. SequenceFile
      2. MapFile
  9. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Setting Up the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test with MRUnit
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging a Job
      2. Launching a Job
      3. The MapReduce Web UI
      4. Retrieving the Results
      5. Debugging a Job
      6. Hadoop Logs
      7. Remote Debugging
    6. Tuning a Job
      1. Profiling Tasks
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. JobControl
      3. Apache Oozie
  10. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Classic MapReduce (MapReduce 1)
      2. YARN (MapReduce 2)
    2. Failures
      1. Failures in Classic MapReduce
      2. Failures in YARN
    3. Job Scheduling
      1. The Fair Scheduler
      2. The Capacity Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. The Task Execution Environment
      2. Speculative Execution
      3. Output Committers
      4. Task JVM Reuse
      5. Skipping Bad Records
  11. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
    2. Input Formats
      1. Input Splits and Records
      2. Text Input
      3. Binary Input
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
      3. Multiple Outputs
      4. Lazy Output
      5. Database Output
  12. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
      2. User-Defined Java Counters
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
      3. Total Sort
      4. Secondary Sort
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
    5. MapReduce Library Classes
  13. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
      2. Environment Settings
      3. Important Hadoop Daemon Properties
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
      6. User Account Creation
    5. YARN Configuration
      1. Important YARN Daemon Properties
      2. YARN Daemon Addresses and Ports
    6. Security
      1. Kerberos and Hadoop
      2. Delegation Tokens
      3. Other Security Enhancements
    7. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
      2. User Jobs
    8. Hadoop in the Cloud
      1. Apache Whirr
  14. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
      2. Safe Mode
      3. Audit Logging
      4. Tools
    2. Monitoring
      1. Logging
      2. Metrics
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
      2. Commissioning and Decommissioning Nodes
      3. Upgrades
  15. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
      6. Functions
      7. Macros
    5. User-Defined Functions
      1. A Filter UDF
      2. An Eval UDF
      3. A Load UDF
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
      3. Grouping and Joining Data
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
  16. 12. Hive
    1. Installing Hive
      1. The Hive Shell
    2. An Example
    3. Running Hive
      1. Configuring Hive
      2. Hive Services
      3. The Metastore
    4. Comparison with Traditional Databases
      1. Schema on Read Versus Schema on Write
      2. Updates, Transactions, and Indexes
    5. HiveQL
      1. Data Types
      2. Operators and Functions
    6. Tables
      1. Managed Tables and External Tables
      2. Partitions and Buckets
      3. Storage Formats
      4. Importing Data
      5. Altering Tables
      6. Dropping Tables
    7. Querying Data
      1. Sorting and Aggregating
      2. MapReduce Scripts
      3. Joins
      4. Subqueries
      5. Views
    8. User-Defined Functions
      1. Writing a UDF
      2. Writing a UDAF
  17. 13. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
      2. Implementation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
      2. Avro, REST, and Thrift
    5. Example
      1. Schemas
      2. Loading Data
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at Streamy.com
    7. Praxis
      1. Versions
      2. HDFS
      3. UI
      4. Metrics
      5. Schema Design
      6. Counters
      7. Bulk Load
  18. 14. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
      2. Operations
      3. Implementation
      4. Consistency
      5. Sessions
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
      3. A Lock Service
      4. More Distributed Data Structures and Protocols
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  19. 15. Sqoop
    1. Getting Sqoop
    2. Sqoop Connectors
    3. A Sample Import
      1. Text and Binary File Formats
    4. Generated Code
      1. Additional Serialization Systems
    5. Imports: A Deeper Look
      1. Controlling the Import
      2. Imports and Consistency
      3. Direct-mode Imports
    6. Working with Imported Data
      1. Imported Data and Hive
    7. Importing Large Objects
    8. Performing an Export
    9. Exports: A Deeper Look
      1. Exports and Transactionality
      2. Exports and SequenceFiles
  20. 16. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Hadoop at Facebook
      2. Hypothetical Use Case Studies
      3. Hive
      4. Problems and Future Work
    3. Nutch Search Engine
      1. Data Structures
      2. Selected Examples of Hadoop Data Processing in Nutch
      3. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
      5. MapReduce for Logs
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
    7. Using Pig and Wukong to Explore Billion-edge Network Graphs
      1. Measuring Community
      2. Everybody’s Talkin’ at Me: The Twitter Reply Graph
      3. Symmetric Links
      4. Community Extraction
  21. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudodistributed Mode
      3. Fully Distributed Mode
  22. B. Cloudera’s Distribution Including Apache Hadoop
  23. C. Preparing the NCDC Weather Data
  24. Index
  25. About the Author
  26. Colophon
  27. Copyright

Chapter 13. HBase

Jonathan Gray

Michael Stack

HBasics

HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.

Although there are countless strategies and implementations for database storage and retrieval, most solutions—especially those of the relational variety—are not built with very large scale and distribution in mind. Many vendors offer replication and partitioning solutions to grow the database beyond the confines of a single node, but these add-ons are generally an afterthought and are complicated to install and maintain. They also severely compromise the RDBMS feature set. Joins, complex queries, triggers, views, and foreign-key constraints become prohibitively expensive to run on a scaled RDBMS or do not work at all.

HBase comes at the scaling problem from the opposite direction. It is built from the ground up to scale linearly just by adding nodes. HBase is not relational and does not support SQL, but given the proper problem space, it is able to do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware.

The canonical HBase use case is the webtable, a table of crawled web pages and their attributes (such as language and MIME type) keyed by the web page URL. The webtable is large, with row counts that run into the billions. Batch analytic and parsing MapReduce jobs are continuously run against the ...
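The defining property of the webtable is that it is sparse and column-oriented: each row holds only the cells that were actually written, so a table with many possible columns costs nothing for the columns a given row leaves empty. The toy sketch below (plain Python, not the HBase API; the table class, row keys, and column names are all hypothetical) illustrates that layout — rows keyed by URL, cells addressed by a family:qualifier column name, and missing cells simply absent rather than stored as NULLs.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, column-oriented table like the webtable.

    Rows are keyed by a row key (here, a URL); each row stores only the
    cells that were actually written, so wide tables stay cheap.
    """

    def __init__(self):
        # row key -> {"family:qualifier": value}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        # Absent rows and absent cells both come back as None.
        return self.rows.get(row_key, {}).get(column)


# Populate two rows of a hypothetical webtable.
table = SparseTable()
table.put("com.example.www/index.html", "info:language", "en")
table.put("com.example.www/index.html", "info:mime", "text/html")
table.put("org.example/data.json", "info:mime", "application/json")
# The second row never wrote info:language -- the cell does not exist at all.

print(table.get("com.example.www/index.html", "info:language"))  # en
print(table.get("org.example/data.json", "info:language"))       # None
```

Note that a real webtable would typically use reversed-domain row keys (e.g. com.example.www/...) so that pages from the same site sort adjacently; the sketch above borrows that convention for the row keys only.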
