HBase: The Definitive Guide

By Lars George. Published by O'Reilly Media, Inc.
Chapter 8. Architecture

It is quite useful for advanced users (or those who are just plain adventurous) to fully comprehend how a system of their choice works behind the scenes. This chapter explains the various moving parts of HBase and how they work together.

Seek Versus Transfer

Before we look into the architecture itself, we will first address a more fundamental difference between typical RDBMS storage structures and the alternatives. Specifically, we will look briefly at B-trees, or rather B+ trees,[86] as they are commonly used in relational storage engines, and at Log-Structured Merge-Trees,[87] which (to some extent) form the basis for Bigtable’s storage architecture, as discussed in Building Blocks.

Note

Note that RDBMSes do not use B-tree-type structures exclusively, nor do all NoSQL solutions use different architectures. You will find a colorful variety of mix-and-match technologies, but with one common objective: use the best strategy for the problem at hand.

B+ Trees

B+ trees have some specific features that allow for efficient insertion, lookup, and deletion of records that are identified by keys. They represent dynamic, multilevel indexes with lower and upper bounds on the number of keys in each segment (also called a page). Using these segments, they achieve a much higher fanout than binary trees, which means far fewer I/O operations are needed to find a specific key.
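
To make the effect of fanout more concrete, the following is a minimal Java sketch rather than anything taken from HBase or a real storage engine. The class name FanoutSketch, the method levels, and the sample numbers (one million keys, a fanout of 100) are all assumptions chosen for illustration. It also assumes, as a simplification, that every level visited on the way from the root to a leaf costs one I/O operation.

```java
// Illustrative sketch only: names and numbers are assumptions, not HBase code.
final class FanoutSketch {

    // Returns the number of branching levels needed so that a tree with the
    // given fanout can address at least `keys` distinct keys, that is, the
    // smallest n with fanout^n >= keys.
    static int levels(long keys, int fanout) {
        if (keys < 1 || fanout < 2) {
            throw new IllegalArgumentException("need keys >= 1 and fanout >= 2");
        }
        int levels = 0;
        long reachable = 1; // keys addressable with the levels counted so far
        while (reachable < keys) {
            // If the next multiplication would overflow, one more level
            // certainly covers all keys.
            if (reachable > Long.MAX_VALUE / fanout) {
                return levels + 1;
            }
            reachable *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long keys = 1_000_000L; // assumed number of keys
        System.out.println("binary tree (fanout 2): " + levels(keys, 2) + " levels");
        System.out.println("paged tree (fanout 100): " + levels(keys, 100) + " levels");
    }
}
```

Under these assumptions, a binary tree needs 20 levels to address one million keys, while a tree whose pages each hold 100 keys needs only 3. Real B+ trees keep pages only partially full and cache the upper levels in memory, so actual I/O counts will differ.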

In addition, they also enable you to do range scans very efficiently, ...