Accumulo

Book Description

Get up to speed on Apache Accumulo, the flexible, high-performance key/value store created by the National Security Agency (NSA) and based on Google’s Bigtable data storage system. Written by former NSA team members, this comprehensive tutorial and reference covers Accumulo architecture, application development, table design, and cell-level security.

With clear information on system administration, performance tuning, and best practices, this book is ideal for developers seeking to write Accumulo applications, administrators charged with installing and maintaining Accumulo, and other professionals interested in what Accumulo has to offer. You will find everything you need to use this system fully.

Table of Contents

  1. Foreword
  2. Preface
    1. Goals and Audience
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  3. 1. Architecture and Data Model
    1. Recent Trends
    2. The Role of Databases
    3. Distributed Applications
    4. Fast Random Access
      1. Accessing Sorted Versus Unsorted Data
    5. Versions
    6. History
    7. Data Model
      1. Rows and Columns
      2. Data Modification and Timestamps
    8. Advanced Data Model Components
      1. Column Families
      2. Column Visibility
      3. Full Data Model
    9. Tables
    10. Introduction to the Client API
      1. Approach to Rows
      2. Exploiting Sort Order
    11. Architecture Overview
      1. ZooKeeper
      2. Hadoop
      3. Accumulo
      4. A Typical Cluster
    12. Additional Features
      1. Automatic Data Partitioning
      2. High Consistency
      3. Automatic Load Balancing
      4. Massive Scalability
      5. Failure Tolerance and Automatic Recovery
      6. Support for Analysis: Iterators
      7. Support for Analysis: MapReduce Integration
      8. Data Lifecycle Management
      9. Compression
      10. Robust Timestamps
    13. Accumulo and Other Data Management Systems
      1. Comparisons to Relational Databases
      2. Comparisons to Other NoSQL Databases
    14. Use Cases Suited for Accumulo
      1. A New Kind of Flexible Analytical Warehouse
      2. Building the Next Gmail
      3. Massive Graph or Machine-Learning Problems
      4. Relieving Relational Databases
      5. Massive Search Applications
      6. Applications with a Long History of Versioned Data
  4. 2. Quick Start
    1. Demo of the Shell
      1. The help Command
      2. Creating a Table and Inserting Some Data
      3. Scanning for Data
      4. Using Authorizations
      5. Using a Simple Iterator
    2. Demo of Java Code
      1. Creating a Table and Inserting Some Data
      2. Scanning for Data
      3. Using Authorizations
      4. Using a Simple Iterator
    3. A More Complete Installation
    4. Other Important Resources
    5. One Last Example with a Unit Test
    6. Additional Resources
  5. 3. Basic API
    1. Development Environment
      1. Obtaining the Client Library
      2. Using Maven
      3. Configuring the Classpath
    2. Introduction to the Example Application: Wikipedia Pages
      1. Wikipedia Data
      2. Data Modeling
      3. Obtaining Example Code
      4. Downloading Sample Wikipedia Pages
      5. Downloading All English Wikipedia Articles
    3. Connect
    4. Insert
      1. Committing Mutations
      2. Handling Errors
      3. Insert Example
      4. Using Lexicoders
      5. Writing to Multiple Tables
    5. Lookups and Scanning
      1. Lookup Example
      2. Crafting Ranges
      3. Grouping by Rows
      4. Reusing Scanners
      5. Isolated Row Views
      6. Tuning Scanners
    6. Batch Scanning
    7. Update: Overwrite
      1. Overwrite Example
      2. Allowing Multiple Versions
    8. Update: Appending or Incrementing
    9. Update: Read-Modify-Write and Conditional Mutations
      1. Conditional Mutation API
      2. Conditional Mutation Batch API
      3. Conditional Mutation Example
    10. Delete
      1. Deleting and Reinserting
      2. Removing Deleted Data from Disk
      3. Batch Deleter
    11. Testing
      1. MockAccumulo
      2. MiniAccumuloCluster
  6. 4. Table API
    1. Basic Table Operations
      1. Creating Tables
      2. Renaming
      3. Deleting Tables
      4. Deleting Ranges of Rows
      5. Deleting Entries Returned from a Scan
      6. Configuring Table Properties
      7. Locality Groups
      8. Bloom Filters
      9. Caching
      10. Tablet Splits
      11. Compacting
      12. Additional Properties
      13. Online Status
      14. Cloning
      15. Importing and Exporting Tables
      16. Additional Administrative Methods
    2. Table Namespaces
      1. Creating
      2. Renaming
      3. Setting Namespace Properties
      4. Deleting
      5. Configuring Iterators
      6. Configuring Constraints
      7. Testing Class Loading for a Namespace
    3. Instance Operations
      1. Setting Properties
      2. Cluster Information
      3. Precedence of Properties
  7. 5. Security API
    1. Authentication
    2. Permissions
      1. System Permissions
      2. Namespace Permissions
      3. Table Permissions
    3. Authorizations
      1. Column Visibilities
      2. Limiting Authorizations Written
      3. An Example of Using Authorizations
      4. Using a Default Visibility
      5. Making Authorizations Work
    4. Auditing Security Operations
    5. Custom Authentication, Permissions, and Authorization
      1. Custom Authentication Example
    6. Other Security Considerations
      1. Using an Application Account for Multiple Users
      2. Network
      3. Disk Encryption
  8. 6. Server-Side Functionality and External Clients
    1. Constraints
      1. Constraint Configuration API
      2. Constraint Configuration Example
      3. Creating Custom Constraints
      4. Custom Constraint Example
    2. Iterators
      1. Iterator Configuration API
      2. VersioningIterator
      3. Iterator Configuration Example
      4. Adding Iterators by Setting Properties
      5. Filtering Iterators
      6. Combiners
      7. Other Built-in Iterators
    3. Thrift Proxy
      1. Starting a Proxy
      2. Python Example
      3. Generating Client Code
    4. Language-Specific Clients
    5. Integration with Other Tools
      1. Apache Hive
      2. Apache Pig
      3. Apache Kafka
    6. Integration with Analytical Tools
  9. 7. MapReduce API
    1. Formats
    2. Writing Worker Classes
    3. MapReduce Example
    4. MapReduce over Underlying RFiles
      1. Example of Running a MapReduce Job over RFiles
    5. Delivering Rows to Map Workers
    6. Ingesters and Combiners as MapReduce Computations
    7. MapReduce and Bulk Import
      1. Bulk Ingest to Avoid Duplicates
  10. 8. Table Design
    1. Single-Table Designs
      1. Implementing Paging
    2. Secondary Indexing
      1. Index Partitioned by Term
      2. Querying a Term-Partitioned Index
      3. Maintaining Consistency Across Tables
      4. Index Partitioned by Document
      5. Querying a Document-Partitioned Index
      6. Indexing Data Types
    3. Full-Text Search
      1. wikipediaMetadata
      2. wikipediaIndex
      3. wikipedia
      4. wikipediaReverseIndex
      5. Ingesting WikiSearch Data
      6. Querying the WikiSearch Data
    4. Designing Row IDs
      1. Lexicoders
      2. Composite Row IDs
      3. Key Size
      4. Avoiding Hotspots
      5. Designing Row IDs for Consistent Updates
    5. Designing Values
      1. Storing Files and Large Values
      2. Human-Readable Versus Binary Values and Formatters
    6. Designing Authorizations
    7. Designing Column Visibilities
  11. 9. Advanced Table Designs
    1. Time-Ordered Data
    2. Graphs
      1. Building an Example Graph: Twitter
      2. Traversing Graph Tables
      3. Traversing the Example Twitter Graph
    3. Semantic Triples
      1. Semantic Triples Example
    4. Spatial Data
      1. Open Source Projects
      2. Space-Filling Curves
    5. Multidimensional Data
    6. D4M and Matlab
      1. D4M Example
    7. Machine Learning
      1. Storing Feature Vectors
      2. A Machine-Learning Example
    8. Approximating Relational and SQL Database Properties
      1. Schema Constraints
      2. SQL Operations
  12. 10. Internals
    1. Tablet Server
      1. Write Path
      2. Read Path
      3. Resource Manager
      4. Write-Ahead Logs
      5. File Formats
      6. Caching
    2. Master
      1. FATE
      2. Load Balancer
    3. Garbage Collector
    4. Monitor
    5. Tracer
    6. Client
      1. Locating Keys
    7. Metadata Table
    8. Uses of ZooKeeper
    9. Accumulo and the CAP Theorem
  13. 11. Administration: Setup
    1. Preinstallation
      1. Operating Systems
      2. Kernel Tweaks
      3. Native Libraries
      4. User Accounts
      5. Linux Filesystem
      6. System Services
      7. Software Dependencies
    2. Installation
      1. Tarball Distribution Install
      2. Installing on Cloudera’s CDH
      3. Installing on Hortonworks’ HDP
      4. Installing on MapR
      5. Running via Amazon Web Services
      6. Building from Source
    3. Configuration
      1. File Permissions
      2. Server Configuration Files
      3. Client Configuration
      4. Deploying JARs
      5. Setting Up Automatic Failover
      6. Initialization
    4. Running Very Large-Scale Clusters
      1. Networking
      2. Limits
      3. Metadata Table
      4. Tablet Sizing
      5. File Sizing
      6. Using Multiple HDFS Volumes
    5. Security
      1. Column Visibilities and Accumulo Clients
      2. Supporting Software Security
      3. Network Security
      4. Encryption of Data at Rest
      5. Kerberized Hadoop
      6. Application Permissions
  14. 12. Administration: Running
    1. Starting Accumulo
      1. Via the start-all.sh Script
      2. Via init.d Scripts
    2. Stopping Accumulo
      1. Via the stop-all.sh Script
      2. Via init.d Scripts
      3. Stopping Individual Processes
    3. Starting After a Crash
    4. Monitoring
      1. Monitor Web Service
      2. JMX Metrics
      3. Logging
      4. Tracing
    5. Cluster Changes
      1. Adding New Worker Nodes
      2. Removing Worker Nodes
      3. Adding New Control Nodes
      4. Removing Control Nodes
    6. Table Operations
      1. Changing Settings
      2. Changing Online Status
      3. Cloning
      4. Import, Export, and Backups
    7. Data Lifecycle
      1. Versioning
      2. Data Age-off
      3. Compactions
      4. Merging Tablets
      5. Garbage Collection
    8. Failure Recovery
      1. Typical Failures
      2. More-Serious Failures
      3. Tips for Restoring a Cluster
      4. Troubleshooting
  15. 13. Performance
    1. Understanding Read Performance
    2. Understanding Write Performance
      1. BatchWriters
      2. Bulk Loading
    3. Hardware Selection
      1. Storage Devices
      2. Networking
      3. Virtualization
      4. Running in a Public Cloud Environment
    4. Cluster Sizing
      1. Modeling Required Write Performance
      2. Cluster Planning Example
    5. Analyzing Performance
      1. Using Tracing
      2. Using the Monitor
      3. Using Local Logs
    6. Tablet Server Tuning
      1. External Settings
      2. Memory Settings
      3. Write-Ahead Log Settings
      4. Resource Settings
      5. Timeouts
      6. Scaling Vertically
    7. Cluster Tuning
      1. Splitting Tables
      2. Balancing Tablets
      3. Balancing Reads and Writes
      4. Data Locality
      5. Sharing ZooKeeper
  16. A. Shell Commands Quick Reference
    1. Debugging
    2. Exiting
    3. Help
    4. Iterator
    5. Permissions Administration
    6. Shell Execution
    7. Shell State
    8. Table Administration
    9. Table Control
    10. User Administration
    11. Writing, Reading, and Removing Data
  17. B. Metadata Table
    1. Row ID
    2. File Column Family
    3. Scan Column Family
    4. future, last, and loc Column Families
    5. log Column Family
    6. srv Column Family
    7. ~tab:~pr Column
    8. Other Columns
  18. C. Data Stored in ZooKeeper
    1. masters, tservers, gc, monitor, and tracers Nodes
    2. problems/problem_info Nodes
    3. root_tablet Node
    4. tables/table_id Nodes
    5. config/system_property_name Node
    6. users/username Nodes
    7. Other Nodes
  19. Index