Cassandra: The Definitive Guide

Book description

What could you do with data if scalability wasn't a problem? With this hands-on guide, you'll learn how Apache Cassandra handles hundreds of terabytes of data while remaining highly available across multiple data centers, capabilities that have attracted Facebook, Twitter, and other data-intensive companies. Cassandra: The Definitive Guide provides the technical details and practical examples you need to assess this database management system and put it to work in a production environment.

Author Eben Hewitt demonstrates the advantages of Cassandra's nonrelational design and pays special attention to data modeling. If you're a developer, DBA, application architect, or manager looking to solve a database scaling issue or future-proof your application, this guide shows you how to harness Cassandra's speed and flexibility.

  • Understand the tenets of Cassandra's column-oriented structure
  • Learn how to write, update, and read Cassandra data
  • Discover how to add or remove nodes from the cluster as your application requires
  • Examine a working application that translates from a relational model to Cassandra's data model
  • Use examples for writing clients in Java, Python, and C#
  • Use the JMX interface to monitor a cluster's usage, memory patterns, and more
  • Tune memory settings, data storage, and caching for better performance

Table of contents

  1. Dedication
  2. A Note Regarding Supplemental Files
  3. Foreword
  4. Preface
    1. Why Apache Cassandra?
    2. Is This Book for You?
    3. What’s in This Book?
    4. Finding Out More
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Enabled
    8. How to Contact Us
    9. Acknowledgments
  5. 1. Introducing Cassandra
    1. What’s Wrong with Relational Databases?
    2. A Quick Review of Relational Databases
      1. RDBMS: The Awesome and the Not-So-Much
        1. Transactions, ACID-ity, and two-phase commit
        2. Schema
        3. Sharding and shared-nothing architecture
        4. Summary
      2. Web Scale
    3. The Cassandra Elevator Pitch
      1. Cassandra in 50 Words or Less
      2. Distributed and Decentralized
      3. Elastic Scalability
      4. High Availability and Fault Tolerance
      5. Tuneable Consistency
      6. Brewer’s CAP Theorem
      7. Row-Oriented
      8. Schema-Free
      9. High Performance
    4. Where Did Cassandra Come From?
    5. Use Cases for Cassandra
      1. Large Deployments
      2. Lots of Writes, Statistics, and Analysis
      3. Geographical Distribution
      4. Evolving Applications
    6. Who Is Using Cassandra?
    7. Summary
  6. 2. Installing Cassandra
    1. Installing the Binary
      1. Extracting the Download
      2. What’s In There?
    2. Building from Source
      1. Additional Build Targets
      2. Building with Maven
    3. Running Cassandra
      1. On Windows
      2. On Linux
      3. Starting the Server
    4. Running the Command-Line Client Interface
    5. Basic CLI Commands
      1. Help
      2. Connecting to a Server
      3. Describing the Environment
      4. Creating a Keyspace and Column Family
      5. Writing and Reading Data
    6. Summary
  7. 3. The Cassandra Data Model
    1. The Relational Data Model
    2. A Simple Introduction
    3. Clusters
    4. Keyspaces
    5. Column Families
      1. Column Family Options
    6. Columns
      1. Wide Rows, Skinny Rows
      2. Column Sorting
    7. Super Columns
      1. Composite Keys
    8. Design Differences Between RDBMS and Cassandra
      1. No Query Language
      2. No Referential Integrity
      3. Secondary Indexes
      4. Sorting Is a Design Decision
      5. Denormalization
    9. Design Patterns
      1. Materialized View
      2. Valueless Column
      3. Aggregate Key
    10. Some Things to Keep in Mind
    11. Summary
  8. 4. Sample Application
    1. Data Design
    2. Hotel App RDBMS Design
    3. Hotel App Cassandra Design
    4. Hotel Application Code
      1. Creating the Database
        1. Loading the schema
      2. Data Structures
      3. Getting a Connection
      4. Prepopulating the Database
      5. The Search Application
    5. Twissandra
    6. Summary
  9. 5. The Cassandra Architecture
    1. System Keyspace
    2. Peer-to-Peer
    3. Gossip and Failure Detection
    4. Anti-Entropy and Read Repair
    5. Memtables, SSTables, and Commit Logs
    6. Hinted Handoff
    7. Compaction
    8. Bloom Filters
    9. Tombstones
    10. Staged Event-Driven Architecture (SEDA)
    11. Managers and Services
      1. Cassandra Daemon
      2. Storage Service
      3. Messaging Service
      4. Hinted Handoff Manager
    12. Summary
  10. 6. Configuring Cassandra
    1. Keyspaces
      1. Creating a Column Family
      2. Transitioning from 0.6 to 0.7
    2. Replicas
    3. Replica Placement Strategies
      1. Simple Strategy
      2. Old Network Topology Strategy
      3. Network Topology Strategy
    4. Replication Factor
      1. Increasing the Replication Factor
    5. Partitioners
      1. Random Partitioner
      2. Order-Preserving Partitioner
      3. Collating Order-Preserving Partitioner
      4. Byte-Ordered Partitioner
    6. Snitches
      1. Simple Snitch
      2. PropertyFileSnitch
    7. Creating a Cluster
      1. Changing the Cluster Name
      2. Adding Nodes to a Cluster
      3. Multiple Seed Nodes
    8. Dynamic Ring Participation
    9. Security
      1. Using SimpleAuthenticator
      2. Programmatic Authentication
      3. Using MD5 Encryption
      4. Providing Your Own Authentication
    10. Miscellaneous Settings
    11. Additional Tools
      1. Viewing Keys
      2. Importing Previous Configurations
    12. Summary
  11. 7. Reading and Writing Data
    1. Query Differences Between RDBMS and Cassandra
      1. No Update Query
      2. Record-Level Atomicity on Writes
      3. No Server-Side Transaction Support
      4. No Duplicate Keys
    2. Basic Write Properties
    3. Consistency Levels
    4. Basic Read Properties
    5. The API
      1. Ranges and Slices
    6. Setup and Inserting Data
    7. Using a Simple Get
    8. Seeding Some Values
    9. Slice Predicate
      1. Getting Particular Column Names with Get Slice
      2. Getting a Set of Columns with Slice Range
        1. Counts
        2. Reversed
      3. Getting All Columns in a Row
    10. Get Range Slices
    11. Multiget Slice
    12. Deleting
    13. Batch Mutates
      1. Batch Deletes
      2. Range Ghosts
    14. Programmatically Defining Keyspaces and Column Families
    15. Summary
  12. 8. Clients
    1. Basic Client API
    2. Thrift
      1. Thrift Support for Java
      2. Exceptions
      3. Thrift Summary
    3. Avro
      1. Avro Ant Targets
      2. Avro Specification
      3. Avro Summary
    4. A Bit of Git
    5. Connecting Client Nodes
      1. Client List
      2. Round-Robin DNS
      3. Load Balancer
    6. Cassandra Web Console
    7. Hector (Java)
      1. Features
      2. The Hector API
    8. HectorSharp (C#)
    9. Chirper
    10. Chiton (Python)
    11. Pelops (Java)
    12. Kundera (Java ORM)
    13. Fauna (Ruby)
    14. Summary
  13. 9. Monitoring
    1. Logging
      1. Tailing
      2. General Tips
        1. Following along
        2. Warning signs
    2. Overview of JMX and MBeans
      1. MBeans
      2. Integrating JMX
    3. Interacting with Cassandra via JMX
    4. Cassandra’s MBeans
      1. org.apache.cassandra.concurrent
      2. org.apache.cassandra.db
      3. org.apache.cassandra.gms
      4. org.apache.cassandra.service
        1. StorageService
        2. StreamingService
    5. Custom Cassandra MBeans
    6. Runtime Analysis Tools
      1. Heap Analysis with JMX and JHAT
      2. Detecting Thread Problems
    7. Health Check
    8. Summary
  14. 10. Maintenance
    1. Getting Ring Information
      1. Info
      2. Ring
        1. Range Tokens
    2. Getting Statistics
      1. Using cfstats
      2. Using tpstats
    3. Basic Maintenance
      1. Repair
      2. Flush
      3. Cleanup
    4. Snapshots
      1. Taking a Snapshot
      2. Clearing a Snapshot
    5. Load-Balancing the Cluster
      1. loadbalance and streams
    6. Decommissioning a Node
    7. Updating Nodes
      1. Removing Tokens
      2. Compaction Threshold
      3. Changing Column Families in a Working Cluster
    8. Summary
  15. 11. Performance Tuning
    1. Data Storage
    2. Reply Timeout
    3. Commit Logs
    4. Memtables
    5. Concurrency
    6. Caching
    7. Buffer Sizes
    8. Using the Python Stress Test
      1. Generating the Python Thrift Interfaces
        1. Getting Thrift
      2. Running the Python Stress Test
    9. Startup and JVM Settings
      1. Tuning the JVM
    10. Summary
  16. 12. Integrating Hadoop
    1. What Is Hadoop?
    2. Working with MapReduce
      1. Cassandra Hadoop Source Package
    3. Running the Word Count Example
      1. Outputting Data to Cassandra
      2. Hadoop Streaming
    4. Tools Above MapReduce
      1. Pig
      2. Hive
    5. Cluster Configuration
    6. Use Cases
      1. Raptr.com: Keith Thornhill
      2. Imagini: Dave Gardner
    7. Summary
  17. A. The Nonrelational Landscape
    1. Nonrelational Databases
    2. Object Databases
    3. XML Databases
      1. SoftwareAG Tamino
      2. eXist
      3. Oracle Berkeley XML DB
      4. MarkLogic Server
      5. Apache Xindice
      6. Summary
    4. Document-Oriented Databases
      1. IBM Lotus
      2. Apache CouchDB
      3. MongoDB
      4. Riak
    5. Graph Databases
      1. FlockDB
      2. Neo4J
    6. Key-Value Stores and Distributed Hashtables
      1. Amazon Dynamo
      2. Project Voldemort
      3. Redis
    7. Columnar Databases
      1. Google Bigtable
      2. HBase
      3. Hypertable
      4. Polyglot Persistence
    8. Summary
  18. Glossary
  19. Index
  20. About the Author
  21. Colophon
  22. Copyright

Product information

  • Title: Cassandra: The Definitive Guide
  • Author(s): Eben Hewitt
  • Release date: November 2010
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449390419