You are previewing Cassandra: The Definitive Guide.

Cassandra: The Definitive Guide

Cover of Cassandra: The Definitive Guide by Eben Hewitt Published by O'Reilly Media, Inc.
  1. Cassandra: The Definitive Guide
  2. Dedication
  3. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  4. A Note Regarding Supplemental Files
  5. Foreword
  6. Preface
    1. Why Apache Cassandra?
    2. Is This Book for You?
    3. What’s in This Book?
    4. Finding Out More
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Enabled
    8. How to Contact Us
    9. Acknowledgments
  7. 1. Introducing Cassandra
    1. What’s Wrong with Relational Databases?
    2. A Quick Review of Relational Databases
      1. RDBMS: The Awesome and the Not-So-Much
      2. Web Scale
    3. The Cassandra Elevator Pitch
      1. Cassandra in 50 Words or Less
      2. Distributed and Decentralized
      3. Elastic Scalability
      4. High Availability and Fault Tolerance
      5. Tuneable Consistency
      6. Brewer’s CAP Theorem
      7. Row-Oriented
      8. Schema-Free
      9. High Performance
    4. Where Did Cassandra Come From?
    5. Use Cases for Cassandra
      1. Large Deployments
      2. Lots of Writes, Statistics, and Analysis
      3. Geographical Distribution
      4. Evolving Applications
    6. Who Is Using Cassandra?
    7. Summary
  8. 2. Installing Cassandra
    1. Installing the Binary
      1. Extracting the Download
      2. What’s In There?
    2. Building from Source
      1. Additional Build Targets
      2. Building with Maven
    3. Running Cassandra
      1. On Windows
      2. On Linux
      3. Starting the Server
    4. Running the Command-Line Client Interface
    5. Basic CLI Commands
      1. Help
      2. Connecting to a Server
      3. Describing the Environment
      4. Creating a Keyspace and Column Family
      5. Writing and Reading Data
    6. Summary
  9. 3. The Cassandra Data Model
    1. The Relational Data Model
    2. A Simple Introduction
    3. Clusters
    4. Keyspaces
    5. Column Families
      1. Column Family Options
    6. Columns
      1. Wide Rows, Skinny Rows
      2. Column Sorting
    7. Super Columns
      1. Composite Keys
    8. Design Differences Between RDBMS and Cassandra
      1. No Query Language
      2. No Referential Integrity
      3. Secondary Indexes
      4. Sorting Is a Design Decision
      5. Denormalization
    9. Design Patterns
      1. Materialized View
      2. Valueless Column
      3. Aggregate Key
    10. Some Things to Keep in Mind
    11. Summary
  10. 4. Sample Application
    1. Data Design
    2. Hotel App RDBMS Design
    3. Hotel App Cassandra Design
    4. Hotel Application Code
      1. Creating the Database
      2. Data Structures
      3. Getting a Connection
      4. Prepopulating the Database
      5. The Search Application
    5. Twissandra
    6. Summary
  11. 5. The Cassandra Architecture
    1. System Keyspace
    2. Peer-to-Peer
    3. Gossip and Failure Detection
    4. Anti-Entropy and Read Repair
    5. Memtables, SSTables, and Commit Logs
    6. Hinted Handoff
    7. Compaction
    8. Bloom Filters
    9. Tombstones
    10. Staged Event-Driven Architecture (SEDA)
    11. Managers and Services
      1. Cassandra Daemon
      2. Storage Service
      3. Messaging Service
      4. Hinted Handoff Manager
    12. Summary
  12. 6. Configuring Cassandra
    1. Keyspaces
      1. Creating a Column Family
      2. Transitioning from 0.6 to 0.7
    2. Replicas
    3. Replica Placement Strategies
      1. Simple Strategy
      2. Old Network Topology Strategy
      3. Network Topology Strategy
    4. Replication Factor
      1. Increasing the Replication Factor
    5. Partitioners
      1. Random Partitioner
      2. Order-Preserving Partitioner
      3. Collating Order-Preserving Partitioner
      4. Byte-Ordered Partitioner
    6. Snitches
      1. Simple Snitch
      2. PropertyFileSnitch
    7. Creating a Cluster
      1. Changing the Cluster Name
      2. Adding Nodes to a Cluster
      3. Multiple Seed Nodes
    8. Dynamic Ring Participation
    9. Security
      1. Using SimpleAuthenticator
      2. Programmatic Authentication
      3. Using MD5 Encryption
      4. Providing Your Own Authentication
    10. Miscellaneous Settings
    11. Additional Tools
      1. Viewing Keys
      2. Importing Previous Configurations
    12. Summary
  13. 7. Reading and Writing Data
    1. Query Differences Between RDBMS and Cassandra
      1. No Update Query
      2. Record-Level Atomicity on Writes
      3. No Server-Side Transaction Support
      4. No Duplicate Keys
    2. Basic Write Properties
    3. Consistency Levels
    4. Basic Read Properties
    5. The API
      1. Ranges and Slices
    6. Setup and Inserting Data
    7. Using a Simple Get
    8. Seeding Some Values
    9. Slice Predicate
      1. Getting Particular Column Names with Get Slice
      2. Getting a Set of Columns with Slice Range
      3. Getting All Columns in a Row
    10. Get Range Slices
    11. Multiget Slice
    12. Deleting
    13. Batch Mutates
      1. Batch Deletes
      2. Range Ghosts
    14. Programmatically Defining Keyspaces and Column Families
    15. Summary
  14. 8. Clients
    1. Basic Client API
    2. Thrift
      1. Thrift Support for Java
      2. Exceptions
      3. Thrift Summary
    3. Avro
      1. Avro Ant Targets
      2. Avro Specification
      3. Avro Summary
    4. A Bit of Git
    5. Connecting Client Nodes
      1. Client List
      2. Round-Robin DNS
      3. Load Balancer
    6. Cassandra Web Console
    7. Hector (Java)
      1. Features
      2. The Hector API
    8. HectorSharp (C#)
    9. Chirper
    10. Chiton (Python)
    11. Pelops (Java)
    12. Kundera (Java ORM)
    13. Fauna (Ruby)
    14. Summary
  15. 9. Monitoring
    1. Logging
      1. Tailing
      2. General Tips
    2. Overview of JMX and MBeans
      1. MBeans
      2. Integrating JMX
    3. Interacting with Cassandra via JMX
    4. Cassandra’s MBeans
      1. org.apache.cassandra.concurrent
      2. org.apache.cassandra.db
      3. org.apache.cassandra.gms
      4. org.apache.cassandra.service
    5. Custom Cassandra MBeans
    6. Runtime Analysis Tools
      1. Heap Analysis with JMX and JHAT
      2. Detecting Thread Problems
    7. Health Check
    8. Summary
  16. 10. Maintenance
    1. Getting Ring Information
      1. Info
      2. Ring
    2. Getting Statistics
      1. Using cfstats
      2. Using tpstats
    3. Basic Maintenance
      1. Repair
      2. Flush
      3. Cleanup
    4. Snapshots
      1. Taking a Snapshot
      2. Clearing a Snapshot
    5. Load-Balancing the Cluster
      1. loadbalance and streams
    6. Decommissioning a Node
    7. Updating Nodes
      1. Removing Tokens
      2. Compaction Threshold
      3. Changing Column Families in a Working Cluster
    8. Summary
  17. 11. Performance Tuning
    1. Data Storage
    2. Reply Timeout
    3. Commit Logs
    4. Memtables
    5. Concurrency
    6. Caching
    7. Buffer Sizes
    8. Using the Python Stress Test
      1. Generating the Python Thrift Interfaces
      2. Running the Python Stress Test
    9. Startup and JVM Settings
      1. Tuning the JVM
    10. Summary
  18. 12. Integrating Hadoop
    1. What Is Hadoop?
    2. Working with MapReduce
      1. Cassandra Hadoop Source Package
    3. Running the Word Count Example
      1. Outputting Data to Cassandra
      2. Hadoop Streaming
    4. Tools Above MapReduce
      1. Pig
      2. Hive
    5. Cluster Configuration
    6. Use Cases
      1. Raptr.com: Keith Thornhill
      2. Imagini: Dave Gardner
    7. Summary
  19. A. The Nonrelational Landscape
    1. Nonrelational Databases
    2. Object Databases
    3. XML Databases
      1. SoftwareAG Tamino
      2. eXist
      3. Oracle Berkeley XML DB
      4. MarkLogic Server
      5. Apache Xindice
      6. Summary
    4. Document-Oriented Databases
      1. IBM Lotus
      2. Apache CouchDB
      3. MongoDB
      4. Riak
    5. Graph Databases
      1. FlockDB
      2. Neo4J
    6. Key-Value Stores and Distributed Hashtables
      1. Amazon Dynamo
      2. Project Voldemort
      3. Redis
    7. Columnar Databases
      1. Google Bigtable
      2. HBase
      3. Hypertable
      4. Polyglot Persistence
    8. Summary
  20. Glossary
  21. Index
  22. About the Author
  23. Colophon
  24. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  25. Copyright
O'Reilly logo

Chapter 3. The Cassandra Data Model

In this chapter, we’ll gain an understanding of Cassandra’s design goals, data model, and some general behavior characteristics.

For developers and administrators coming from the relational world, the Cassandra data model can be very difficult to understand initially. Some terms, such as “keyspace,” are completely new, and some, such as “column,” exist in both worlds but have different meanings. It can also be confusing if you’re trying to sort through the Dynamo or Bigtable source papers, because although Cassandra may be based on them, it has its own model.

So in this chapter we start from common ground and then work through the unfamiliar terms. Then, we do some actual modeling to help understand how to bridge the gap between the relational world and the world of Cassandra.

The Relational Data Model

In a relational database, we have the database itself, which is the outermost container that might correspond to a single application. The database contains tables. Tables have names and contain one or more columns, which also have names. When we add data to a table, we specify a value for every column defined; if we don’t have a value for a particular column, we use null. This new entry adds a row to the table, which we can later read if we know the row’s unique identifier (primary key), or by using a SQL statement that expresses some criteria that row might meet. If we want to update values in the table, we can update all of the rows or just some ...

The best content for your career. Discover unlimited learning on demand for around $1/day.