Cassandra: The Definitive Guide

By Eben Hewitt. Published by O'Reilly Media, Inc.

Chapter 2. Installing Cassandra

For those among us who like instant gratification, we’ll start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as we walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and understand Cassandra in its larger context.

Installing the Binary

Cassandra is available for download from the Web at http://cassandra.apache.org. Just click the link on the home page to download the latest release version as a gzipped tarball. The prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download is around 10MB.

Extracting the Download

The simplest way to get started is to download the prebuilt binary. Because it is a gzipped tarball rather than a ZIP file, you can unpack it with any utility that understands the gzip and tar formats. On Linux, the gzip and tar utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or 7-Zip, which is free and can be downloaded from http://www.7-zip.org.

Open your extracting program. You might have to decompress the gzip layer and unpack the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
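
If you would rather script the extraction than use a graphical tool, the following is a minimal sketch that uses only Python's standard tarfile module, so it should behave the same on Windows and Linux. The function name extract_tarball and its parameters are hypothetical and chosen for illustration; they are not part of Cassandra or its distribution.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir="."):
    """Unpack a gzipped tarball and return its top-level entry names.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # The "r:gz" mode decompresses the gzip layer and reads the tar
    # archive in a single step, so no separate gunzip pass is needed.
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        archive.extractall(dest)

    # Collect the first path component of each member; for a typical
    # release tarball this is a single folder.
    top_level = set()
    for member in members:
        parts = Path(member.name).parts
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)
```

Calling extract_tarball with the path to the downloaded file (apache-cassandra-x.x.x-bin.tar.gz, with x.x.x replaced by the version you actually downloaded) should return a list containing the name of the extracted folder. Only extract archives obtained from a source you trust.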

What’s In There?

Once you decompress the tarball, you’ll see that the Cassandra binary ...
