Cassandra 3.x High Availability - Second Edition

Book Description

Achieve scalability and high availability without compromising on performance

About This Book

  • See how to get 100 percent uptime with your Cassandra applications using this easy-to-follow guide

  • Learn how to avoid common and not-so-common mistakes while working with Cassandra using this highly practical guide

  • Get familiar with the intricacies of working with Cassandra for high availability in your work environment with this go-to guide

Who This Book Is For

If you are a developer or DevOps engineer who has basic familiarity with Cassandra and you want to become an expert at creating highly available, fault-tolerant systems using Cassandra, this book is for you.

What You Will Learn

  • Understand how the core architecture of Cassandra enables highly available applications

  • Use replication and tunable consistency levels to balance consistency, availability, and performance

  • Set up multiple data centers to enable failover, load balancing, and geographic distribution

  • Add capacity to your cluster with zero downtime

  • Take advantage of high availability features in the native driver

  • Create data models that scale well and maximize availability

  • Understand common anti-patterns so you can avoid them

  • Keep your system working well even during failure scenarios

In Detail

Apache Cassandra is a massively scalable, peer-to-peer database designed for 100 percent uptime, with deployments in the tens of thousands of nodes, all supporting petabytes of data. This book offers practical insight into building highly available, real-world applications using Apache Cassandra.

The book starts with the fundamentals, helping you to understand how Apache Cassandra's architecture allows it to achieve 100 percent uptime when other systems struggle to do so. You'll get an excellent understanding of data distribution, replication, and Cassandra's highly tunable consistency model. Then we take an in-depth look at Cassandra's robust support for multiple data centers, and you'll see how to scale out a cluster. Next, the book explores the domain of application design, with chapters discussing the native driver and data modeling. Lastly, you'll find out how to steer clear of common anti-patterns and take advantage of Cassandra's ability to fail gracefully.

Style and approach

This practical guide takes you from initial design through to building highly available systems with Cassandra. Through a systematic, step-by-step approach, you will learn the different aspects of building highly available Cassandra applications, with the help of easy-to-follow examples, tips, and tricks.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register with Packt to receive the code files.

Table of Contents

    1. Cassandra 3.x High Availability
      1. Cassandra 3.x High Availability - Second Edition
      2. Credits
      3. About the Author
      4. About the Reviewer
        1. eBooks, discount offers, and more
          1. Why subscribe?
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      7. 1. Cassandra's Approach to High Availability
        1. Introducing the ACID properties
        2. Monolithic simplicity
        3. Scaling consistency - the master-slave model
          1. Using sharding to scale writes
          2. Handling the death of the leader
        4. Breaking with tradition - Cassandra's alternative
        5. Cassandra's peer-to-peer approach
          1. Hashing to the rescue
          2. Replication across the cluster
            1. Replication across data centers
          3. The consistency continuum
            1. The CAP theorem
        6. Summary
      8. 2. Data Distribution
        1. Hash table fundamentals
          1. Distributed hash tables
        2. Consistent hashing
          1. How it works
        3. Token assignment
          1. Manually assigned tokens
          2. Vnodes
            1. How vnodes improve availability
              1. Adding and removing nodes
              2. Node rebuild
              3. Heterogeneous nodes
        4. Partitioners
          1. Hotspots
            1. A time-series example
        5. Summary
      9. 3. Replication
        1. The replication factor
          1. Replication strategies
            1. SimpleStrategy
            2. NetworkTopologyStrategy
        2. Snitches
          1. Maintaining the replication factor when a node fails
        3. Consistency conflicts
          1. Consistency levels
          2. Repairing data
        4. Balancing the replication factor with consistency
        5. Summary
      10. 4. Data Centers
        1. Use cases for multiple data centers
          1. Live backup
          2. Failover
          3. Load balancing
          4. Geographic distribution
          5. Online analysis
            1. Analysis using Hadoop
            2. Analysis using Spark
        2. Data center setup
          1. RackInferringSnitch
          2. PropertyFileSnitch
          3. GossipingPropertyFileSnitch
          4. Cloud snitches
        3. Replication across data centers
          1. Setting replication factors
          2. Consistency in a multiple data center environment
            1. Anatomy of a replicated write
            2. Achieving stronger consistency between data centers
        4. Summary
      11. 5. Scaling Out
        1. Choosing the right hardware configuration
        2. Scaling out versus scaling up
        3. Growing your cluster
          1. Adding nodes without vnodes
          2. Adding nodes with vnodes
          3. Adding a data center
        4. How to scale up
          1. Upgrading in place
          2. Scaling up using data center replication
        5. Removing nodes
          1. Removing nodes within a data center
          2. Decommissioning a data center
        6. Other data migration scenarios
        7. Snitch changes
        8. Summary
      12. 6. High Availability Features in the Native Java Client
        1. Thrift versus the native protocol
        2. Setting up the environment
        3. Connecting to the cluster
        4. Executing statements
          1. Prepared statements
          2. Batched statements
            1. Caution with batches
        5. Handling asynchronous requests
          1. Running queries in parallel
        6. Load balancing
          1. Failing over to a remote data center
          2. Downgrading consistency level
            1. Defining your own retry policy
          3. Token awareness
        7. Tying it all together
          1. Falling back to QUORUM
        8. Summary
      13. 7. Modeling for Availability
        1. How Cassandra stores data
          1. Implications of log-structured storage
        2. Understanding compaction
          1. Size-tiered compaction
          2. Leveled compaction
          3. Time-window compaction
        3. CQL under the hood
          1. Single primary key
          2. Compound keys
            1. Partition keys
            2. Clustering columns
            3. Composite partition keys
          3. The importance of the storage model
        4. Understanding queries
          1. Query by key
          2. Range queries
          3. Embracing denormalization
        5. Denormalizing using collections
          1. Sets
          2. Lists
          3. Maps
        6. Denormalizing with materialized views
        7. Working with time series data
          1. Designing for immutability
          2. Modeling sensor data
            1. The queries
            2. Time-based ordering
              1. Using a sentinel value
              2. Satisfying our queries
              3. When time is all that matters
        8. Working with geospatial data
        9. Summary
      14. 8. Anti-Patterns
        1. Multi-key queries
        2. Secondary indices
          1. Secondary indices under the hood
          2. Improvements with SASI
        3. Distributed joins
        4. Deleting data
          1. Garbage collection
          2. Resurrecting the dead
          3. The problem with tombstones
          4. Expiring columns
            1. TTL anti-patterns
          5. When null does not mean empty
          6. Cassandra is not a queue
        5. Unbounded row growth
        6. Summary
      15. 9. Failing Gracefully
        1. Knowledge is power
          1. Monitoring via JMX
          2. Using OpsCenter
          3. Choosing a management toolset
        2. Logging
          1. Cassandra logs
          2. Garbage collector logs
        3. Monitoring node metrics
          1. Thread pools
          2. Table statistics
          3. Finding latency outliers
          4. Communication metrics
        4. When a node goes down
          1. Marking a downed node
          2. Handling a downed node
          3. Handling slow nodes
        5. Backing up data
          1. Taking a snapshot
          2. Incremental backups
          3. Restoring from a snapshot
        6. Summary