Mastering Apache Cassandra

Book Description

Learn how to build more robust, scalable databases using Cassandra. Aimed at beginner to intermediate readers, this practical guide covers all the bases to help you get the most out of your infrastructure and use the full potential of Cassandra.

  • Complete coverage of all aspects of Cassandra

  • Discusses prominent patterns, pros and cons, and use cases

  • Contains briefs on integration with other software

In Detail

    Apache Cassandra is the perfect choice for building fault-tolerant and scalable databases. Implementing Cassandra lets you take advantage of its features, which include replication of data across multiple datacenters with low latency. This book details these features and guides you towards mastering the art of building high-performing databases.

    Mastering Apache Cassandra aims to give you enough knowledge to program pragmatically and to understand the limitations of Cassandra. You will also learn how to deploy and monitor a production setup, understand what happens under the hood, and optimize Cassandra and integrate it with other software.

    Mastering Apache Cassandra begins with a discussion of Cassandra’s philosophy and design decisions, and helps you understand how you can implement it to resolve business issues and run complex applications simultaneously.

    You will also learn how the various components of Cassandra work with each other to form a robust distributed system. The different mechanisms that it provides to solve old problems in new ways are not as twisted as they seem; Cassandra is all about simplicity. Learn how to set up a cluster that can face a tornado of data reads and writes without wincing.

    If you are a beginner, you can use the examples to play around with Cassandra and test the waters. If you are at an intermediate level, you may prefer to use this guide to dive into the architecture. If you are a DevOps engineer, this book will help you manage and optimize your infrastructure. If you are a CTO, this book will help you unleash the power of Cassandra and discover the resources that it requires.

    Table of Contents

    1. Mastering Apache Cassandra
      1. Table of Contents
      2. Mastering Apache Cassandra
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Quick Start
        1. Introduction to Cassandra
          1. Distributed database
          2. High availability
          3. Replication
          4. Multiple data centers
        2. A brief introduction to a data model
        3. Installing Cassandra locally
        4. CRUD with cassandra-cli
        5. Cassandra in action
          1. Modeling data
          2. Writing code
            1. Setting up
            2. Application
        6. Summary
      10. 2. Cassandra Architecture
        1. Problems in the RDBMS world
        2. Enter NoSQL
          1. The CAP theorem
            1. Consistency
            2. Availability
            3. Partition-tolerance
          2. Significance of the CAP theorem
        3. Cassandra
        4. Cassandra architecture
          1. Ring representation
          2. How Cassandra works
            1. Write in action
            2. Read in action
          3. Components of Cassandra
            1. Messaging service
            2. Gossip
            3. Failure detection
            4. Partitioner
            5. Replication
            6. Log Structured Merge tree
            7. CommitLog
            8. MemTable
            9. SSTable
              1. Bloom filter
              2. Index files
              3. Datafiles
            10. Compaction
            11. Tombstones
            12. Hinted handoff
            13. Read repair and Anti-entropy
              1. Merkle tree
        5. Summary
      11. 3. Design Patterns
        1. The Cassandra data model
          1. The counter column
          2. The expiring column
          3. The super column
          4. The column family
          5. Keyspaces
          6. Data types – comparators and validators
            1. Writing a custom comparator
            2. The primary index
            3. The wide-row index
            4. Simple groups
            5. Sorting for free, free as in speech
            6. An inverse index with a super column family
            7. An inverse index with composite keys
            8. The secondary index
        2. Patterns and antipatterns
          1. Avoid storing an entity in a single column (wherever possible)
          2. Atomic update
          3. Managing time series data
            1. Wide-row time series
            2. High throughput rows and hotspots
            3. Advanced time series
          4. Avoid super columns
          5. Transaction woes
          6. Use expiring columns
          7. batch_mutate
        3. Summary
      12. 4. Deploying a Cluster
        1. Evaluating requirements
          1. Hard disk capacity
            1. RAM
            2. CPU
            3. Nodes
            4. Network
        2. System configurations
          1. Optimizing user limits
          2. Swapping memory
          3. Clock synchronization
          4. Disk readahead
        3. The required software
          1. Installing Oracle Java 6
            1. RHEL and CentOS systems
            2. Debian and Ubuntu systems
          2. Installing the Java Native Access (JNA) library
        4. Installing Cassandra
          1. Installing from a tarball
          2. Installing from ASFRepository for Debian/Ubuntu
          3. Anatomy of the installation
            1. Cassandra binaries
            2. Configuration files
              1. Setting up Cassandra's data directory and commit log directory
        5. Configuring a Cassandra cluster
          1. The cluster name
          2. The seed node
            1. Listen, broadcast, and RPC addresses
          3. Initial token
          4. Partitioners
            1. The random partitioner
            2. The byte-ordered partitioner
            3. The Murmur3 partitioner
          5. Snitches
            1. SimpleSnitch
            2. PropertyFileSnitch
            3. GossipingPropertyFileSnitch
            4. RackInferringSnitch
            5. EC2Snitch
            6. EC2MultiRegionSnitch
          6. Replica placement strategies
            1. SimpleStrategy
            2. NetworkTopologyStrategy
              1. NetworkTopologyStrategy and multiple data center setups
          7. Launching a cluster with a script
          8. Creating a keyspace
        6. Authorization and authentication
        7. Summary
      13. 5. Performance Tuning
        1. Stress testing
        2. Performance tuning
          1. Write performance
          2. Read performance
            1. Choosing the right compaction strategy
            2. Size tiered compaction strategy
            3. Leveled compaction
            4. Row cache
            5. Key cache
            6. Cache settings
            7. Enabling compression
            8. Tuning the bloom filter
          3. More tuning via cassandra.yaml
            1. index_interval
            2. commitlog_sync
            3. column_index_size_in_kb
            4. commitlog_total_space_in_mb
          4. Tweaking JVM
            1. Java heap
            2. Garbage collection
            3. Other JVM options
          5. Scaling horizontally and vertically
          6. Network
        3. Summary
      14. 6. Managing a Cluster – Scaling, Node Repair, and Backup
        1. Scaling
          1. Adding nodes to a cluster
          2. Removing nodes from a cluster
            1. Removing a live node
            2. Removing a dead node
        2. Replacing a node
        3. Backup and restoration
          1. Using Cassandra bulk loader to restore the data
        4. Load balancing
        5. Priam – managing large clusters on AWS
        6. Summary
      15. 7. Monitoring
        1. Cassandra JMX interface
          1. Accessing MBeans using JConsole
        2. Cassandra nodetool
          1. Monitoring with nodetool
            1. cfstats
            2. netstats
            3. ring and describering
            4. tpstats
            5. compactionstats
            6. info
          2. Administrating with nodetool
            1. drain
            2. decommission
            3. move
            4. removetoken
            5. repair
            6. upgradesstable
            7. snapshot
        3. DataStax OpsCenter
          1. OpsCenter Features
          2. Installing OpsCenter and an agent
            1. Prerequisites
            2. Running a Cassandra cluster
            3. Installing OpsCenter from Tarball
            4. Setting up an OpsCenter agent
          3. Monitoring and administrating with OpsCenter
          4. Other features of OpsCenter
        4. Nagios – monitoring and notification
          1. Installing Nagios
            1. Prerequisites
            2. Preparation
            3. Installation
              1. Installing Nagios
              2. Configuring Apache httpd
              3. Installing Nagios plugins
              4. Setting up Nagios as a service
            4. Nagios plugins
              1. Nagios plugins for Cassandra
              2. Executing remote plugins via an NRPE plugin
                1. Installing NRPE on host machines
                2. Installing NRPE plugin on a Nagios machine
              3. Setting things up to monitor
              4. Monitoring and notification using Nagios
        5. Cassandra log
          1. Enabling Java Options for GC Logging
        6. Troubleshooting
          1. High CPU usage
          2. High memory usage
          3. Hotspots
          4. OpenJDK may behave erratically
          5. Disk performance
          6. Slow snapshot
          7. Getting help from the mailing list
        7. Summary
      16. 8. Integration
        1. Using Hadoop
        2. Hadoop and Cassandra
          1. Introduction to Hadoop
            1. HDFS – Hadoop Distributed File System
            2. Data management
              1. NameNode
              2. DataNodes
            3. Hadoop MapReduce
              1. JobTracker
              2. TaskTracker
            4. Reliability of data and process in Hadoop
          2. Setting up local Hadoop
          3. Testing the installation
        3. Cassandra with Hadoop MapReduce
          1. ColumnFamilyInputFormat
          2. ColumnFamilyOutputFormat
          3. ConfigHelper
            1. Wide-row support
            2. Bulk loading
            3. Secondary index support
        4. Cassandra and Hadoop in action
          1. Executing, debugging, monitoring, and looking at results
        5. Hadoop in Cassandra cluster
          1. Cassandra filesystem
        6. Integration with Pig
          1. Installing Pig
          2. Integrating Pig and Cassandra
        7. Cassandra and Solr
          1. Development note on Solandra
            1. DataStax Enterprise – the next level Solr integration
        8. Summary
      17. 9. Introduction to CQL 3 and Cassandra 1.2
        1. CQL – the Cassandra Query Language
        2. CQL 3 for Thrift refugees
          1. Wide rows
          2. Composite columns
        3. CQL 3 basics
          1. The CREATE KEYSPACE query
          2. The CREATE TABLE query
          3. Compact storage
          4. Creating a secondary index
          5. The INSERT query
          6. The SELECT query
          7. select expression
          8. The WHERE clause
          9. The ORDER BY clause
          10. The LIMIT clause
          11. The USING CONSISTENCY clause
          12. The UPDATE query
          13. The DELETE query
          14. The TRUNCATE query
          15. The ALTER TABLE query
            1. Adding a new column
            2. Dropping an existing column
            3. Modifying the data type of an existing column
            4. Altering table options
          16. The ALTER KEYSPACE query
          17. BATCH querying
          18. The DROP INDEX query
          19. The DROP TABLE query
          20. The DROP KEYSPACE query
          21. The USE statement
        4. What's new in Cassandra 1.2?
          1. Virtual Nodes
          2. Off-heap Bloom filters
          3. JBOD improvements
          4. Parallel leveled compaction
          5. Murmur3 partitioner
          6. Atomic batches
          7. Query profiling
          8. Collections support
            1. Sets
            2. Lists
            3. Maps
        5. Support for programming languages
        6. Summary
      18. Index