Cassandra High Performance Cookbook

Book Description

Dig deep into the full capabilities of Apache Cassandra using the 150+ recipes in this Cookbook, which covers everything from configuring and tuning to using third-party applications.

  • Get the best out of Cassandra using this efficient recipe bank

  • Configure and tune Cassandra components to enhance performance

  • Deploy Cassandra in various environments and monitor its performance

  • Well-illustrated, step-by-step recipes to make all tasks look easy!

In Detail

    Apache Cassandra is a fault-tolerant, distributed data store that offers linear scalability, making it a suitable storage platform for large, high-volume websites.

    This book provides detailed recipes that describe how to use the features of Cassandra and improve its performance. Recipes cover topics ranging from setting up Cassandra for the first time to complex multiple data center installations. The recipe format presents the information in a concise, actionable form.

    The book describes in detail how features of Cassandra can be tuned and what the possible effects of tuning can be. Recipes include how to access data stored in Cassandra and how to use third-party tools to help you out. The book also describes how to monitor Cassandra and plan capacity to ensure the cluster performs at a high level. Towards the end, it takes you through the use of libraries and third-party applications with Cassandra, and Cassandra integration with Hadoop.

    Table of Contents

    1. Cassandra High Performance Cookbook
      1. Cassandra High Performance Cookbook
      2. Credits
      3. About the Author
      4. About the Reviewers
      5. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code for this book
          2. Errata
          3. Piracy
          4. Questions
      7. 1. Getting Started
        1. Introduction
        2. A simple single node Cassandra installation
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also...
        3. Reading and writing test data using the command-line interface
          1. How to do it...
          2. How it works...
          3. See also...
        4. Running multiple instances on a single machine
          1. How to do it...
          2. How it works...
          3. See also...
        5. Scripting a multiple instance installation
          1. How to do it...
          2. How it works...
        6. Setting up a build and test environment for tasks in this book
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        7. Running in the foreground with full debugging
          1. How to do it...
          2. How it works...
          3. There's more...
        8. Calculating ideal Initial Tokens for use with Random Partitioner
          1. Getting ready
          2. How to do it...
          3. How it works
          4. There's more...
          5. See also...
        9. Choosing Initial Tokens for use with Partitioners that preserve ordering
          1. How to do it...
          2. How it works...
          3. There's more...
        10. Insight into Cassandra with JConsole
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        11. Connecting with JConsole over a SOCKS proxy
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Connecting to Cassandra with Java and Thrift
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
      8. 2. The Command-line Interface
        1. Connecting to Cassandra with the CLI
          1. How to do it...
          2. How it works...
        2. Creating a keyspace from the CLI
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        3. Creating a column family with the CLI
          1. Getting ready
          2. How to do it...
          3. See also...
        4. Describing a keyspace
          1. How to do it...
          2. How it works...
        5. Writing data with the CLI
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Reading data with the CLI
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        7. Deleting rows and columns from the CLI
          1. How to do it...
          2. How it works...
          3. See also...
        8. Listing and paginating all rows in a column family
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Dropping a keyspace or a column family
          1. How to do it...
          2. How it works...
          3. See also...
        10. CLI operations with super columns
          1. How to do it...
          2. How it works...
          3. There's more...
        11. Using the assume keyword to decode column names or column values
          1. How to do it...
          2. How it works...
          3. There's more...
        12. Supplying time to live information when inserting columns
          1. How to do it...
          2. See also...
        13. Using built-in CLI functions
          1. How to do it...
          2. How it works...
        14. Using column metadata and comparators for type enforcement
          1. How to do it...
          2. How it works...
          3. See also...
        15. Changing the consistency level of the CLI
          1. How to do it...
          2. How it works...
          3. See also...
        16. Getting help from the CLI
          1. How to do it...
          2. How it works...
        17. Loading CLI statements from a file
          1. How to do it...
          2. How it works...
          3. There's more...
      9. 3. Application Programmer Interface
        1. Introduction
        2. Connecting to a Cassandra server
          1. How to do it...
          2. How it works...
          3. There's more
        3. Creating a keyspace and column family from the client
          1. How to do it...
          2. How it works...
          3. See also...
        4. Using MultiGet to limit round trips and overhead
          1. How to do it...
          2. How it works...
        5. Writing unit tests with an embedded Cassandra server
          1. How to do it...
          2. How it works...
          3. See also...
        6. Cleaning up data directories before unit tests
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Generating Thrift bindings for other languages (C++, PHP, and others)
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Using the Cassandra Storage Proxy "Fat Client"
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Using range scans to find and remove old data
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        10. Iterating all the columns of a large key
          1. How to do it...
          2. How it works...
        11. Slicing columns in reverse
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Batch mutations to improve insert performance and code robustness
          1. How to do it...
          2. How it works...
          3. See also...
        13. Using TTL to create columns with self-deletion times
          1. How to do it...
          2. How it works...
          3. See also...
        14. Working with secondary indexes
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
      10. 4. Performance Tuning
        1. Introduction
        2. Choosing an operating system and distribution
          1. How to do it...
          2. How it works...
          3. There's more...
        3. Choosing a Java Virtual Machine
          1. How to do it...
          2. There's more...
          3. See also...
        4. Using a dedicated Commit Log disk
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        5. Choosing a high performing RAID level
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Software vs hardware RAID
            2. Disk performance testing
          5. See also...
        6. File system optimization for hard disk performance
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Boosting read performance with the Key Cache
          1. Getting ready
          2. How to do it...
          3. How it works
          4. There's more...
          5. See also...
        8. Boosting read performance with the Row Cache
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Disabling Swap Memory for predictable performance
          1. How to do it...
          2. How it works...
          3. See also...
        10. Stopping Cassandra from using swap without disabling it system-wide
          1. Getting ready
          2. How to do it...
        11. Enabling Memory Mapped Disk modes
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Tuning Memtables for write-heavy workloads
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        13. Saving memory on 64-bit architectures with compressed pointers
          1. Getting ready
          2. How to do it...
          3. How it works...
        14. Tuning concurrent readers and writers for throughput
          1. How to do it...
          2. How it works...
        15. Setting compaction thresholds
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        16. Garbage collection tuning to avoid JVM pauses
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Large memory systems
          4. See also...
          5. There's more
        17. Raising the open file limit to deal with many clients
          1. How to do it...
          2. How it works...
          3. There's more...
        18. Increasing performance by scaling up
          1. How to do it...
          2. How it works...
        19. Enabling Network Time Protocol on servers and clients
          1. Getting ready
          2. How to do it...
          3. How it works...
      11. 5. Consistency, Availability, and Partition Tolerance with Cassandra
        1. Introduction
        2. Working with the formula for strong consistency
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        3. Supplying the timestamp value with write requests
          1. How to do it...
          2. How it works...
          3. There's more...
        4. Disabling the hinted handoff mechanism
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Adjusting read repair chance for less intensive data reads
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also...
        6. Confirming schema agreement across the cluster
          1. How to do it...
          2. How it works...
          3. There's more...
        7. Adjusting replication factor to work with quorum
          1. How to do it...
          2. How it works...
          3. See also...
        8. Using write consistency ONE, read consistency ONE for low latency operations
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Using write consistency QUORUM, read consistency QUORUM for strong consistency
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Mixing levels: write consistency QUORUM, read consistency ONE
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Choosing consistency over availability with consistency ALL
          1. How to do it...
          2. How it works...
        12. Choosing availability over consistency with write consistency ANY
          1. How to do it...
          2. How it works...
        13. Demonstrating how consistency is not a lock or a transaction
          1. How to do it...
          2. How it works...
          3. See also...
      12. 6. Schema Design
        1. Introduction
        2. Saving disk space by using small column names
          1. How to do it...
          2. How it works...
        3. Serializing data into large columns for smaller index sizes
          1. How to do it...
          2. How it works...
          3. There's more...
        4. Storing time series data effectively
          1. How to do it...
          2. How it works...
        5. Using Super Columns for nested maps
          1. How to do it...
          2. How it works...
          3. There's more...
        6. Using a lower Replication Factor for disk space saving and performance enhancements
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        7. Hybrid Random Partitioner using Order Preserving Partitioner
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Scripting a multiple instance installation with OPP
            2. Using different hash algorithms
        8. Storing large objects
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Using Cassandra for distributed caching
          1. How to do it...
          2. How it works...
        10. Storing large or infrequently accessed data in a separate column family
          1. How to do it...
          2. How it works...
        11. Storing and searching edge graph data in Cassandra
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Developing secondary data orderings or indexes
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
      13. 7. Administration
        1. Defining seed nodes for Gossip Communication
          1. Getting ready
          2. How to do it...
          3. There's more
            1. IP vs Hostname
            2. Keep the seed list synchronized
            3. Seed nodes do not auto bootstrap
            4. Choosing the correct number of seed nodes
        2. Nodetool Move: Moving a node to a specific ring location
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Nodetool Remove: Removing a downed node
          1. How to do it...
          2. How it works...
          3. See also...
        4. Nodetool Decommission: Removing a live node
          1. How to do it...
          2. How it works...
        5. Joining nodes quickly with auto_bootstrap set to false
        6. Generating SSH keys for password-less interaction
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Normal write traffic
            2. Read Repair
            3. Anti-Entropy Repair
          4. How to do it...
          5. How it works...
          6. There's more...
        7. Copying the data directory to new hardware
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more
        8. A node join using external data copy methods
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Nodetool Repair: When to use anti-entropy repair
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Raising the Replication Factor
            2. Joining nodes without auto-bootstrap
            3. Loss of corrupted files
        10. Nodetool Drain: Stable files on upgrade
          1. How to do it...
          2. How it works...
        11. Lowering gc_grace for faster tombstone cleanup
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Data resurrection
        12. Scheduling Major Compaction
          1. How to do it...
          2. How it works...
          3. There's more...
        13. Using nodetool snapshot for backups
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also...
        14. Clearing snapshots with nodetool clearsnapshot
          1. Getting ready
          2. How to do it...
          3. How it works...
        15. Restoring from a snapshot
          1. How to do it...
          2. How it works...
          3. There's more...
        16. Exporting data to JSON with sstable2json
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Extracting specific keys
            2. Excluding specific keys
            3. Saving the exported JSON to a file
            4. Using the xxd command to decode hex values
        17. Nodetool cleanup: Removing excess data
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Topology changes
            2. Hinted handoff and write consistency ANY
          4. See also...
        18. Nodetool Compact: Defragment data and remove deleted data from disk
          1. How to do it...
          2. How it works...
          3. See also...
      14. 8. Multiple Datacenter Deployments
        1. Changing debugging to determine where read operations are being routed
          1. How to do it...
          2. How it works...
          3. See also...
        2. Using IPTables to simulate complex network scenarios in a local environment
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Choosing IP addresses to work with RackInferringSnitch
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also...
        4. Scripting a multiple datacenter installation
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Determining natural endpoints, datacenter, and rack for a given key
          1. How to do it...
          2. How it works...
          3. See also...
        6. Manually specifying Rack and Datacenter configuration with a property file snitch
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        7. Troubleshooting dynamic snitch using JConsole
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Quorum operations in multi-datacenter environments
          1. Getting ready
          2. How it works...
        9. Using traceroute to troubleshoot latency between network devices
          1. How to do it...
          2. How it works...
        10. Ensuring bandwidth between switches in multiple rack environments
          1. How to do it...
          2. There's more...
        11. Increasing rpc_timeout for dealing with latency across datacenters
          1. How to do it...
          2. How it works...
        12. Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments
          1. Getting ready
          2. How to do it...
          3. How it works...
        13. Using the consistency levels TWO and THREE
          1. Getting ready
          2. How to do it...
          3. How it works...
        14. Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. More than two datacenters
            2. Datacenters with differing numbers of nodes
            3. Endpoint Snitch
          5. See also...
      15. 9. Coding and Internals
        1. Introduction
        2. Installing common development tools
          1. How to do it...
          2. How it works...
        3. Building Cassandra from source
          1. How to do it...
          2. How it works...
          3. See also...
        4. Creating your own type by subclassing abstract type
          1. How to do it...
          2. How it works...
          3. See also...
        5. Using validation to check data on insertion
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Communicating with the Cassandra developers and users through IRC and e-mail
          1. How to do it...
          2. How it works...
        7. Generating a diff using subversion's diff feature
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
        8. Applying a diff using the patch command
          1. Before you begin...
          2. How to do it...
          3. How it works...
        9. Using strings and od to quickly search through data files
          1. How to do it...
          2. How it works...
        10. Customizing the sstable2json export utility
          1. How to do it...
          2. How it works...
          3. There's more...
        11. Configure index interval ratio for lower memory usage
          1. How to do it...
          2. How it works...
        12. Increasing phi_convict_threshold for less reliable networks
          1. How to do it...
          2. How it works...
          3. There's more...
        13. Using the Cassandra maven plugin
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
      16. 10. Libraries and Applications
        1. Introduction
        2. Building the contrib stress tool for benchmarking
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        3. Inserting and reading data with the stress tool
          1. Before you begin...
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also...
        4. Running the Yahoo! Cloud Serving Benchmark
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Hector, a high-level client for Cassandra
          1. How to do it...
          2. How it works...
          3. There's more...
        6. Doing batch mutations with Hector
          1. How to do it...
          2. How it works...
        7. Cassandra with Java Persistence Architecture (JPA)
          1. Before you begin...
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Setting up Solandra for full text indexing with a Cassandra backend
          1. How to do it...
          2. How it works...
        9. Setting up Zookeeper to support Cages for transactional locking
          1. How to do it...
          2. How it works...
          3. See also...
        10. Using Cages to implement an atomic read and set
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Using Groovandra as a CLI alternative
          1. How to do it...
          2. How it works...
        12. Searchable log storage with Logsandra
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
      17. 11. Hadoop and Cassandra
        1. Introduction
        2. A pseudo-distributed Hadoop setup
          1. How to do it...
          2. How it works...
          3. There's more...
        3. A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat
          1. How to do it...
          2. How it works...
          3. See also...
        4. A Map-only program that writes to Cassandra using the CassandraOutputFormat
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Using MapReduce to do grouping and counting with Cassandra input and output
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Setting up Hive with Cassandra Storage Handler support
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        7. Defining a Hive table over a Cassandra Column Family
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        8. Joining two Column Families with Hive
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Grouping and counting column values with Hive
          1. How to do it...
          2. How it works...
          3. See also...
        10. Co-locating Hadoop Task Trackers on Cassandra nodes
          1. How to do it...
          2. How it works...
          3. See also...
        11. Setting up a "Shadow" data center for running only MapReduce jobs
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        12. Setting up DataStax Brisk, the combined stack of Cassandra, Hadoop, and Hive
          1. How to do it...
          2. How it works
      18. 12. Collecting and Analyzing Performance Statistics
        1. Finding bottlenecks with nodetool tpstats
          1. How to do it...
          2. How it works...
          3. There's more...
        2. Using nodetool cfstats to retrieve column family statistics
          1. How to do it...
          2. How it works...
          3. See also...
        3. Monitoring CPU utilization
          1. How to do it...
          2. How it works...
          3. See also...
        4. Adding read/write graphs to find active column families
          1. How to do it...
          2. How it works...
          3. There's more...
        5. Using Memtable graphs to profile when and why they flush
          1. How it works...
          2. There's more...
          3. See also...
        6. Graphing SSTable count
          1. How to do it...
          2. There's more...
        7. Monitoring disk utilization and having a performance baseline
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also...
          5. How to do it...
          6. How it works...
          7. See also...
        8. Monitoring compaction by graphing its activity
          1. How it works...
          2. There's more...
          3. See also...
        9. Using nodetool compaction stats to check the progress of compaction
          1. How to do it...
          2. How it works...
        10. Graphing column family statistics to track average/max row sizes
          1. How to do it...
        11. Using latency graphs to profile time to seek keys
          1. How to do it...
          2. How it works...
        12. Tracking the physical disk size of each column family over time
          1. How to do it...
          2. How it works...
        13. Using nodetool cfhistograms to see the distribution of query latencies
          1. How to do it...
          2. How it works...
          3. See also...
        14. Tracking open networking connections
          1. How to do it...
          2. How it works...
          3. There's more...
      19. 13. Monitoring Cassandra Servers
        1. Introduction
        2. Forwarding Log4j logs to a central server
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There’s more...
        3. Using top to understand overall performance
          1. How to do it...
          2. How it works...
          3. There's more...
        4. Using iostat to monitor current disk performance
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        5. Using sar to review performance over time
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Using JMXTerm to access Cassandra JMX
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also...
        7. Monitoring the garbage collection events
          1. How to do it...
          2. How it works...
          3. There’s more...
        8. Using tpstats to find bottlenecks
          1. How to do it...
          2. How it works...
          3. See also...
        9. Creating a Nagios Check Script for Cassandra
          1. How to do it...
        10. Keep an eye out for large rows with compaction limits
          1. How to do it...
          2. How it works...
        11. Reviewing network traffic with IPTraf
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Keep on the lookout for dropped messages
          1. How to do it...
          2. How it works...
        13. Inspecting column families for dangerous conditions
          1. How to do it...
          2. How it works...