Learning Cassandra for Administrators

Book Description

Understand the immense capabilities of Cassandra in managing large amounts of data and learn how to ensure that data is always available. This practical, hands-on guide takes you through every stage from installation to performance tuning.

  • Install and set up a multi-datacenter Cassandra cluster
  • Troubleshoot and tune Cassandra
  • Covers CAP tradeoffs and physical/hardware limitations, and helps you understand why Cassandra behaves the way it does
  • Tune your kernel and JVM to maximize performance
  • Includes security, monitoring metrics, Hadoop configuration, and query tracing

In Detail

    Apache Cassandra is a massively scalable open source NoSQL database. It is well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud, and it delivers linear scalability and performance across many commodity servers with no single point of failure.

    This book starts by explaining the basic concepts, the CAP theorem, and the trade-offs behind Cassandra’s design, then moves gradually to setting up and configuring high-performance clusters. You will learn how to install and configure a Cassandra cluster and how to tune it for performance. The book also explains how to configure Hadoop, vnodes, and multi-DC clusters, how to enable tracing and various security features, and how to query data from Cassandra. You will learn about the kernel and JVM tuning parameters that can be adjusted to get the most out of system resources.

    The book focuses on use cases and problems, anti-patterns, and potential practical solutions rather than raw techniques. After reading it, you should understand why the system works the way it does, and you should be able to recognize patterns (use cases) as well as anti-patterns that could cause performance degradation. By understanding the components of Cassandra’s architecture, administrators will understand the system better and be more productive in operating the cluster.

    Table of Contents

    1. Learning Cassandra for Administrators
      1. Table of Contents
      2. Learning Cassandra for Administrators
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Errata
          2. Piracy
          3. Questions
      8. 1. Basic Concepts and Architecture
        1. CAP theorem
        2. BigTable / Log-structured data model
          1. Column families
          2. Keyspace
          3. Sorted String Table (SSTable)
          4. Memtable
          5. Compaction
        3. Partitioning and replication Dynamo style
          1. Gossip protocol
          2. Distributed hash table
          3. Eventual consistency
        4. Summary
      9. 2. Installing Cassandra
        1. Memory, CPU, and network requirements
        2. Cassandra in-memory data structures
          1. Index summary
          2. Bloom filter
          3. Compression metadata
          4. SSDs versus spinning disks
          5. Key cache
          6. Row cache
        3. Downloading/choosing binaries to install
          1. Configuring cassandra-env.sh
          2. Configuring Cassandra.yaml
            1. cluster_name
            2. seed_provider
            3. Partitioner
            4. auto_bootstrap
            5. broadcast_address
            6. commitlog_directory
            7. data_file_directories
            8. disk_failure_policy
            9. initial_token
            10. listen_address/rpc_address
            11. Ports
            12. endpoint_snitch
            13. commitlog_sync
            14. commitlog_segment_size_in_mb
            15. commitlog_total_space_in_mb
            16. Key cache and row cache saved to disk
            17. compaction_preheat_key_cache
            18. row_cache_provider
            19. column_index_size_in_kb
            20. compaction_throughput_mb_per_sec
            21. in_memory_compaction_limit_in_mb
            22. concurrent_compactors
            23. populate_io_cache_on_flush
            24. concurrent_reads
            25. concurrent_writes
            26. flush_largest_memtables_at
            27. index_interval
            28. memtable_total_space_in_mb
            29. memtable_flush_queue_size
            30. memtable_flush_writers
            31. stream_throughput_outbound_megabits_per_sec
            32. request_scheduler
            33. request_scheduler_options
            34. rpc_keepalive
            35. rpc_server_type
            36. thrift_framed_transport_size_in_mb
            37. rpc_max_threads
            38. rpc_min_threads
            39. Timeouts
          3. Dynamic snitch
          4. Backup configurations
            1. incremental_backups
            2. auto_snapshot
        4. Cassandra on EC2 instance
          1. Snitch
        5. Create a keyspace
          1. Creating a column family
            1. GC grace period
            2. Compaction
            3. Minimum and maximum compaction threshold
          2. Secondary indexes
          3. Composite primary key type
            1. Options
          4. read_repair_chance and dclocal_read_repair_chance
        6. Summary
      10. 3. Inserting Data and Manipulating Data
        1. Querying data
          1. USE
          2. CREATE
          3. ALTER
          4. DESCRIBE
          5. SELECT
        2. Tracing
        3. Data modeling
          1. Types of columns
          2. Common Cassandra data models
            1. Denormalization
            2. Creating a counter column family
            3. Tweet data structure
            4. Secondary index examples
              1. Creating a secondary index table
              2. Internal data structure
              3. Indexed column family
              4. Creating an index
        4. Summary
      11. 4. Administration and Large Deployments
        1. Manual repair
        2. Bootstrapping
          1. Vnodes
            1. Node tool commands
            2. Cfhistograms
            3. Cleanup
            4. Decommission
            5. Drain
        3. Monitoring tools
          1. DataStax OpsCenter
          2. Basic JMX monitoring
        4. Summary
      12. 5. Performance Tuning
        1. vmstat
        2. iostat
        3. dstat
        4. Garbage collection
          1. Enabling GC logging
          2. Understanding GCLogs
            1. Stop-the-world GC
            2. The jstat tool
            3. The jmap tool
          3. The write surveillance mode
        5. Tuning memtables
          1. memtable_flush_writers
        6. Compaction tuning
          1. SizeTieredCompactionStrategy
          2. LeveledCompactionStrategy
        7. Compression
          1. NodeTool
          2. compactionstats
          3. netstats
          4. tpstats
          5. Cassandra's caches
            1. Filesystem caches
          6. Separate drive for commit logs
          7. Tuning the kernel for Cassandra
          8. noop scheduler
          9. NUMA
          10. Other tuning parameters
          11. Dynamic snitch
          12. Configuring a Cassandra multiregion cluster
        8. Summary
      13. 6. Analytics
        1. Hadoop integration
          1. Configuring Hadoop with Cassandra
            1. Virtual datacenter
              1. PropertyFileSnitch
              2. GossipingPropertyFileSnitch
              3. DSE Hadoop
          2. Acunu Analytics
          3. Reading data directly from Cassandra
          4. Analytics on backups
            1. File streaming
              1. Keyspace and column family settings
              2. Communication configuration using the Thrift interface with Cassandra
              3. HDFS location of the temporary files
        2. Summary
      14. 7. Security and Troubleshooting
        1. Encryption
          1. Creating a keystore
          2. Creating a truststore
          3. Transparent data encryption
            1. Keyspace authentication (simple authenticator)
            2. JMX authentication
        2. Audit
        3. Things to look out for
        4. Summary
      15. Index