Mastering Apache Cassandra 3.x - Third Edition

Book description

Build, manage, and configure a high-performing, reliable NoSQL database for your applications with Cassandra

Key Features

  • Write programs more efficiently using Cassandra's features with the help of examples
  • Configure Cassandra and fine-tune its parameters depending on your needs
  • Integrate the Cassandra database with Apache Spark and build a strong data analytics pipeline

Book Description

With ever-increasing rates of data creation, storing data quickly and reliably has become a necessity. Apache Cassandra is the perfect choice for building fault-tolerant and scalable databases. Mastering Apache Cassandra 3.x teaches you how to build and architect your clusters, configure and work with your nodes, and program in a high-throughput environment, helping you understand the power of Cassandra through its new features.

Once you've covered a brief recap of the basics, you'll move on to deploying and monitoring a production setup, then optimizing it and integrating it with other software. You'll work with the advanced features of CQL and the new storage engine in order to understand how they function on the server side. You'll explore how Cassandra's components integrate and interact, then examine features such as the token allocation algorithm, CQL3, vnodes, lightweight transactions, and data modeling in detail. Finally, you'll get to grips with Apache Spark.

By the end of this book, you'll be able to analyse big data, and build and manage high-performance databases for your application.

What you will learn

  • Write programs more efficiently using Cassandra's features
  • Exploit the given infrastructure, improve performance, and tweak the Java Virtual Machine (JVM)
  • Use CQL3 in your application in order to simplify working with Cassandra
  • Configure Cassandra and fine-tune its parameters depending on your needs
  • Set up a cluster and learn how to scale it
  • Monitor a Cassandra cluster in different ways
  • Use Apache Spark and other big data processing tools

Who this book is for

Mastering Apache Cassandra 3.x is for you if you are a big data administrator, database administrator, architect, or developer who wants to build a high-performing, scalable, and fault-tolerant database. Prior knowledge of core concepts of databases is required.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Mastering Apache Cassandra 3.x Third Edition
  3. Packt Upsell
    1. Why subscribe?
    2. Packt.com
  4. Foreword
  5. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Conventions used
    4. Get in touch
      1. Reviews
  7. Quick Start
    1. Introduction to Cassandra
      1. High availability
      2. Distributed
      3. Partitioned row store
    2. Installation
    3. Configuration
      1. cassandra.yaml
      2. cassandra-rackdc.properties
    4. Starting Cassandra
    5. Cassandra Cluster Manager
    6. A quick introduction to the data model
      1. Using Cassandra with cqlsh
    7. Shutting down Cassandra
    8. Summary
  8. Cassandra Architecture
    1. Why was Cassandra created?
      1. RDBMS and problems at scale
      2. Cassandra and the CAP theorem
    2. Cassandra's ring architecture
      1. Partitioners
        1. ByteOrderedPartitioner
        2. RandomPartitioner
        3. Murmur3Partitioner
      2. Single token range per node
      3. Vnodes
    3. Cassandra's write path
    4. Cassandra's read path
    5. On-disk storage
      1. SSTables
        1. How data was structured in prior versions
        2. How data is structured in newer versions
    6. Additional components of Cassandra
      1. Gossiper
      2. Snitch
      3. Phi failure-detector
      4. Tombstones
      5. Hinted handoff
      6. Compaction
      7. Repair
        1. Merkle tree calculation
        2. Streaming data
      8. Read repair
      9. Security
        1. Authentication
        2. Authorization
        3. Managing roles
        4. Client-to-node SSL
        5. Node-to-node SSL
    7. Summary
  9. Effective CQL
    1. An overview of Cassandra data modeling
      1. Cassandra storage model for early versions up to 2.2
        1. Cassandra storage model for versions 3.0 and beyond
        2. Data cells
    2. cqlsh
      1. Logging into cqlsh
        1. Problems connecting to cqlsh
          1. Local cluster without security enabled
          2. Remote cluster with user security enabled
          3. Remote cluster with auth and SSL enabled
        2. Connecting with cqlsh over SSL
          1. Converting the Java keyStore into a PKCS12 keyStore
          2. Exporting the certificate from the PKCS12 keyStore
          3. Modifying your cqlshrc file
          4. Testing your connection via cqlsh
    3. Getting started with CQL
      1. Creating a keyspace
        1. Single data center example
        2. Multi-data center example
      2. Creating a table
        1. Simple table example
        2. Clustering key example
        3. Composite partition key example
        4. Table options
      3. Data types
        1. Type conversion
      4. The primary key
        1. Designing a primary key
          1. Selecting a good partition key
          2. Selecting a good clustering key
      5. Querying data
        1. The IN operator
      6. Writing data
        1. Inserting data
        2. Updating data
        3. Deleting data
      7. Lightweight transactions
      8. Executing a BATCH statement
      9. The expiring cell
      10. Altering a keyspace
      11. Dropping a keyspace
      12. Altering a table
      13. Truncating a table
      14. Dropping a table
        1. Truncate versus drop
      15. Creating an index
        1. Caution with implementing secondary indexes
      16. Dropping an index
      17. Creating a custom data type
      18. Altering a custom type
      19. Dropping a custom type
      20. User management
        1. Creating a user and role
        2. Altering a user and role
        3. Dropping a user and role
        4. Granting permissions
        5. Revoking permissions
      21. Other CQL commands
        1. COUNT
        2. DISTINCT
        3. LIMIT
        4. STATIC
        5. User-defined functions
      22. cqlsh commands
        1. CONSISTENCY
        2. COPY
        3. DESCRIBE
        4. TRACING
    4. Summary
  10. Configuring a Cluster
    1. Evaluating instance requirements
      1. RAM
      2. CPU
      3. Disk
        1. Solid state drives
        2. Cloud storage offerings
        3. SAN and NAS
      4. Network
        1. Public cloud networks
        2. Firewall considerations
      5. Strategy for many small instances versus few large instances
    2. Operating system optimizations
      1. Disable swap
      2. XFS
      3. Limits
        1. limits.conf
        2. sysctl.conf
      4. Time synchronization
    3. Configuring the JVM
      1. Garbage collection
        1. CMS
        2. G1GC
        3. Garbage collection with Cassandra
        4. Installation of JVM
      2. JCE
    4. Configuring Cassandra
      1. cassandra.yaml
      2. cassandra-env.sh
      3. cassandra-rackdc.properties
        1. dc
        2. rack
        3. dc_suffix
        4. prefer_local
      4. cassandra-topology.properties
      5. jvm.options
      6. logback.xml
    5. Managing a deployment pipeline
      1. Orchestration tools
      2. Configuration management tools
      3. Recommended approach
      4. Local repository for downloadable files
    6. Summary
  11. Performance Tuning
    1. Cassandra-Stress
      1. The Cassandra-Stress YAML file
        1. name
        2. size
        3. population
        4. cluster
      2. Cassandra-Stress results
    2. Write performance
      1. Commitlog mount point
      2. Scaling out
        1. Scaling out a data center
    3. Read performance
      1. Compaction strategy selection
        1. Optimizing read throughput for time-series models
        2. Optimizing tables for read-heavy models
      2. Cache settings
        1. Appropriate uses for row-caching
      3. Compression
        1. Chunk size
      4. The bloom filter configuration
      5. Read performance issues
    4. Other performance considerations
      1. JVM configuration
      2. Cassandra anti-patterns
        1. Building a queue
        2. Query flexibility
        3. Querying an entire table
        4. Incorrect use of BATCH
      3. Network
    5. Summary
  12. Managing a Cluster
    1. Revisiting nodetool
      1. A warning about using nodetool
    2. Scaling up
      1. Adding nodes to a cluster
        1. Cleaning up the original nodes
      2. Adding a new data center
        1. Adjusting the cassandra-rackdc.properties file
        2. A warning about SimpleStrategy
        3. Streaming data
    3. Scaling down
      1. Removing nodes from a cluster
        1. Removing a live node
        2. Removing a dead node
          1. Other removenode options
        3. When removenode doesn't work (nodetool assassinate)
          1. Assassinating a node on an older version
      2. Removing a data center
    4. Backing up and restoring data
      1. Taking snapshots
      2. Enabling incremental backups
      3. Recovering from snapshots
    5. Maintenance
      1. Replacing a node
      2. Repair
        1. A warning about incremental repairs
        2. Cassandra Reaper
        3. Forcing read repairs at consistency – ALL
      3. Clearing snapshots and incremental backups
        1. Snapshots
        2. Incremental backups
      4. Compaction
        1. Why you should never invoke compaction manually
        2. Adjusting compaction throughput due to available resources
    6. Summary
  13. Monitoring
    1. JMX interface
      1. MBean packages exposed by Cassandra
      2. JConsole (GUI)
        1. Connection and overview
        2. Viewing metrics
        3. Performing an operation
      3. JMXTerm (CLI)
        1. Connection and domains
        2. Getting a metric
        3. Performing an operation
    2. The nodetool utility
      1. Monitoring using nodetool
        1. describecluster
        2. gcstats
        3. getcompactionthreshold
        4. getcompactionthroughput
        5. getconcurrentcompactors
        6. getendpoints
        7. getlogginglevels
        8. getstreamthroughput
        9. gettimeout
        10. gossipinfo
        11. info
        12. netstats
        13. proxyhistograms
        14. status
        15. tablestats
        16. tpstats
        17. verify
      2. Administering using nodetool
        1. cleanup
        2. drain
        3. flush
        4. resetlocalschema
        5. stopdaemon
        6. truncatehints
        7. upgradeSSTable
    3. Metric stack
      1. Telegraf
        1. Installation
        2. Configuration
      2. JMXTrans
        1. Installation
        2. Configuration
      3. InfluxDB
        1. Installation
        2. Configuration
        3. InfluxDB CLI
      4. Grafana
        1. Installation
        2. Configuration
        3. Visualization
      5. Alerting
        1. Custom setup
    4. Log stack
      1. The system/debug/gc logs
      2. Filebeat
        1. Installation
        2. Configuration
      3. Elasticsearch
        1. Installation
        2. Configuration
      4. Kibana
        1. Installation
        2. Configuration
    5. Troubleshooting
      1. High CPU usage
      2. Different garbage-collection patterns
      3. Hotspots
      4. Disk performance
      5. Node flakiness
    6. All-in-one Docker
      1. Creating a database and other monitoring components locally
      2. Web links
    7. Summary
  14. Application Development
    1. Getting started
      1. The path to failure
      2. Is Cassandra the right database?
        1. Good use cases for Apache Cassandra
        2. Use and expectations around application data consistency
      3. Choosing the right driver
    2. Building a Java application
      1. Driver dependency configuration with Apache Maven
      2. Connection class
        1. Other connection options
          1. Retry policy
          2. Default keyspace
          3. Port
          4. SSL
          5. Connection pooling options
      3. Starting simple – Hello World!
      4. Using the object mapper
      5. Building a data loader
        1. Asynchronous operations
        2. Data loader example
    3. Summary
  15. Integration with Apache Spark
    1. Spark
      1. Architecture
      2. Installation
        1. Running custom Spark Docker locally
      3. Configuration
      4. The web UI
        1. Master
        2. Worker
        3. Application
    2. PySpark
      1. Connection config
      2. Accessing Cassandra data
    3. SparkR
      1. Connection config
      2. Accessing Cassandra data
    4. RStudio
      1. Connection config
      2. Accessing Cassandra data
    5. Jupyter
      1. Architecture
      2. Installation
      3. Configuration
      4. Web UI
    6. PySpark through Jupyter
    7. Summary
  16. References
    1. Chapter 1 – Quick Start
    2. Chapter 2 – Cassandra Architecture
    3. Chapter 3 – Effective CQL
    4. Chapter 4 – Configuring a Cluster
    5. Chapter 5 – Performance Tuning
    6. Chapter 6 – Managing a Cluster
    7. Chapter 7 – Monitoring
    8. Chapter 8 – Application Development
    9. Chapter 9 – Integration with Apache Spark
  17. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Mastering Apache Cassandra 3.x - Third Edition
  • Author(s): Aaron Ploetz, Tejaswi Malepati, Nishant Neeraj
  • Release date: October 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789131499