Preface

Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems (RDBMSs).

Cassandra first started as an Incubator project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made releases ever since. Cassandra is being used in production by some of the biggest companies on the Web, including Facebook, Twitter, and Netflix.

Its popularity is due in large part to the outstanding technical features it provides. It is durable, seamlessly scalable, and tunably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a data model based on the Cassandra Query Language (CQL).

Is This Book for You?

This book is intended for a variety of audiences. It should be useful to you if you are:

  • A developer working with large-scale, high-volume applications, such as Web 2.0 social applications or ecommerce sites

  • An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores

  • A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store

  • A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy

  • A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options

This book is a technical guide. In many ways, Cassandra represents a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well versed in thinking about data in purely relational or object-oriented terms. Cassandra’s data model is very different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).

You do not have to be a Java developer to use Cassandra. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Even if you never read the source, knowing Java can help you to better understand exceptions, how to build the source code, and how to use some of the popular clients. Many of the examples in this book are in Java. But because of the interface used to access Cassandra, you can use it from a wide variety of languages, including C#, Python, Node.js, PHP, and Ruby.

Finally, we assume that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ant is used to build Cassandra, and the Cassandra source code is available via Git. Where we expect that you’ll need to do some setup of your own in order to work with the examples, we try to help with it.

What’s in This Book?

This book is designed with the chapters acting, to a reasonable extent, as standalone guides. This is important for a book on Cassandra, which has a variety of audiences and is changing rapidly. To borrow from the software world, the book is designed to be “modular.” If you’re new to Cassandra, it makes sense to read the book in order; if you’ve passed the introductory stages, you will still find value in the later chapters, which you can read on their own.

Here is how the book is organized:

Chapter 1, Beyond Relational Databases

This chapter reviews the history of the enormously successful relational database and the recent rise of non-relational database technologies like Cassandra.

Chapter 2, Introducing Cassandra

This chapter introduces Cassandra and discusses what’s exciting and different about it, where it came from, and what its advantages are.

Chapter 3, Installing Cassandra

This chapter walks you through installing Cassandra, getting it running, and trying out some of its basic features.

Chapter 4, The Cassandra Query Language

Here we look at Cassandra’s data model, highlighting how it differs from the traditional relational model. We also explore how this data model is expressed in the Cassandra Query Language (CQL).

Chapter 5, Data Modeling

This chapter introduces principles and processes for data modeling in Cassandra. We analyze a well-understood domain to produce a working schema.

Chapter 6, The Cassandra Architecture

This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.

Chapter 7, Configuring Cassandra

This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices.

Chapter 8, Clients

A variety of clients are available for different languages, including Java, Python, Node.js, Ruby, C#, and PHP, each of which abstracts Cassandra’s lower-level API. We help you understand common driver features.

Chapter 9, Reading and Writing Data

We build on the previous chapters to learn how Cassandra works “under the covers” to read and write data. We’ll also discuss concepts such as batches, lightweight transactions, and paging.

Chapter 10, Monitoring

Once your cluster is up and running, you’ll want to monitor its usage, memory patterns, and thread patterns, and understand its general activity. Cassandra has a rich Java Management Extensions (JMX) interface baked in, which we put to use to monitor all of these and more.

Chapter 11, Maintenance

The ongoing maintenance of a Cassandra cluster is made somewhat easier by some tools that ship with the server. We see how to decommission a node, load balance the cluster, get statistics, and perform other routine operational tasks.

Chapter 12, Performance Tuning

One of Cassandra’s most notable features is its speed. Even so, there are a number of things, including memory settings, data storage, hardware choices, caching, and buffer sizes, that you can tune to squeeze out even more performance.

Chapter 13, Security

NoSQL technologies are often slighted as being weak on security. Thankfully, Cassandra provides authentication, authorization, and encryption features, which we’ll learn how to configure in this chapter.

Chapter 14, Deploying and Integrating

We close the book with a discussion of considerations for planning cluster deployments, including cloud deployments using providers such as Amazon, Microsoft, and Google. We also introduce several technologies that are frequently paired with Cassandra to extend its capabilities.

Cassandra Versions Used in This Book

This book was developed using Apache Cassandra 3.0 and the DataStax Java Driver version 3.0. The formatting and content of tool output, log files, configuration files, and error messages are as they appear in the 3.0 release, and may change in future releases.

When discussing features added in releases 2.0 and later, we cite the release in which the feature was added for readers who may be using earlier versions and are considering whether to upgrade.

New for the Second Edition

The first edition of Cassandra: The Definitive Guide was the first book published on Cassandra, and has remained highly regarded over the years. However, the Cassandra landscape has changed significantly since 2010, both in terms of the technology itself and the community that develops and supports that technology. Here’s a summary of the key updates we’ve made to bring the book up to date:

A sense of history

The first edition was written against the 0.7 release in 2010. As of 2016, we’re up to the 3.X series. The most significant change has been the introduction of CQL and deprecation of the old Thrift API. Other new architectural features include secondary indexes, materialized views, and lightweight transactions. We provide a summary release history in Chapter 2 to help guide you through the changes. As we introduce new features throughout the text, we frequently cite the releases in which these features were added.

Giving developers a leg up

Development and testing with Cassandra have changed a lot over the years, with the introduction of the CQL shell (cqlsh) and the gradual replacement of community-developed clients with the drivers provided by DataStax. We give in-depth treatment to cqlsh in Chapters 3 and 4, and the drivers in Chapters 8 and 9. We also provide an expanded description of Cassandra’s read path and write path in Chapter 9 to enhance your understanding of the internals and help you understand the impact of decisions.

Maturing Cassandra operations

As more and more individuals and organizations have deployed Cassandra in production environments, the knowledge base of production challenges and best practices to meet those challenges has increased. We’ve added entirely new chapters on security (Chapter 13) and deployment and integration (Chapter 14), and greatly expanded the monitoring, maintenance, and performance tuning chapters (Chapters 10 through 12) in order to relate this collected wisdom.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

The code examples found in this book are available for download at https://github.com/jeffreyscarpenter/cassandra-guide.

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Cassandra: The Definitive Guide, Second Edition, by Jeff Carpenter. Copyright 2016 Jeff Carpenter, 978-1-491-93366-4.”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us.

O’Reilly Safari

Note

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/cassandra2e.

To comment or ask technical questions about this book, send email to the publisher.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

There are many wonderful people to whom we are grateful for helping bring this book to life.

Thank you to our technical reviewers: Stu Hood, Robert Schneider, and Gary Dusbabek contributed thoughtful reviews to the first edition, while Andrew Baker, Ewan Elliot, Kirk Damron, Corey Cole, Jeff Jirsa, and Patrick McFadin reviewed the second edition. Chris Judson’s feedback was key to the maturation of Chapter 14.

Thank you to Jonathan Ellis and Patrick McFadin for writing forewords for the first and second editions, respectively. Thanks also to Patrick for his contributions to the Spark integration section in Chapter 14.

Thanks to our editors, Mike Loukides and Marie Beaugureau, for their constant support and for making this a better book.

Jeff would like to thank Eben for entrusting him with the opportunity to update such a well-regarded, foundational text, and for Eben’s encouragement from start to finish.

Finally, we’ve been inspired by the many terrific developers who have contributed to Cassandra. Hats off for making such an elegant and powerful database.
