
In a previous article we covered the basics of Apache Cassandra, a modern, open source NoSQL database inspired by Google’s BigTable. Cassandra exposes a data model composed of rows, columns, and column families, all stored in a single table. This model calls for different data modeling techniques than traditional relational databases do. In this article we explore these techniques and compare Cassandra’s data model with the SQL model to highlight the differences between the two.

Keys, columns and column families

Cassandra is essentially a key-value store. Conceptually, all data is stored in one ‘table’, each row of which is uniquely identified by a key. However, unlike other key-value stores, Cassandra offers a rich data model in the form of columns and column families. Each row can have multiple columns, which are grouped into column families. The column families have to be predefined, but within each column family an arbitrary number of columns can be added for each row. This organization can be illustrated with the help of the following JSON representation:
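A minimal sketch of such a representation, built as a nested Python dictionary and printed as JSON. The row keys, column families, and column names follow the description in this article; the column values, the "degree" column, and the helper column_families are assumptions made up for illustration.

```python
import json

# Row key -> column family -> column name -> column value.
# All values (and the "degree" column) are illustrative placeholders.
rows = {
    "user1": {
        "Bio": {"name": "Alice", "age": 30},
    },
    "user2": {
        "Bio": {"name": "Bob", "profession": "engineer"},
        "Education": {"degree": "BSc"},
    },
}


def column_families(row_key):
    """Return the names of the column families that hold data for a row."""
    return sorted(rows[row_key])


# Print the whole structure as JSON.
print(json.dumps(rows, indent=2))
```

Calling column_families("user1") and column_families("user2") shows that the two rows do not populate the same column families.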

“user1” and “user2” are the keys of the two rows. “user1” has data only in the “Bio” column family, whereas “user2” has data in both the “Bio” and “Education” column families. This shows that a row need not have data in every column family; that is up to the application. Within each column family, an arbitrary set of columns can be defined for each row. For example, “user1” has “name” and “age” columns within the “Bio” column family, whereas “user2” has “name” and “profession” columns in the same column family.

Cassandra also has the concept of super columns. Super columns are columns that themselves have subcolumns, so they can be thought of as small column families in their own right. Using super columns, it is possible to nest data one level deeper within a column family: a super column maps subcolumn names to values, and the subcolumns themselves are ordinary columns. This allows more complex data structures to be modeled in Cassandra. As discussed above, Cassandra is a key-value store at its core. Hence, it is best to think of the Cassandra model as a nested map data structure. This model will be illustrated in the following examples.
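As a rough picture of the nested-map view, the sketch below represents one row whose column family holds super columns. The names "Contact", "home", "work", their subcolumns, and the helper lookup are hypothetical; this is plain Python, not a Cassandra client API.

```python
# Row key -> column family -> super column -> subcolumn -> value.
# "Contact", "home", "work" and their subcolumns are hypothetical names.
store = {
    "user1": {
        "Contact": {
            "home": {"city": "Lahore", "phone": "555-0100"},
            "work": {"city": "Karachi"},
        },
    },
}


def lookup(data, *path):
    """Walk the nested maps along path; return None if any step is missing."""
    node = data
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node
```

A call such as lookup(store, "user1", "Contact", "home", "city") reads a single subcolumn, while a path that names a missing subcolumn yields None instead of raising an error.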

Example data models

To make the data model concrete, consider a few “real world” examples taken from a microblogging platform. We start by modeling the data for users. In traditional SQL, a table would be created with the following fields (in the simplest case):

1) User Id
2) Username
3) Password

These fields can be modeled in Cassandra using a “User” column family, which contains the above fields as columns and is keyed by the user id. Next comes the data about a post itself: title, content, and author. Again, this can be modeled using a “Post” column family, keyed by a unique post id, with columns for title, content, and user id (the author). Thus, for a user who has authored one post, the Cassandra data may look like:
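One possible shape for this data, sketched as nested Python dictionaries. The keys "user1" and "post1", all column values, and the helper posts_by are assumptions for illustration only.

```python
# Column family -> row key -> column name -> column value.
# Row keys and values are hypothetical placeholders.
data = {
    "User": {
        "user1": {"username": "alice", "password": "<hashed password>"},
    },
    "Post": {
        "post1": {
            "title": "Hello",
            "content": "My first post",
            "user_id": "user1",  # refers to a row key in "User"
        },
    },
}


def posts_by(user_id):
    """Scan the "Post" column family for posts authored by user_id."""
    return [
        post_id
        for post_id, columns in data["Post"].items()
        if columns["user_id"] == user_id
    ]
```

Note that posts_by has to scan every post, because in this layout the only direct lookup is by post id. Avoiding that scan would take an additional column family keyed by user id.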

Next comes the issue of modeling the ‘following’ and ‘followers’ relationships. Again, these can be modeled using a column family, “Following”. This column family is keyed by user id, and each row contains columns that record the other user ids in the relationship. For example, if there are three users, the relationships between them can be defined using:
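One possible layout, sketched in Python. It assumes that each column name in a row is the id of a user who follows the row's user, and that the column values are left empty; the helper functions are hypothetical.

```python
# "Following" column family: row key -> columns.
# Assumption: each column name is the id of a user who follows the row's
# user, and column values are unused (empty strings).
following = {
    "user1": {"user2": "", "user3": ""},
    "user2": {"user1": ""},
    "user3": {"user2": ""},
}


def followers_of(user_id):
    """Direct lookup by row key: who follows user_id."""
    return sorted(following.get(user_id, {}))


def followed_by(user_id):
    """Reverse lookup: whom user_id follows, found by scanning every row."""
    return sorted(key for key, columns in following.items() if user_id in columns)
```

In this layout followers_of is a single row read, whereas followed_by must scan all rows. Storing the reverse relationship in a second column family would trade extra writes for a direct read.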

From this model we can readily see that “user1” is followed by “user2” and “user3”, “user2” is followed by “user1”, and “user3” is followed by “user2”.

Conclusion

In this article we covered the basics of storing data in Cassandra’s BigTable-inspired model. We learned about the components of the model, including keys, columns, column families, and super columns. To summarize the discussion, the following table gives a rough analogy between the relational model and the Cassandra model.

Relational Model        Cassandra Model
Database                Keyspace
Table                   Column Family
Primary Key             Key
Column name             Column name
Column value (field)    Column value


About the authors

Salman Ul Haq is a techpreneur, co-founder and CEO of TunaCode, Inc., a startup that delivers GPU-accelerated computing solutions to time-critical application domains. He holds a degree in Computer Systems Engineering. His current focus is on delivering the right solution for cloud security. He can be reached at salman@tunacode.com.
Shaneeb Kamran is a Computer Engineer from one of the leading universities of Pakistan. His programming journey started at the age of 12, and ever since he has dabbled in every new and shiny software technology he could get his hands on. He is currently involved in a startup that is working on cloud computing products.
