A guest post from Nigel Small, whose current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

The full release of Neo4j 2.0 will be something of a landmark for graph database fanatics. That’s not to say that other releases have not been significant, but this particular version is daring to break with its past and offer a smarter approach to database interaction.

As Neophiles will know, Cypher is Neo4j’s home-grown data query and modification language, and Cypher queries are based around matching paths within the graph. Traditionally, the paths supplied had to be anchored to one or more explicitly selected START nodes, but with version 2.0 this is no longer necessary. This has been made possible through the introduction of labels and schema indexes, which provide a way to tag and identify specific nodes outside of legacy indexes and the internal ID mechanism.

The Old Way: Category Nodes

To illustrate the new approach, let’s first look at the old way of doing things with an example graph:


Here, Alice has two friends, Bob and Carmen, and speaks both Italian and English. This graph also relies on the common IS_A relationship to classify the objects in the system against a category node.

Now, let’s say we want to find Alice’s friends. A naive query might look like this:
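Here is a minimal sketch of such a query in pre-2.0 Cypher. The legacy index name `People` and the `name` property are assumptions made for illustration, and the untyped `-->` pattern deliberately follows every outgoing relationship from Alice without checking what sits at the other end:

```cypher
// Look up Alice in a legacy index (the index name "People" is assumed)
START a=node:People(name="Alice")
// Follow any outgoing relationship, whatever its type
MATCH (a)-->(x)
RETURN x
```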

This query will happily identify Bob and Carmen as friends, but it will also pick up Italian and English, which obviously aren’t people and weren’t what we intended. We clearly need to make sure the friends we collect are actually people. So let’s extend our path to check for a relationship to the Person category node:
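One way to sketch that extended path, again assuming hypothetical legacy indexes named `People` and `Categories`, both keyed on a `name` property:

```cypher
// Anchor the query on Alice and on the Person category node
// (both index names are assumptions)
START a=node:People(name="Alice"),
      p=node:Categories(name="Person")
// Keep only the neighbours that are classified as a Person
MATCH (a)-->(x)-[:IS_A]->(p)
RETURN x
```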

This now works and we can identify the people. But to do so we’ve had to introduce extra relationship checks. What began as a straightforward query has now increased in complexity and at scale this could become a problem.

The New Way: Labels

Now let’s look at the same data, but instead of category nodes, we’ll use labels:


Labels have a similar syntax to relationship types and allow nodes to be classified cleanly without the need for extra categorization relationships. They can be thought of in much the same way as object types, and each node may have several labels applied.

So we can now carry out the same query, this time constraining our path with labels:
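A sketch of the label-based version might look like the following, assuming the nodes carry a `name` property as before:

```cypher
// No START clause: Alice is found through her label and property
MATCH (a:Person {name:"Alice"})-->(x:Person)
RETURN x
```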

This will make sure that only Bob and Carmen are selected as friends, leaving the languages untouched. If we wanted to, we could include the languages as well by simply omitting the Person label from node x.

You may have noticed that this new-style query is missing a START clause. Previously, the START clause was required and either referred to node IDs that were explicitly known, or looked up nodes in separately managed (now legacy) Lucene indexes. Now, instead, the MATCH clause can carry out implicit lookups as required using the labels and properties supplied. This has a huge benefit in creating both simpler graphs and cleaner queries.

But you might be wondering how a START-free query can be in any way performant when a large graph is in play. The answer comes from a new type of index: the schema index. Schema indexes are part of a broader set of schema features introduced in Neo4j 2.0, and each one is created against a label and a property key. Cypher makes use of these indexes to speed up the selection of existing nodes by those indexed criteria.

Unlike legacy indexes, however, schema indexes can be created and dropped from within Cypher. For instance, to create an index for the Person label and the name property key, you would execute the following statement:
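In the Neo4j 2.0 syntax, that statement takes roughly this form (later releases may spell it differently):

```cypher
// Index Person nodes by their name property
CREATE INDEX ON :Person(name)
```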

Similarly, to drop that index, you would run:
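Again in the 2.0 syntax, and assuming the index on `:Person(name)` described above already exists:

```cypher
// Remove the schema index on Person nodes' name property
DROP INDEX ON :Person(name)
```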

It’s important to note that schema indexes are eventually available. That’s to say, while they may not come to life immediately, the database engine will continue working in the background to bring them up to date. The same holds true when new data is added to the graph: existing indexes will be automatically synchronized.


So we’ve seen some improvements to how we query the graph. But let’s say we want to add some extra data. We want to add Bob’s friend, Dave, who also speaks only English. Our graph now looks like this:


To make this addition in earlier versions of Neo4j, we would need to perform an explicit lookup on our graph to find out where to attach this data, and we’d also need to take care to avoid duplication. A new Cypher keyword, MERGE, in combination with the CREATE UNIQUE clause, provides a way to add new data in a totally idempotent fashion. This means that we could run:
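A sketch of such a statement follows. The relationship types `KNOWS` and `SPEAKS` and the identifiers `b`, `d` and `e` are assumptions chosen for illustration, and English is given the `Language` label rather than `Person`:

```cypher
// Find or create each node by label and name
MERGE (b:Person {name:"Bob"})
MERGE (d:Person {name:"Dave"})
MERGE (e:Language {name:"English"})
// Add only the relationships that are not already present
CREATE UNIQUE (b)-[:KNOWS]->(d), (d)-[:SPEAKS]->(e)
```

Running this a second time should leave the graph unchanged, since every node and relationship it describes is already in place.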

MERGE looks for nodes that match the label and property value supplied. If such nodes do not already exist, they are created; otherwise, the existing ones are used. With these new or existing nodes, the CREATE UNIQUE statement can then insert the parts of the path that are not already present.

A MERGE statement can also take trailing clauses to define what happens when a match occurs or when a new node is created. Let’s say we want to maintain timestamps for when our nodes were created and last matched; we might run something like this:
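As a sketch, assuming the `ON CREATE SET` and `ON MATCH SET` form used by the final 2.0 release (milestone builds may use a slightly different spelling) and a hypothetical `created` property alongside `last_matched`:

```cypher
MERGE (b:Person {name:"Bob"})
// Runs only when the node had to be created
ON CREATE SET b.created = timestamp()
// Runs only when an existing node was matched
ON MATCH SET b.last_matched = timestamp()
RETURN b
```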

The first time around, Bob doesn’t exist, so a new node is created and a creation timestamp is applied. Subsequently, Bob will be matched and the last_matched timestamp will be updated.


So if you haven’t yet had a look at Neo4j 2.0, download a copy of the latest milestone, since you’re missing out on some great new features! Labels are the perfect replacement for category nodes and come with some vastly improved indexing capabilities. Cypher continues to make great leaps forward in usability, and it’s worth keeping an eye on the new MERGE feature as well. And I haven’t even mentioned the shiny new browser interface, which is not to be missed but will have to be the subject of another post!

About the author

Nigel Small began programming at an early age and has worked professionally in a variety of computing roles for over 15 years. His current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

One Response to “Neo4j 2.0: A Giant Leap For Graphkind”

  1. Brandon

    Since English is a Language and not a Person, perhaps the first merge adding Dave should read like this:

    MERGE (e:Language {name:'English'})