You are previewing MongoDB: The Definitive Guide.

MongoDB: The Definitive Guide

Cover of MongoDB: The Definitive Guide by Michael Dirolf... Published by O'Reilly Media, Inc.
  1. MongoDB: The Definitive Guide
  2. Foreword
  3. Preface
    1. How This Book Is Organized
      1. Getting Up to Speed with MongoDB
      2. Developing with MongoDB
      3. Advanced Usage
      4. Administration
      5. Developing Applications with MongoDB
      6. Appendixes
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
      1. Acknowledgments from Kristina
      2. Acknowledgments from Michael
  4. 1. Introduction
    1. A Rich Data Model
    2. Easy Scaling
    3. Tons of Features…
    4. …Without Sacrificing Speed
    5. Simple Administration
    6. But Wait, That’s Not All…
  5. 2. Getting Started
    1. Documents
    2. Collections
      1. Schema-Free
      2. Naming
    3. Databases
    4. Getting and Starting MongoDB
    5. MongoDB Shell
      1. Running the Shell
      2. A MongoDB Client
      3. Basic Operations with the Shell
      4. Tips for Using the Shell
    6. Data Types
      1. Basic Data Types
      2. Numbers
      3. Dates
      4. Arrays
      5. Embedded Documents
      6. _id and ObjectIds
  6. 3. Creating, Updating, and Deleting Documents
    1. Inserting and Saving Documents
      1. Batch Insert
      2. Inserts: Internals and Implications
    2. Removing Documents
      1. Remove Speed
    3. Updating Documents
      1. Document Replacement
      2. Using Modifiers
      3. Upserts
      4. Updating Multiple Documents
      5. Returning Updated Documents
    4. The Fastest Write This Side of Mississippi
      1. Safe Operations
      2. Catching “Normal” Errors
    5. Requests and Connections
  7. 4. Querying
    1. Introduction to find
      1. Specifying Which Keys to Return
      2. Limitations
    2. Query Criteria
      1. Query Conditionals
      2. OR Queries
      3. $not
      4. Rules for Conditionals
    3. Type-Specific Queries
      1. null
      2. Regular Expressions
      3. Querying Arrays
      4. Querying on Embedded Documents
    4. $where Queries
    5. Cursors
      1. Limits, Skips, and Sorts
      2. Avoiding Large Skips
      3. Advanced Query Options
      4. Getting Consistent Results
    6. Cursor Internals
  8. 5. Indexing
    1. Introduction to Indexing
      1. Scaling Indexes
      2. Indexing Keys in Embedded Documents
      3. Indexing for Sorts
      4. Uniquely Identifying Indexes
    2. Unique Indexes
      1. Dropping Duplicates
      2. Compound Unique Indexes
    3. Using explain and hint
    4. Index Administration
      1. Changing Indexes
    5. Geospatial Indexing
      1. Compound Geospatial Indexes
      2. The Earth Is Not a 2D Plane
  9. 6. Aggregation
    1. count
    2. distinct
    3. group
      1. Using a Finalizer
      2. Using a Function as a Key
    4. MapReduce
      1. Example 1: Finding All Keys in a Collection
      2. Example 2: Categorizing Web Pages
      3. MongoDB and MapReduce
  10. 7. Advanced Topics
    1. Database Commands
      1. How Commands Work
      2. Command Reference
    2. Capped Collections
      1. Properties and Use Cases
      2. Creating Capped Collections
      3. Sorting Au Naturel
      4. Tailable Cursors
    3. GridFS: Storing Files
      1. Getting Started with GridFS: mongofiles
      2. Working with GridFS from the MongoDB Drivers
      3. Under the Hood
    4. Server-Side Scripting
      1. db.eval
      2. Stored JavaScript
      3. Security
    5. Database References
      1. What Is a DBRef?
      2. Example Schema
      3. Driver Support for DBRefs
      4. When Should DBRefs Be Used?
  11. 8. Administration
    1. Starting and Stopping MongoDB
      1. Starting from the Command Line
      2. File-Based Configuration
      3. Stopping MongoDB
    2. Monitoring
      1. Using the Admin Interface
      2. serverStatus
      3. mongostat
      4. Third-Party Plug-Ins
    3. Security and Authentication
      1. Authentication Basics
      2. How Authentication Works
      3. Other Security Considerations
    4. Backup and Repair
      1. Data File Backup
      2. mongodump and mongorestore
      3. fsync and Lock
      4. Slave Backups
      5. Repair
  12. 9. Replication
    1. Master-Slave Replication
      1. Options
      2. Adding and Removing Sources
    2. Replica Sets
      1. Initializing a Set
      2. Nodes in a Replica Set
      3. Failover and Primary Election
    3. Performing Operations on a Slave
      1. Read Scaling
      2. Using Slaves for Data Processing
    4. How It Works
      1. The Oplog
      2. Syncing
      3. Replication State and the Local Database
      4. Blocking for Replication
    5. Administration
      1. Diagnostics
      2. Changing the Oplog Size
      3. Replication with Authentication
  13. 10. Sharding
    1. Introduction to Sharding
    2. Autosharding in MongoDB
      1. When to Shard
    3. The Key to Sharding: Shard Keys
      1. Sharding an Existing Collection
      2. Incrementing Shard Keys Versus Random Shard Keys
      3. How Shard Keys Affect Operations
    4. Setting Up Sharding
      1. Starting the Servers
      2. Sharding Data
    5. Production Configuration
      1. A Robust Config
      2. Many mongos
      3. A Sturdy Shard
      4. Physical Servers
    6. Sharding Administration
      1. config Collections
      2. Sharding Commands
  14. 11. Example Applications
    1. Chemical Search Engine: Java
      1. Installing the Java Driver
      2. Using the Java Driver
      3. Schema Design
      4. Writing This in Java
      5. Issues
    2. News Aggregator: PHP
      1. Installing the PHP Driver
      2. Using the PHP Driver
      3. Designing the News Aggregator
      4. Trees of Comments
      5. Voting
    3. Custom Submission Forms: Ruby
      1. Installing the Ruby Driver
      2. Using the Ruby Driver
      3. Custom Form Submission
      4. Ruby Object Mappers and Using MongoDB with Rails
    4. Real-Time Analytics: Python
      1. Installing PyMongo
      2. Using PyMongo
      3. MongoDB for Real-Time Analytics
      4. Schema
      5. Handling a Request
      6. Using Analytics Data
      7. Other Considerations
  15. A. Installing MongoDB
    1. Choosing a Version
    2. Windows Install
      1. Installing as a Service
    3. POSIX (Linux, Mac OS X, and Solaris) Install
      1. Installing from a Package Manager
  16. B. mongo: The Shell
    1. Shell Utilities
  17. C. MongoDB Internals
    1. BSON
    2. Wire Protocol
    3. Data Files
    4. Namespaces and Extents
    5. Memory-Mapped Storage Engine
  18. Index
  19. About the Authors
  20. Colophon
  21. Copyright
O'Reilly logo

Chapter 4. Querying

This chapter looks at querying in detail. The main areas covered are as follows:

  • You can perform ad hoc queries on the database using the find or findOne functions and a query document.

  • You can query for ranges, set inclusion, inequalities, and more by using $ conditionals.

  • Some queries cannot be expressed as query documents, even using $ conditionals. For these types of complex queries, you can use a $where clause to harness the full expressive power of JavaScript.

  • Queries return a database cursor, which lazily returns batches of documents as you need them.

  • There are a lot of metaoperations you can perform on a cursor, including skipping a certain number of results, limiting the number of results returned, and sorting results.

Introduction to find

The find method is used to perform queries in MongoDB. Querying returns a subset of documents in a collection, from no documents at all to the entire collection. Which documents get returned is determined by the first argument to find, which is a document specifying the query to be performed.

An empty query document (i.e., {}) matches everything in the collection. If find isn’t given a query document, it defaults to {}. For example, the following:

> db.c.find()

returns everything in the collection c.

When we start adding key/value pairs to the query document, we begin restricting our search. This works in a straightforward way for most types. Integers match integers, booleans match booleans, and strings match strings. Querying for a simple type is as easy as specifying the value that you are looking for. For example, to find all documents where the value for "age" is 27, we can add that key/value pair to the query document:

> db.users.find({"age" : 27})

If we have a string we want to match, such as a "username" key with the value "joe", we use that key/value pair instead:

> db.users.find({"username" : "joe"})

Multiple conditions can be strung together by adding more key/value pairs to the query document, which gets interpreted as “condition1 AND condition2 AND … AND conditionN.” For instance, to get all users who are 27-year-olds with the username “joe,” we can query for the following:

> db.users.find({"username" : "joe", "age" : 27})

Specifying Which Keys to Return

Sometimes, you do not need all of the key/value pairs in a document returned. If this is the case, you can pass a second argument to find (or findOne) specifying the keys you want. This reduces both the amount of data sent over the wire and the time and memory used to decode documents on the client side.

For example, if you have a user collection and you are interested only in the "username" and "email" keys, you could return just those keys with the following query:

> db.users.find({}, {"username" : 1, "email" : 1})
{
    "_id" : ObjectId("4ba0f0dfd22aa494fd523620"),
    "username" : "joe",
    "email" : "joe@example.com"
}

As you can see from the previous output, the "_id" key is always returned, even if it isn’t specifically listed.

You can also use this second parameter to exclude specific key/value pairs from the results of a query. For instance, you may have documents with a variety of keys, and the only thing you know is that you never want to return the "fatal_weakness" key:

> db.users.find({}, {"fatal_weakness" : 0})

This can even prevent "_id" from being returned:

> db.users.find({}, {"username" : 1, "_id" : 0})
{
    "username" : "joe",
}

Limitations

There are some restrictions on queries. The value of a query document must be a constant as far as the database is concerned. (It can be a normal variable in your own code.) That is, it cannot refer to the value of another key in the document. For example, if we were keeping inventory and we had both "in_stock" and "num_sold" keys, we could compare their values by querying the following:

> db.stock.find({"in_stock" : "this.num_sold"}) // doesn't work

There are ways to do this (see $where Queries), but you will usually get better performance by restructuring your document slightly, such that a normal query will suffice. In this example, we could instead use the keys "initial_stock" and "in_stock". Then, every time someone buys an item, we decrement the value of the "in_stock" key by one. Finally, we can do a simple query to check which items are out of stock:

> db.stock.find({"in_stock" : 0})

Query Criteria

Queries can go beyond the exact matching described in the previous section; they can match more complex criteria, such as ranges, OR-clauses, and negation.

Query Conditionals

"$lt", "$lte", "$gt", and "$gte" are all comparison operators, corresponding to <, <=, >, and >=, respectively. They can be combined to look for a range of values. For example, to look for users who are between the ages of 18 and 30 inclusive, we can do this:

> db.users.find({"age" : {"$gte" : 18, "$lte" : 30}})

These types of range queries are often useful for dates. For example, to find people who registered before January 1, 2007, we can do this:

> start = new Date("01/01/2007")
> db.users.find({"registered" : {"$lt" : start}})

An exact match on a date is less useful, because dates are only stored with millisecond precision. Often you want a whole day, week, or month, making a range query necessary.

To query for documents where a key’s value is not equal to a certain value, you must use another conditional operator, "$ne", which stands for “not equal.” If you want to find all users who do not have the username “joe,” you can query for them using this:

> db.users.find({"username" : {"$ne" : "joe"}})

"$ne" can be used with any type.

OR Queries

There are two ways to do an OR query in MongoDB. "$in" can be used to query for a variety of values for a single key. "$or" is more general; it can be used to query for any of the given values across multiple keys.

If you have more than one possible value to match for a single key, use an array of criteria with "$in". For instance, suppose we were running a raffle and the winning ticket numbers were 725, 542, and 390. To find all three of these documents, we can construct the following query:

> db.raffle.find({"ticket_no" : {"$in" : [725, 542, 390]}})

"$in" is very flexible and allows you to specify criteria of different types as well as values. For example, if we are gradually migrating our schema to use usernames instead of user ID numbers, we can query for either by using this:

> db.users.find({"user_id" : {"$in" : [12345, "joe"]})

This matches documents with a "user_id" equal to 12345, and documents with a "user_id" equal to "joe".

If "$in" is given an array with a single value, it behaves the same as directly matching the value. For instance, {ticket_no : {$in : [725]}} matches the same documents as {ticket_no : 725}.

The opposite of "$in" is "$nin", which returns documents that don’t match any of the criteria in the array. If we want to return all of the people who didn’t win anything in the raffle, we can query for them with this:

> db.raffle.find({"ticket_no" : {"$nin" : [725, 542, 390]}})

This query returns everyone who did not have tickets with those numbers.

"$in" gives you an OR query for a single key, but what if we need to find documents where "ticket_no" is 725 or "winner" is true? For this type of query, we’ll need to use the "$or" conditional. "$or" takes an array of possible criteria. In the raffle case, using "$or" would look like this:

> db.raffle.find({"$or" : [{"ticket_no" : 725}, {"winner" : true}]})

"$or" can contain other conditionals. If, for example, we want to match any of the three "ticket_no" values or the "winner" key, we can use this:

> db.raffle.find({"$or" : [{"ticket_no" : {"$in" : [725, 542, 390]}}, 
                           {"winner" : true}]})

With a normal AND-type query, you want to narrow your results down as far as possible in as few arguments as possible. OR-type queries are the opposite: they are most efficient if the first arguments match as many documents as possible.

$not

"$not" is a metaconditional: it can be applied on top of any other criteria. As an example, let’s consider the modulus operator, "$mod". "$mod" queries for keys whose values, when divided by the first value given, have a remainder of the second value:

> db.users.find({"id_num" : {"$mod" : [5, 1]}})

The previous query returns users with "id_num"s of 1, 6, 11, 16, and so on. If we want, instead, to return users with "id_num"s of 2, 3, 4, 5, 7, 8, 9, 10, 12, and so on, we can use "$not":

> db.users.find({"id_num" : {"$not" : {"$mod" : [5, 1]}}})

"$not" can be particularly useful in conjunction with regular expressions to find all documents that don’t match a given pattern (regular expression usage is described in the section Regular Expressions).

Rules for Conditionals

If you look at the update modifiers in the previous chapter and previous query documents, you’ll notice that the $-prefixed keys are in different positions. In the query, "$lt" is in the inner document; in the update, "$inc" is the key for the outer document. This generally holds true: conditionals are an inner document key, and modifiers are always a key in the outer document.

Multiple conditions can be put on a single key. For example, to find all users between the ages of 20 and 30, we can query for both "$gt" and "$lt" on the "age" key:

> db.users.find({"age" : {"$lt" : 30, "$gt" : 20}})

Any number of conditionals can be used with a single key. Multiple update modifiers cannot be used on a single key, however. For example, you cannot have a modifier document such as {"$inc" : {"age" : 1}, "$set" : {age : 40}} because it modifies "age" twice. With query conditionals, no such rule applies.

Type-Specific Queries

As covered in Chapter 2, MongoDB has a wide variety of types that can be used in a document. Some of these behave specially in queries.

null

null behaves a bit strangely. It does match itself, so if we have a collection with the following documents:

> db.c.find()
{ "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null }
{ "_id" : ObjectId("4ba0f0dfd22aa494fd523622"), "y" : 1 }
{ "_id" : ObjectId("4ba0f148d22aa494fd523623"), "y" : 2 }

we can query for documents whose "y" key is null in the expected way:

> db.c.find({"y" : null})
{ "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null }

However, null not only matches itself but also matches “does not exist.” Thus, querying for a key with the value null will return all documents lacking that key:

> db.c.find({"z" : null})
{ "_id" : ObjectId("4ba0f0dfd22aa494fd523621"), "y" : null }
{ "_id" : ObjectId("4ba0f0dfd22aa494fd523622"), "y" : 1 }
{ "_id" : ObjectId("4ba0f148d22aa494fd523623"), "y" : 2 }

If we only want to find keys whose value is null, we can check that the key is null and exists using the "$exists" conditional:

> db.c.find({"z" : {"$in" : [null], "$exists" : true}})

Unfortunately, there is no "$eq" operator, which makes this a little awkward, but "$in" with one element is equivalent.

Regular Expressions

Regular expressions are useful for flexible string matching. For example, if we want to find all users with the name Joe or joe, we can use a regular expression to do case-insensitive matching:

> db.users.find({"name" : /joe/i})

Regular expression flags (i) are allowed but not required. If we want to match not only various capitalizations of joe, but also joey, we can continue to improve our regular expression:

> db.users.find({"name" : /joey?/i})

MongoDB uses the Perl Compatible Regular Expression (PCRE) library to match regular expressions; any regular expression syntax allowed by PCRE is allowed in MongoDB. It is a good idea to check your syntax with the JavaScript shell before using it in a query to make sure it matches what you think it matches.

Note

MongoDB can leverage an index for queries on prefix regular expressions (e.g., /^joey/), so queries of that kind can be fast.

Regular expressions can also match themselves. Very few people insert regular expressions into the database, but if you insert one, you can match it with itself:

> db.foo.insert({"bar" : /baz/})
> db.foo.find({"bar" : /baz/})
{
    "_id" : ObjectId("4b23c3ca7525f35f94b60a2d"),
    "bar" : /baz/
}

Querying Arrays

Querying for elements of an array is simple. An array can mostly be treated as though each element is the value of the overall key. For example, if the array is a list of fruits, like this:

> db.food.insert({"fruit" : ["apple", "banana", "peach"]})

the following query:

> db.food.find({"fruit" : "banana"})

will successfully match the document. We can query for it in much the same way as though we had a document that looked like the (illegal) document: {"fruit" : "apple", "fruit" : "banana", "fruit" : "peach"}.

$all

If you need to match arrays by more than one element, you can use "$all". This allows you to match a list of elements. For example, suppose we created a collection with three elements:

> db.food.insert({"_id" : 1, "fruit" : ["apple", "banana", "peach"]})
> db.food.insert({"_id" : 2, "fruit" : ["apple", "kumquat", "orange"]})
> db.food.insert({"_id" : 3, "fruit" : ["cherry", "banana", "apple"]})

Then we can find all documents with both "apple" and "banana" elements by querying with "$all":

> db.food.find({fruit : {$all : ["apple", "banana"]}})
      {"_id" : 1, "fruit" : ["apple", "banana", "peach"]}
      {"_id" : 3, "fruit" : ["cherry", "banana", "apple"]}

Order does not matter. Notice "banana" comes before "apple" in the second result. Using a one-element array with "$all" is equivalent to not using "$all". For instance, {fruit : {$all : ['apple']} will match the same documents as {fruit : 'apple'}.

You can also query by exact match using the entire array. However, exact match will not match a document if any elements are missing or superfluous. For example, this will match the first document shown previously:

> db.food.find({"fruit" : ["apple", "banana", "peach"]})

But this will not:

> db.food.find({"fruit" : ["apple", "banana"]})

and neither will this:

> db.food.find({"fruit" : ["banana", "apple", "peach"]})

If you want to query for a specific element of an array, you can specify an index using the syntax key.index:

> db.food.find({"fruit.2" : "peach"}) 

Arrays are always 0-indexed, so this would match the third array element against the string "peach".

$size

A useful conditional for querying arrays is "$size", which allows you to query for arrays of a given size. Here’s an example:

> db.food.find({"fruit" : {"$size" : 3}})

One common query is to get a range of sizes. "$size" cannot be combined with another $ conditional (in this example, "$gt"), but this query can be accomplished by adding a "size" key to the document. Then, every time you add an element to the array, increment the value of "size". If the original update looked like this:

> db.food.update({"$push" : {"fruit" : "strawberry"}})

it can simply be changed to this:

> db.food.update({"$push" : {"fruit" : "strawberry"}, "$inc" : {"size" : 1}})

Incrementing is extremely fast, so any performance penalty is negligible. Storing documents like this allows you to do queries such as this:

> db.food.find({"size" : {"$gt" : 3}})

Unfortunately, this technique doesn’t work as well with the "$addToSet" operator.

The $slice operator

As mentioned earlier in this chapter, the optional second argument to find specifies the keys to be returned. The special "$slice" operator can be used to return a subset of elements for an array key.

For example, suppose we had a blog post document and we wanted to return the first 10 comments:

> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : 10}})

Alternatively, if we wanted the last 10 comments, we could use -10:

> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -10}})

"$slice" can also return pages in the middle of the results by taking an offset and the number of elements to return:

> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : [23, 10]}})

This would skip the first 23 elements and return the 24th through 34th. If there are fewer than 34 elements in the array, it will return as many as possible.

Unless otherwise specified, all keys in a document are returned when "$slice" is used. This is unlike the other key specifiers, which suppress unmentioned keys from being returned. For instance, if we had a blog post document that looked like this:

{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "joe",
            "email" : "joe@example.com",
            "content" : "nice post."
        },
        {
            "name" : "bob",
            "email" : "bob@example.com",
            "content" : "good post."
        }
    ]
}

and we did a "$slice" to get the last comment, we’d get this:

> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -1}})
{
    "_id" : ObjectId("4b2d75476cc613d5ee930164"),
    "title" : "A blog post",
    "content" : "...",
    "comments" : [
        {
            "name" : "bob",
            "email" : "bob@example.com",
            "content" : "good post."
        }
    ]
}

Both "title" and "content" are still returned, even though they weren’t explicitly included in the key specifier.

Querying on Embedded Documents

There are two ways of querying for an embedded document: querying for the whole document or querying for its individual key/value pairs.

Querying for an entire embedded document works identically to a normal query. For example, if we have a document that looks like this:

{
    "name" : {
        "first" : "Joe",
        "last" : "Schmoe"
    },
    "age" : 45
}

we can query for someone named Joe Schmoe with the following:

> db.people.find({"name" : {"first" : "Joe", "last" : "Schmoe"}})

However, if Joe decides to add a middle name key, suddenly this query won’t work anymore; it doesn’t match the entire embedded document! This type of query is also order-sensitive; {"last" : "Schmoe", "first" : "Joe"} would not be a match.

If possible, it’s usually a good idea to query for just a specific key or keys of an embedded document. Then, if your schema changes, all of your queries won’t suddenly break because they’re no longer exact matches. You can query for embedded keys using dot-notation:

> db.people.find({"name.first" : "Joe", "name.last" : "Schmoe"})

Now, if Joe adds more keys, this query will still match his first and last names.

This dot-notation is the main difference between query documents and other document types. Query documents can contain dots, which mean “reach into an embedded document.” Dot-notation is also the reason that documents to be inserted cannot contain the . character. Oftentimes people run into this limitation when trying to save URLs as keys. One way to get around it is to always perform a global replace before inserting or after retrieving, substituting a character that isn’t legal in URLs for the dot (.) character.

Embedded document matches can get a little tricky as the document structure gets more complicated. For example, suppose we are storing blog posts and we want to find comments by Joe that were scored at least a five. We could model the post as follows:

> db.blog.find()
{
    "content" : "...",
    "comments" : [
        {
            "author" : "joe",
            "score" : 3,
            "comment" : "nice post"
        },
        {
            "author" : "mary",
            "score" : 6,
            "comment" : "terrible post"
        }
    ]
}

Now, we can’t query using db.blog.find({"comments" : {"author" : "joe", "score" : {"$gte" : 5}}}). Embedded document matches have to match the whole document, and this doesn’t match the "comment" key. It also wouldn’t work to do db.blog.find({"comments.author" : "joe", "comments.score" : {"$gte" : 5}}), because the author criteria could match a different comment than the score criteria. That is, it would return the document shown earlier; it would match "author" : "joe" in the first comment and "score" : 6 in the second comment.

To correctly group criteria without needing to specify every key, use "$elemMatch". This vaguely named conditional allows you to partially specify criteria to match a single embedded document in an array. The correct query looks like this:

> db.blog.find({"comments" : {"$elemMatch" : {"author" : "joe", 
                                              "score" : {"$gte" : 5}}}})

"$elemMatch" allows us to “group” our criteria. As such, it’s only needed when you have more than one key you want to match on in an embedded document.

$where Queries

Key/value pairs are a fairly expressive way to query, but there are some queries that they cannot represent. For queries that cannot be done any other way, there are "$where" clauses, which allow you to execute arbitrary JavaScript as part of your query. This allows you to do (almost) anything within a query.

The most common case for this is wanting to compare the values for two keys in a document, for instance, if we had a list of items and wanted to return documents where any two of the values are equal. Here’s an example:

> db.foo.insert({"apple" : 1, "banana" : 6, "peach" : 3})
> db.foo.insert({"apple" : 8, "spinach" : 4, "watermelon" : 4})

In the second document, "spinach" and "watermelon" have the same value, so we’d like that document returned. It’s unlikely MongoDB will ever have a $ conditional for this, so we can use a "$where" clause to do it with JavaScript:

> db.foo.find({"$where" : function () {
... for (var current in this) {
...     for (var other in this) {
...         if (current != other && this[current] == this[other]) {
...             return true;
...         }
...     }
... }
... return false;
... }});

If the function returns true, the document will be part of the result set; if it returns false, it won’t be.

We used a function earlier, but you can also use strings to specify a "$where" query; the following two "$where" queries are equivalent:

> db.foo.find({"$where" : "this.x + this.y == 10"})
> db.foo.find({"$where" : "function() { return this.x + this.y == 10; }"})

"$where" queries should not be used unless strictly necessary: they are much slower than regular queries. Each document has to be converted from BSON to a JavaScript object and then run through the "$where" expression. Indexes cannot be used to satisfy a "$where", either. Hence, you should use "$where" only when there is no other way of doing the query. You can cut down on the penalty by using other query filters in combination with "$where". If possible, an index will be used to filter based on the non-$where clauses; the "$where" expression will be used only to fine-tune the results.

Another way of doing complex queries is to use MapReduce, which is covered in the next chapter.

Cursors

The database returns results from find using a cursor. The client-side implementations of cursors generally allow you to control a great deal about the eventual output of a query. You can limit the number of results, skip over some number of results, sort results by any combination of keys in any direction, and perform a number of other powerful operations.

To create a cursor with the shell, put some documents into a collection, do a query on them, and assign the results to a local variable (variables defined with "var" are local). Here, we create a very simple collection and query it, storing the results in the cursor variable:

> for(i=0; i<100; i++) {
... db.c.insert({x : i});
... }
> var cursor = db.collection.find();

The advantage of doing this is that you can look at one result at a time. If you store the results in a global variable or no variable at all, the MongoDB shell will automatically iterate through and display the first couple of documents. This is what we’ve been seeing up until this point, and it is often the behavior you want for seeing what’s in a collection but not for doing actual programming with the shell.

To iterate through the results, you can use the next method on the cursor. You can use hasNext to check whether there is another result. A typical loop through results looks like the following:

> while (cursor.hasNext()) {
... obj = cursor.next();
... // do stuff
... }

cursor.hasNext() checks that the next result exists, and cursor.next() fetches it.

The cursor class also implements the iterator interface, so you can use it in a forEach loop:

> var cursor = db.people.find();
> cursor.forEach(function(x) {
... print(x.name);
... });
adam
matt
zak

When you call find, the shell does not query the database immediately. It waits until you actually start requesting results to send the query, which allows you to chain additional options onto a query before it is performed. Almost every method on a cursor object returns the cursor itself so that you can chain them in any order. For instance, all of the following are equivalent:

> var cursor = db.foo.find().sort({"x" : 1}).limit(1).skip(10);
> var cursor = db.foo.find().limit(1).sort({"x" : 1}).skip(10);
> var cursor = db.foo.find().skip(10).limit(1).sort({"x" : 1});

At this point, the query has not been executed yet. All of these functions merely build the query. Now, suppose we call the following:

> cursor.hasNext()

At this point, the query will be sent to the server. The shell fetches the first 100 results or first 4MB of results (whichever is smaller) at once so that the next calls to next or hasNext will not have to make trips to the server. After the client has run through the first set of results, the shell will again contact the database and ask for more results. This process continues until the cursor is exhausted and all results have been returned.

Limits, Skips, and Sorts

The most common query options are limiting the number of results returned, skipping a number of results, and sorting. All of these options must be added before a query is sent to the database.

To set a limit, chain the limit function onto your call to find. For example, to only return three results, use this:

> db.c.find().limit(3)

If there are fewer than three documents matching your query in the collection, only the number of matching documents will be returned; limit sets an upper limit, not a lower limit.

skip works similarly to limit:

> db.c.find().skip(3)

This will skip the first three matching documents and return the rest of the matches. If there are less than three documents in your collection, it will not return any documents.

sort takes an object: a set of key/value pairs where the keys are key names and the values are the sort directions. Sort direction can be 1 (ascending) or -1 (descending). If multiple keys are given, the results will be sorted in that order. For instance, to sort the results by "username" ascending and "age" descending, we do the following:

> db.c.find().sort({username : 1, age : -1})

These three methods can be combined. This is often handy for pagination. For example, suppose that you are running an online store and someone searches for mp3. If you want 50 results per page sorted by price from high to low, you can do the following:

> db.stock.find({"desc" : "mp3"}).limit(50).sort({"price" : -1})

If they click Next Page to see more results, you can simply add a skip to the query, which will skip over the first 50 matches (which the user already saw on page 1):

> db.stock.find({"desc" : "mp3"}).limit(50).skip(50).sort({"price" : -1})

However, large skips are not very performant, so there are suggestions on avoiding them in a moment.

Comparison order

MongoDB has a hierarchy as to how types compare. Sometimes you will have a single key with multiple types, for instance, integers and booleans, or strings and nulls. If you do a sort on a key with a mix of types, there is a predefined order that they will be sorted in. From least to greatest value, this ordering is as follows:

  1. Minimum value

  2. null

  3. Numbers (integers, longs, doubles)

  4. Strings

  5. Object/document

  6. Array

  7. Binary data

  8. Object ID

  9. Boolean

  10. Date

  11. Timestamp

  12. Regular expression

  13. Maximum value

Avoiding Large Skips

Using skip for a small number of documents is fine. For a large number of results, skip can be slow (this is true in nearly every database, not just MongoDB) and should be avoided. Usually you can build criteria into the documents themselves to avoid having to do large skips, or you can calculate the next query based on the result from the previous one.

Paginating results without skip

The easiest way to do pagination is to return the first page of results using limit and then return each subsequent page as an offset from the beginning.

> // do not use: slow for large skips
> var page1 = db.foo.find(criteria).limit(100)
> var page2 = db.foo.find(criteria).skip(100).limit(100)
> var page3 = db.foo.find(criteria).skip(200).limit(100)
...

However, depending on your query, you can usually find a way to paginate without skips. For example, suppose we want to display documents in descending order based on "date". We can get the first page of results with the following:

> var page1 = db.foo.find().sort({"date" : -1}).limit(100)

Then, we can use the "date" value of the last document as the criteria for fetching the next page:

var latest = null;

// display first page
while (page1.hasNext()) {
   latest = page1.next();
   display(latest);
}

// get next page
var page2 = db.foo.find({"date" : {"$gt" : latest.date}});
page2.sort({"date" : -1}).limit(100);

Now the query does not need to include a skip.

Finding a random document

One fairly common problem is how to get a random document from a collection. The naive (and slow) solution is to count the number of documents and then do a find, skipping a random number of documents between 0 and the size of the collection.

> // do not use
> var total = db.foo.count()
> var random = Math.floor(Math.random()*total)
> db.foo.find().skip(random).limit(1)

It is actually highly inefficient to get a random element this way: you have to do a count (which can be expensive if you are using criteria), and skipping large numbers of elements can be time-consuming.

It takes a little forethought, but if you know you’ll be looking up a random element on a collection, there’s a much more efficient way to do so. The trick is to add an extra random key to each document when it is inserted. For instance, if we’re using the shell, we could use the Math.random() function (which creates a random number between 0 and 1):

> db.people.insert({"name" : "joe", "random" : Math.random()})
> db.people.insert({"name" : "john", "random" : Math.random()})
> db.people.insert({"name" : "jim", "random" : Math.random()})

Now, when we want to find a random document from the collection, we can calculate a random number and use that as query criteria, instead of doing a skip:

> var random = Math.random()
> result = db.foo.findOne({"random" : {"$gt" : random}})

There is a slight chance that random will be greater than any of the "random" values in the collection, and no results will be returned. We can guard against this by simply returning a document in the other direction:

> if (result == null) {
... result = db.foo.findOne({"random" : {"$lt" : random}})
... }

If there aren’t any documents in the collection, this technique will end up returning null, which makes sense.

This technique can be used with arbitrarily complex queries; just make sure to have an index that includes the random key. For example, if we want to find a random plumber in California, we can create an index on "profession", "state", and "random":

> db.people.ensureIndex({"profession" : 1, "state" : 1, "random" : 1})

This allows us to quickly find a random result (see Chapter 5 for more information on indexing).

Advanced Query Options

There are two types of queries: wrapped and plain. A plain query is something like this:

> var cursor = db.foo.find({"foo" : "bar"})

There are a couple options that “wrap” the query. For example, suppose we perform a sort:

> var cursor = db.foo.find({"foo" : "bar"}).sort({"x" : 1})

Instead of sending {"foo" : "bar"} to the database as the query, the query gets wrapped in a larger document. The shell converts the query from {"foo" : "bar"} to {"$query" : {"foo" : "bar"}, "$orderby" : {"x" : 1}}.

Most drivers provide helpers for adding arbitrary options to queries. Other helpful options include the following:

$maxscan : integer

Specify the maximum number of documents that should be scanned for the query.

$min : document

Start criteria for querying.

$max : document

End criteria for querying.

$hint : document

Tell the server which index to use for the query.

$explain : boolean

Get an explanation of how the query will be executed (indexes used, number of results, how long it takes, etc.), instead of actually doing the query.

$snapshot : boolean

Ensure that the query’s results will be a consistent snapshot from the point in time when the query was executed. See the next section for details.

Getting Consistent Results

A fairly common way of processing data is to pull it out of MongoDB, change it in some way, and then save it again:

cursor = db.foo.find();

while (cursor.hasNext()) {
    var doc = cursor.next();
    doc = process(doc);
    db.foo.save(doc);
}

This is fine for a small number of results, but it breaks down for large numbers of documents. To see why, imagine how the documents are actually being stored. You can picture a collection as a list of documents that looks something like Figure 4-1. Snowflakes represent documents, because every document is beautiful and unique.

A collection being queried

Figure 4-1. A collection being queried

Now, when we do a find, it starts returning results from the beginning of the collection and moves right. Your program grabs the first 100 documents and processes them. When you save them back to the database, if a document does not have the padding available to grow to its new size, like in Figure 4-2, it needs to be relocated. Usually, a document will be relocated to the end of a collection (Figure 4-3).

An enlarged document may not fit where it did before

Figure 4-2. An enlarged document may not fit where it did before

MongoDB relocates updated documents that don’t fit in their original position

Figure 4-3. MongoDB relocates updated documents that don’t fit in their original position

Now our program continues to fetch batches of documents. When it gets toward the end, it will return the relocated documents again (Figure 4-4)!

A cursor may return these relocated documents again in a later batch

Figure 4-4. A cursor may return these relocated documents again in a later batch

The solution to this problem is to snapshot your query. If you add the "$snapshot" option, the query will be run against an unchanging view of the collection. All queries that return a single batch of results are effectively snapshotted. Inconsistencies arise only when the collection changes under a cursor while it is waiting to get another batch of results.

Cursor Internals

There are two sides to a cursor: the client-facing cursor and the database cursor that the client-side one represents. We have been talking about the client-side one up until now, but we are going to take a brief look at what’s happening on the server side.

On the server side, a cursor takes up memory and resources. Once a cursor runs out of results or the client sends a message telling it to die, the database can free the resources it was using. Freeing these resources lets the database use them for other things, which is good, so we want to make sure that cursors can be freed quickly (within reason).

There are a couple of conditions that can cause the death (and subsequent cleanup) of a cursor. First, when a cursor finishes iterating through the matching results, it will clean itself up. Another way is that, when a cursor goes out of scope on the client side, the drivers send the database a special message to let it know that it can kill that cursor. Finally, even if the user hasn’t iterated through all the results and the cursor is still in scope, after 10 minutes of inactivity, a database cursor will automatically “die.”

This “death by timeout” is usually the desired behavior: very few applications expect their users to sit around for minutes at a time waiting for results. However, sometimes you might know that you need a cursor to last for a long time. In that case, many drivers have implemented a function called immortal, or a similar mechanism, which tells the database not to time out the cursor. If you turn off a cursor’s timeout, you must iterate through all of its results or make sure it gets closed. Otherwise, it will sit around in the database hogging resources forever.

The best content for your career. Discover unlimited learning on demand for around $1/day.