Chapter 1. To Embed or Reference

When building a new application, one of the first things you’ll often want to do is design its data model. In relational databases such as MySQL, this step is formalized in the process of normalization, which focuses on removing redundancy from a set of tables. MongoDB, unlike relational databases, stores its data in structured documents rather than fixed tables. For instance, relational tables typically require each row-column intersection to contain a single, scalar value. MongoDB BSON documents allow for more complex structure by supporting arrays of values (where each array may itself be composed of multiple subdocuments).

This chapter explores one of the options that MongoDB’s rich document model leaves open to you: the question of whether you should embed related objects within one another or reference them by ID. Here, you’ll learn how to weigh performance, flexibility, and complexity against one another as you make this decision.

Relational Data Modeling and Normalization

Before jumping into MongoDB’s approach to the question of embedding documents or linking documents, we’ll take a little detour into how you model certain types of relationships in relational (SQL) databases. In relational databases, data modeling typically progresses by modeling your data as a series of tables, consisting of rows and columns, which collectively define the schema of your data. Relational database theory has defined a number of ways of putting application data into tables, referred to as normal forms. Although a detailed discussion of relational modeling goes beyond the scope of this text, there are two forms that are of particular interest to us here: first normal form and third normal form.

What Is a Normal Form, Anyway?

Schema normalization typically begins by putting your application data into the first normal form (1NF). Although there are specific rules that define exactly what 1NF means, that’s a little beyond what we want to cover here. For our purposes, we can consider 1NF data to be any data that’s tabular (composed of rows and columns), with each row-column intersection (“cell”) containing exactly one value. This requirement that each cell contains exactly one value is, as we’ll see later, a requirement that MongoDB does not impose, with the potential for some nice performance gains. Back in our relational case, let’s consider a phone book application. Your initial data might be of the following form, shown in Table 1-1.

Table 1-1. Phone book v1
id   name    phone_number   zip_code
1    Rick    555-111-1234   30062
2    Mike    555-222-2345   30062
3    Jenny   555-333-3456   01209

This data is actually already in first normal form. Suppose, however, that we wished to allow for multiple phone numbers for each contact, as in Table 1-2.

Table 1-2. Phone book v2
id   name    phone_numbers               zip_code
1    Rick    555-111-1234                30062
2    Mike    555-222-2345;555-212-2322   30062
3    Jenny   555-333-3456;555-334-3411   01209

Now we have a table that’s no longer in first normal form. If we were to actually store data in this form in a relational database, we would have to decide whether to store phone_numbers as an unstructured BLOB of text or as separate columns (i.e., phone_number0, phone_number1). Suppose we decided to store phone_numbers as a text column, as shown in Table 1-2. If we needed to implement something like caller ID, finding the name for a given phone number, our SQL query would look something like the following:

SELECT name FROM contacts WHERE phone_numbers LIKE '%555-222-2345%';

Unfortunately, a LIKE pattern with a leading wildcard can’t use an index, so this query requires a full table scan to be satisfied.

Alternatively, we can use multiple columns, one for each phone number, as shown in Table 1-3.

Table 1-3. Phone book v2.1 (multiple columns)
id   name    phone_number0   phone_number1   zip_code
1    Rick    555-111-1234    NULL            30062
2    Mike    555-222-2345    555-212-2322    30062
3    Jenny   555-333-3456    555-334-3411    01209

In this case, our caller ID query becomes quite verbose:

SELECT name FROM contacts
    WHERE phone_number0='555-222-2345'
        OR phone_number1='555-222-2345';

Updates are also more complicated, particularly deleting a phone number, since we either need to parse the phone_numbers field and rewrite it or find and nullify the matching phone number field. First normal form addresses these issues by breaking up multiple phone numbers into multiple rows, as in Table 1-4.

Table 1-4. Phone book v3
id   name    phone_number   zip_code
1    Rick    555-111-1234   30062
2    Mike    555-222-2345   30062
2    Mike    555-212-2322   30062
3    Jenny   555-333-3456   01209
3    Jenny   555-334-3411   01209

Now we’re back to first normal form, but we had to introduce some redundancy into our data model. The problem with redundancy, of course, is that it introduces the possibility of inconsistency, where various copies of the same data have different values. To remove this redundancy, we need to further normalize the data by splitting it into two tables: Table 1-5 and Table 1-6. (And don’t worry, we’ll be getting back to MongoDB and how it can solve your redundancy problems without normalization really soon now.)

Table 1-5. Phone book v4 (contacts)
contact_id   name    zip_code
1            Rick    30062
2            Mike    30062
3            Jenny   01209

Table 1-6. Phone book v4 (numbers)
contact_id   phone_number
1            555-111-1234
2            555-222-2345
2            555-212-2322
3            555-333-3456
3            555-334-3411

As part of this step, we must identify a key column which uniquely identifies each row in the table so that we can create links between the tables. In the data model presented in Table 1-5 and Table 1-6, the contact_id forms the key of the contacts table, and the (contact_id, phone_number) pair forms the key of the numbers table. In this case, we have a data model that is free of redundancy, allowing us to update a contact’s name, zip code, or various phone numbers without having to worry about updating multiple rows. In particular, we no longer need to worry about inconsistency in the data model.

So What’s the Problem?

As already mentioned, the nice thing about normalization is that it allows for easy updating without any redundancy. Each fact about the application domain can be updated by changing just one value, at one row-column intersection. The problem arises when you try to get the data back out. For instance, in our phone book application, we may want to have a form that displays a contact along with all of his or her phone numbers. In cases like these, the relational database programmer reaches for a JOIN:

SELECT name, phone_number
  FROM contacts LEFT JOIN numbers
    ON contacts.contact_id=numbers.contact_id
  WHERE contacts.contact_id=3;

The result of this query? A result set like that shown in Table 1-7.

Table 1-7. Result of JOIN query
name    phone_number
Jenny   555-333-3456
Jenny   555-334-3411

Indeed, the database has given us all the data we need to satisfy our screen design. The real problem is in what the database had to do to create this result set, particularly if the database is backed by a spinning magnetic disk. To see why, we need to briefly look at some of the physical characteristics of such devices.

Spinning disks have the property that it takes much longer to seek to a particular location on the disk than it does, once there, to sequentially read data from the disk (see Figure 1-1). For instance, a modern disk might take 5 milliseconds to seek to the place where it can begin reading. Once it is there, however, it can read data at a rate of 40–80 MB per second. For an application like our phone book, then, assuming a generous 1,024 bytes per row, reading a row off the disk would take between 13 and 26 microseconds.

Figure 1-1. Disk seek versus sequential access
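
We can sanity-check those numbers with a few lines of arithmetic, using the 5 ms seek and 40–80 MB per second transfer rates quoted above:

seek_s = 0.005                       # 5 ms average seek
row_bytes = 1024                     # our generous per-row estimate
for mb_per_s in (40, 80):
    transfer_s = row_bytes / (mb_per_s * 1e6)
    print("%d MB/s: transfer %.0f us; seek is %.1f%% of the total"
          % (mb_per_s, transfer_s * 1e6,
             100 * seek_s / (seek_s + transfer_s)))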

The end result of all this math? The seek takes well over 99% of the time spent reading a row. When it comes to disk access, random seeks are the enemy. This matters so much in this context because JOINs typically require random seeks. Given our normalized data model, a likely plan for our query would be something like the following Python-style pseudocode:

contact_row = find_by_id(contacts, 3)     # one seek for the contact itself
for number_row in find_by_contact_id(numbers, 3):
    yield (contact_row.name, number_row.number)   # one seek per matching row

So there ends up being at least one disk seek for every row we retrieve from the numbers table. Of course, we’ve glossed over how find_by_contact_id works, assuming that all it needs to do is a single disk seek per row. Typically, this is actually accomplished by reading an index on numbers that is keyed by contact_id, potentially resulting in even more disk seeks.

Of course, modern database systems have evolved structures to mitigate some of this, largely by caching frequently used objects (particularly indexes) in RAM. However, even with such optimizations, joining tables is one of the most expensive operations that relational databases do. Additionally, if you end up needing to scale your database to multiple servers, you introduce the problem of generating a distributed join, a complex and generally slow operation.

Denormalizing for Performance

The dirty little secret (which isn’t really so secret) about relational databases is that once we have gone through the data modeling process to generate our nice nth normal form data model, it’s often necessary to denormalize the model to reduce the number of JOIN operations required for the queries we execute frequently.

In this case, we might just revert to storing the contact’s name redundantly in each row of the numbers table, alongside the contact_id. Of course, doing this reintroduces the redundancy we were trying to get away from, and leads to greater application complexity, as we have to make sure to update data in all its redundant locations.

MongoDB: Who Needs Normalization, Anyway?

Into this mix steps MongoDB with the notion that your data doesn’t always have to be tabular, basically throwing most of traditional database normalization out, starting with first normal form. In MongoDB, data is stored in documents. This means that where the first normal form in relational databases required that each row-column intersection contain exactly one value, MongoDB allows you to store an array of values if you so desire.

Fortunately for us as application designers, that opens up some new possibilities in schema design. Because MongoDB can natively encode such multivalued properties, we can get many of the performance benefits of a denormalized form without the attendant difficulties in updating redundant data. Unfortunately for us, it also complicates our schema design process. There is no longer a “garden path” of normalized database design to go down, and the go-to answer when faced with general schema design problems in MongoDB is “it depends.”

MongoDB Document Format

Before getting into detail about when and why to use MongoDB’s array types, let’s review just what a MongoDB document is. Documents in MongoDB are modeled after the JSON (JavaScript Object Notation) format, but are actually stored in BSON (Binary JSON). Briefly, what this means is that a MongoDB document is a dictionary of key-value pairs, where the value may be one of a number of types:

  • Primitive JSON types (e.g., number, string, Boolean)
  • Primitive BSON types (e.g., datetime, ObjectId, UUID, regex)
  • Arrays of values
  • Objects composed of key-value pairs
  • Null

In our example phone book application, we might store Jenny’s contact information in a document as follows:

{
  "_id": 3,
  "name": "Jenny",
  "zip_code": "01209",
  "numbers": [ "555-333-3456", "555-334-3411" ]
}

As you can see, we’re now able to store contact information in the initial Table 1-2 format without going through the process of normalization. Alternatively, we could “normalize” our model to remove the array, referencing the contact document by its _id field:

// Contact document:
{
  "_id": 3,
  "name": "Jenny",
  "zip_code": "01209"
}

// Number documents:
{ "contact_id": 3, "number": "555-333-3456" }
{ "contact_id": 3, "number": "555-334-3411" }

The remainder of this chapter is devoted to helping you decide whether referencing or embedding is the correct solution in various contexts.

Embedding for Locality

One reason you might want to embed your one-to-many relationships is data locality. As discussed earlier, spinning disks are very good at sequential data transfer and very bad at random seeking. And since MongoDB stores documents contiguously on disk, putting all the data you need into one document means that you’re never more than one seek away from everything you need.

MongoDB also has a limitation (driven by the desire for easy database partitioning) that there are no JOIN operations available. For instance, if you used referencing in the phone book application, your application might do something like the following:

contact_info = db.contacts.find_one({'_id': 3})
number_info = list(db.numbers.find({'contact_id': 3}))

If we take this approach, however, we’re left with a problem that’s actually worse than a relational JOIN operation. Not only does the database still have to do multiple seeks to find our data, but we’ve also introduced additional latency into the lookup since it now takes two round-trips to the database to retrieve our data. Thus, if your application frequently accesses contacts’ information along with all their phone numbers, you’ll almost certainly want to embed the numbers within the contact record.
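
With the embedded schema, by contrast, a single round-trip (and a single seek) retrieves the contact together with every phone number:

contact_info = db.contacts.find_one({'_id': 3})
# contact_info['numbers'] is already populated; no second query is needed.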

Embedding for Atomicity and Isolation

Another concern that weighs in favor of embedding is the desire for atomicity and isolation in writing data. When we update data in our database, we want to ensure that our update either succeeds or fails entirely, never having a “partial success,” and that any other database reader never sees an incomplete write operation. Relational databases achieve this by using multistatement transactions. For instance, if we want to DELETE Jenny from our normalized database, we might execute code similar to the following:

BEGIN TRANSACTION;
DELETE FROM contacts WHERE contact_id=3;
DELETE FROM numbers WHERE contact_id=3;
COMMIT;

The problem with using this approach in MongoDB is that MongoDB is designed without multidocument transactions. If we tried to delete Jenny from our “normalized” MongoDB schema, we would need to execute the following code:

db.contacts.remove({'_id': 3})
db.numbers.remove({'contact_id': 3})

Why no transactions?

MongoDB was designed from the ground up to be easy to scale to multiple distributed servers. Two of the biggest problems in distributed database design are distributed join operations and distributed transactions. Both of these operations are complex to implement, and can yield poor performance or even downtime in the event that a server becomes unreachable. By “punting” on these problems and not supporting joins or multidocument transactions at all, MongoDB has been able to implement an automatic sharding solution with much better scaling and performance characteristics than you’d normally be stuck with if you had to take relational joins and transactions into account.

Using this approach, we introduce the possibility that Jenny could be removed from the contacts collection but have her numbers remain in the numbers collection. There’s also the possibility that another process reads the database after Jenny’s been removed from the contacts collection, but before her numbers have been removed. On the other hand, if we use the embedded schema, we can remove Jenny from our database with a single operation:

db.contacts.remove({'_id': 3})
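
Updates enjoy the same property. Giving Jenny an additional number, for instance, is one atomic $push to one document (the new number here is made up for illustration):

db.contacts.update(
    {'_id': 3},
    {'$push': {'numbers': '555-336-3613'}})  # append to the embedded array

Any concurrent reader sees either the old numbers array or the new one, never a half-appended state.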

Note

One point of interest is that many relational database systems relax the requirement that transactions be completely isolated from one another, introducing various isolation levels. Thus, if you can structure your updates to be single-document updates only, you can get the effect of the serializable (most conservative) isolation level without the performance penalty it carries in a relational database system.

Referencing for Flexibility

In many cases, embedding is the approach that will provide the best performance and data consistency guarantees. However, in some cases, a more normalized model works better in MongoDB. One reason you might consider normalizing your data model into multiple collections is the increased flexibility this gives you in performing queries.

For instance, suppose we have a blogging application that contains posts and comments. One approach would be to use an embedded schema:

{
  "_id": "First Post",
  "author": "Rick",
  "text": "This is my first post",
  "comments": [
     { "author": "Stuart", "text": "Nice post!" },
     ...
     ]
}

Although this schema works well for creating and displaying comments and posts, suppose we wanted to add a feature that allows you to search for all the comments by a particular user. The query (using this embedded schema) would be the following:

db.posts.find(
    {'comments.author': 'Stuart'},
    {'comments': 1})

The result of this query, then, would be documents of the following form:

{ "_id": "First Post",
  "comments": [
    { "author": "Stuart", "text": "Nice post!" },
    { "author": "Mark", "text": "Dislike!" } ] },
{ "_id": "Second Post",
  "comments": [
    { "author": "Danielle", "text": "I am intrigued" },
    { "author": "Stuart", "text": "I would like to subscribe" } ] }

The major drawback to this approach is that we get back much more data than we actually need. In particular, we can’t ask for just Stuart’s comments; we have to ask for posts that Stuart has commented on, which includes all the other comments on those posts as well. Further filtering would then be required in our Python code:

def get_comments_by(author):
    for post in db.posts.find(
        {'comments.author': author },
        {'comments': 1 }):
        for comment in post['comments']:
            if comment['author'] == author:
                yield post['_id'], comment
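
Calling list(get_comments_by('Stuart')) does yield the (post _id, comment) pairs we want, but only after the server has shipped every unrelated comment on those posts over the wire, just for us to discard it.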

On the other hand, suppose we decided to use a normalized schema:

// db.posts schema
{
  "_id": "First Post",
  "author": "Rick",
  "text": "This is my first post"
}

// db.comments schema
{
  "_id": ObjectId(...),
  "post_id": "First Post",
  "author": "Stuart",
  "text": "Nice post!"
}

Our query to retrieve all of Stuart’s comments is now quite straightforward:

db.comments.find({"author": "Stuart"})

In general, if your application’s query pattern is well-known and data tends to be accessed in only one way, an embedded approach works well. Alternatively, if your application may query data in many different ways, or you are not able to anticipate the patterns in which data may be queried, a more “normalized” approach may be better. For instance, in our “linked” schema, we’re able to sort the comments we’re interested in, or restrict the number of comments returned from a query using limit() and skip() operators, whereas in the embedded case, we’re stuck retrieving all the comments in the same order they are stored in the post.
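
For instance, to render a second page of ten of Stuart’s comments, newest first, we might write the following (a sketch that assumes each comment also carries a posted timestamp, which our minimal schema above omits):

page = (db.comments.find({'author': 'Stuart'})
        .sort('posted', -1)   # newest first; 'posted' is an assumed field
        .skip(10)             # skip past the first page of ten
        .limit(10))

None of this is possible against the embedded array without fetching the enclosing posts and sorting in application code.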

Referencing for Potentially High-Arity Relationships

Another factor that may weigh in favor of a more normalized model using document references is when you have one-to-many relationships with very high or unpredictable arity. For instance, a popular blog with a large amount of reader engagement may have hundreds or even thousands of comments for a given post. In this case, embedding carries significant penalties with it:

  • The larger a document is, the more RAM it uses.
  • Growing documents must eventually be copied to larger spaces.
  • MongoDB documents have a hard size limit of 16 MB.

The problem with taking up too much RAM is that RAM is usually the most critical resource on a MongoDB server. In particular, a MongoDB database caches frequently accessed documents in RAM, and the larger those documents are, the fewer that will fit. The fewer documents in RAM, the more likely the server is to page fault to retrieve documents, and ultimately page faults lead to random disk I/O.

In the case of our blogging platform, we may only wish to display the first three comments by default when showing a blog entry. Storing all 500 comments along with the entry, then, is simply wasting that RAM in most cases.
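
If the comments do stay embedded, a $slice projection at least keeps the unwanted comments off the wire when rendering the entry, though the server must still hold the whole document in memory:

db.posts.find_one(
    {'_id': 'First Post'},
    {'comments': {'$slice': 3}})  # return only the first three comments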

The second point, that growing documents need to be copied, has to do with update performance. As you append to the embedded comments array, eventually MongoDB is going to need to move the document to an area with more space available. This movement, when it happens, significantly slows update performance.

The final point, about the size limit of MongoDB documents, means that if you have a potentially unbounded arity in your relationship, it is possible to run out of space entirely, preventing new comments from being posted on an entry. Although this is something to be aware of, you will usually run into problems due to memory pressure and document copying well before you reach the 16 MB size limit.

Many-to-Many Relationships

One final factor that weighs in favor of using document references is the case of many-to-many or M:N relationships. For instance, suppose we have an ecommerce system storing products and categories. Each product may be in multiple categories, and each category may contain multiple products. One approach we could use would be to mimic a relational many-to-many schema and use a “join collection”:

// db.product schema
{ "_id": "My Product", ... }

// db.category schema
{ "_id": "My Category", ... }

// db.product_category schema
{ "_id": ObjectId(...),
  "product_id": "My Product",
  "category_id": "My Category" }

Although this approach gives us a nicely normalized model, our queries end up doing a lot of application-level “joins”:

def get_product_with_categories(product_id):
    product = db.product.find_one({"_id": product_id})
    category_ids = [
        p_c['category_id']
        for p_c in db.product_category.find(
            { "product_id": product_id }) ]
    categories = db.category.find({
        "_id": { "$in": category_ids } })
    return product, categories

Retrieving a category with its products is similarly complex. Alternatively, we can store the objects completely embedded in one another:

// db.product schema
{ "_id": "My Product",
  "categories": [
      { "_id": "My Category", ... }
      ...] }

// db.category schema
{ "_id": "My Category",
  "products": [
    { "_id": "My Product", ... }
    ...] }

Our query is now much simpler:

def get_product_with_categories(product_id):
    return db.product.find_one({"_id": product_id})

Of course, if we want to update a product or a category, we must update it in its own collection as well as every place where it has been embedded into another document:

def save_product(product):
    db.product.save(product)
    db.category.update(
        { 'products._id': product['_id'] },
        { '$set': { 'products.$': product } },  # positional $ targets the matched array element
        multi=True)

For many-to-many joins, a compromise approach is often best, embedding a list of _id values rather than the full document:

// db.product schema
{ "_id": "My Product",
  "category_ids": [ "My Category", ... ] }

// db.category schema
{ "_id": "My Category" }

Our query is now a bit more complex, but we no longer need to worry about updating a product everywhere it’s included in a category:

def get_product_with_categories(product_id):
    product = db.product.find_one({"_id": product_id})
    categories = list(db.category.find({
        '_id': {'$in': product['category_ids']} }))
    return product, categories
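
The reverse lookup needs no join collection at all, because MongoDB matches a queried value against each element of an array field (continuing the same sketch):

def get_products_in_category(category_id):
    # An equality match on 'category_ids' matches any element of the array.
    return db.product.find({'category_ids': category_id})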

Conclusion

Schema design in MongoDB tends to be more of an art than a science, and one of the earlier decisions you need to make is whether to embed a one-to-many relationship as an array of subdocuments or whether to follow a more relational approach and reference documents by their _id value.

The two largest benefits to embedding subdocuments are data locality within a document and the ability of MongoDB to make atomic updates to a document (but not between two documents). Weighing against these benefits is a reduction in flexibility when you embed, as you’ve “pre-joined” your documents, as well as a potential for problems if you have a high-arity relationship.

Ultimately, the decision depends on the access patterns of your application, and there are fewer hard-and-fast rules in MongoDB than in relational databases. Using the flexibility MongoDB gives you in schema design wisely will help you get the most out of this powerful nonrelational database.
