Sit Atop the Shoulders of Giants

The field of database technology is large, mature, and there is ongoing academic research on storing, searching, and making sense of data stored in a database. The commercial Oracle database was first introduced in 1979. The first version of PostgreSQL, the best choice at the time of this writing for an open source RDBMS, appeared in 1989. An unparalleled amount of research and development has gone into these products, and best-of-breed choices in this area have not changed every few years they way they have for web scripting languages and frameworks.

If you treat the database simply as a place to dump your application’s data for later retrieval, you are shortchanging yourself and your application. When used correctly, not only will your database safeguard your data from the effects of errant code, but it will also afford you aggregation, computation, and retrieval speed that you could never hope to reproduce with even the cleanest or most elegant application code.

It behooves a web application developer to learn not only the ins and outs of the web framework but also the RDBMS atop which that framework sits. Remember that your database, which contains your company’s most precious asset—its data—is very likely to outlive the application you write on top of it.

Choosing the Right RDBMS

All examples in this book assume the use of PostgreSQL. For those using Oracle or another database that adheres closely to the SQL standard, the concepts are identical, although some of the syntax may vary slightly. Many features of the SQL standard are not implemented in MySQL, so unfortunately a number of the advanced topics are not possible to implement using MySQL as of version 5.0. For this reason, it’s not recommended for a serious website, although MySQL is undeniably popular.

Why not MySQL? The Rails core team uses MySQL, and it is certainly more popular within the Rails community than PostgreSQL. This brings up two important questions. First, if MySQL isn’t as good as PostgreSQL, why is it so popular? And if PostgreSQL isn’t as popular as MySQL, why is it used for the examples in this book?

MySQL gained popularity for two important reasons. First, although it is open source and free, a company called MySQL AB got behind it to offer support to enterprise customers. When the shift from closed to open source software began, having this type of insurance was key to adoption of open source products. For whatever reason, in the early days, PostgreSQL did not have the same level of corporate support offerings as did MySQL—although today there is plenty of support from a variety of vendors.

The second reason for MySQL’s ascension is that it always had much simpler point-and-click installers on Windows, whereas PostgreSQL remained, for a long time, the domain of Unix and Linux users. In the PostgreSQL community this made sense, because databases are hosted on *NIX servers—why would you need point-and-click Windows installers? It should be a lesson, then, that any barriers you erect to users using your software will be to your detriment. Of course, many developers and decision makers worked on Windows, and their inability to easily give PostgreSQL a test run often pushed them to MySQL.

So MySQL won the popularity contest—why don’t we accept that in this book? It’s for the same reason that *NIX is the de facto choice for servers, even as Windows dominates the desktop market. PostgreSQL is simply better at doing the job of being an RDBMS. In addition to implementing much more of the SQL standard, and more faithfully, PostgreSQL also has a much better query planner than MySQL. In addition, MySQL has a variety of strange vestiges from its early days, such as the number value zero being treated as equal to the absence of a value, NULL. MySQL is also case-insensitive by default. Idiosyncrasies like this seem small at the outset, but often come back later to haunt you.

Working with PostgreSQL instills a feeling of safety that MySQL does not. If you haven’t made the switch yet, it’s worth trying PostgreSQL out as you read this book.

A Note on Migrations

One of the strengths of the ActiveRecord Object Relational Mapping (ORM) library is that it provides an abstraction layer between the application developer and the database for the Data Manipulation Language (DML) components of SQL. With some exceptions, this abstraction provides a convenient interface to most inserts, updates, and deletes that your application will need to perform, and you won’t have to worry about syntax peculiarities specific to a particular database product. In many situations, the abstraction is fully adequate, and for the exceptional cases, you can always execute arbitrary SQL to get the job done.

For anyone who has written a database-backed website without an ORM, it’s probably not the absence of SQL, which is a supremely straightforward language, that makes ActiveRecord worthwhile. The real benefit of ActiveRecord is the automatic unmarshalling of results from SQL queries into Ruby objects, an otherwise tedious, manual task that can be painstaking and error-prone. The code in Example 4-1 to load a user record, which you would never write in a Ruby on Rails application, is an approximation of what life is like without an ORM.

Example 4-1. A dramatization of what accessing a database might look like without object-relational mapping tools

db_result = ActiveRecord::Base.select_one("
  select first_name, last_name, birthdate, favorite_food
    from users
   where id = #{id}
")
user = User.new
user.id = id
user.first_name = db_result[:first_name]
user.last_name = db_result[:last_name]
user.birthdate = db_result[:birthdate]
user.favorite_food = db_result[:favorite_food]

With ActiveRecord, the same is accomplished with a single statement:

user = User.find(id)

Abstracting DML provides a fantastic reduction in the amount of rote code that must be written (and rote code that seldom needs special cases, at that). However, one less successful consequence of the desire to fully abstract the database layer has been an attempt to abstract the Data Definition Language (DDL).

DDL statements are those that define tables and sequences, create indexes on tables, and define stored procedures. While DML statements occur throughout an application with every insert, update, or delete statement, DDL statements generally do not appear within applications at all. DDL statements, because they define the structure of your data layer, get executed only once, usually when your application isn’t even running.

Therefore, the attempt to abstract DDL through migrations has not been the boon that abstracting DML has been. The first reason is the lack of benefit gained from using migrations rather than writing straight DDL. Whereas abstracting DML provides a huge savings in the amount of repetitive code that must be written to create objects, as shown earlier, migrations don’t provide any such benefit.

The second reason to be wary of migrations is that as of this writing, they support only a small subset of the DDL language. Just as with DML, you can always mix in some custom DDL with your Ruby migration code, but in this case, the consequence is that you’ll be writing much more code altogether, and the result will be much less succinct than if you had written it all with DDL statements.

Because migrations have had a number of shortcomings, they’re continuously changing. Rails 2.1 solves the problem caused by multiple developers trying to write migrations at once; pre 2.1, migrations were named in a way that required lots of developer communication. While it’s likely that migrations will continue to change, it’s not guaranteed they will ever be a good replacement for DDL, which, as it happens, was designed specifically the purpose it serves: data definition.

Therefore, in this book, we forget about migrations and built and manipulate our schemas using plain old SQL DDL statements. Every developer should understand SQL DDL, and if your organization uses migrations, it shouldn’t be hard to learn the state-of-the-art in migrations, and implement your DDL that way. The important thing is understanding what is supposed to be happening beneath the scenes, so you can still write DDL when you need to. Many operations that are easy with DDL are still—and some will always be—impossible with migrations. If you rely on migrations and skip the DDL that migrations don’t support, you’re shortchanging yourself.

Dispelling Myths

There is a camp of web developers who will tell you the topics in the following three chapters, all of which deal with different types of referential integrity, are unnecessary overhead. They will tell you that application level checks are sufficient to protect your data, and that database-level constraints are sure to make your application slow. It turns out that whenever I meet someone from this camp, it just so happens that they are die-hard MySQL users. Not surprising, because until recently, MySQL did not support referential integrity. MySQL, before version 5.0, was not a relational database management system, but rather simply a database management system. What these people are actually telling you is that RDBMSs are no better than DBMSs, and when the idea is framed that way, it becomes clear that this camp simply doesn’t fully understand why relational databases exist in the first place.

When MySQL came on the scene, it had wild success because it was well-packaged and easy to install not only on Linux, but on Windows, too. MySQL and PHP swept the Internet community because they were so easy to set up, and you could be up and running and writing web pages in just a few hours.[1] As developers got used to their tools, they got used to the deficiencies in those tools, too. So the lacking of a feature became a feature in and of itself. I have actually heard very highly paid consultants say, “MySQL doesn’t support referential integrity because you don’t need it!”

Since enterprise-level applications do need referential integrity, as well as many other features available in other more mature RDBMSs (e.g., views, transactions, triggers, isolation, etc.), MySQL has added support for it in version 5.0. Therefore, the argument that you don’t need it because MySQL doesn’t have it no longer holds water. MySQL now has it because you need it.

Another myth worth dispelling is that referential integrity is just training wheels that you should take off when your application is in production. This type of thinking could not be more backward. Referential integrity constraints certainly do help you find your application’s bugs, but it’s foolhardy to think you’ll find them all before you decide it’s time to throw users at the system. This warning holds doubly true when you start releasing updates once your application is already live. Testing every conceivable use-case, including full regression testing for each release, is next to impossible. On the other hand, your users—including Internet bots, both neutral and malicious—will make your application run the gauntlet. They will find use cases you never imagined. Production is not the place where you want to find referential integrity bugs. But worse, production is also not the place where you want referential integrity bugs to go unnoticed. Your database constraints and referential integrity checks are the last line of defense protecting your data before things go awry. If you want to keep your data intact, you want your constraints to be as complete as they can be.



[1] Replace “MySQL and PHP” in this sentence with “Rails.” In the Rails community, we must be careful not to fall into the same trap of defining as good all of those things that are in Rails and defining as bad all of those things that are not currently in Rails.

Get Enterprise Rails now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.