Before going any further, there’s an important point I need to explain, because it underpins everything else to be discussed in this book. The relational model is, of course, a data model. Unfortunately, however, this latter term has two quite distinct meanings in the database world. The first and more fundamental one is this:
Definition: A data model (first sense) is an abstract, self-contained, logical definition of the data structures, data operators, and so forth, that together make up the abstract machine with which users interact.
This is the meaning we have in mind when we talk about the relational model in particular. And, armed with this definition, we can usefully, and importantly, go on to distinguish a data model in this first sense from its implementation, which can be defined as follows:
Definition: An implementation of a given data model is a physical realization on a real machine of the components of the abstract machine that together constitute that model.
Let me illustrate these definitions in terms of the relational model specifically. First of all, consider the concept relation itself. That concept is part of the model: Users have to know what relations are, they have to know they’re made up of tuples and attributes, they have to know how to interpret them, and so on. All that’s part of the model. But they don’t have to know how relations are physically stored on the disk, or how individual data values are physically encoded, or what indexes or other access paths exist; all that’s part of the implementation, not part of the model.
Or consider the concept join: Users have to know what a join is, they have to know how to invoke a join, they have to know what the result of a join looks like, and so on. Again, all that’s part of the model. But they don’t have to know how joins are physically implemented, or what expression transformations take place under the covers, or what indexes or other access paths are used, or what physical I/O operations occur; all that’s part of the implementation, not part of the model.
And one more example: Candidate keys (keys for short) are, again, part of the model, and users definitely have to know what keys are; in particular, they have to know that such keys have the property of uniqueness. Now, key uniqueness is typically enforced in today’s systems by means of what’s called a “unique index”; but indexes in general, and unique indexes in particular, aren’t part of the model, they’re part of the implementation. Thus, a unique index mustn’t be confused with a key in the relational sense, even though the former might be used to implement the latter (more precisely, to implement some key constraint—see Chapter 8).
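The difference between a key and a unique index can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the table S with attributes SNO and SNAME, the sample rows, and the helper violates_key are all hypothetical, and SQLite is just one product among many. The user declares the key as a constraint and relies on uniqueness; the index that SQLite happens to create in order to enforce it is never mentioned by the user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Model level: the user declares that SNO is a key (a uniqueness constraint).
# No index is mentioned anywhere in this statement.
conn.execute("CREATE TABLE S (SNO TEXT NOT NULL, SNAME TEXT NOT NULL, UNIQUE (SNO))")
conn.execute("INSERT INTO S VALUES ('S1', 'Smith')")


def violates_key(sno, sname):
    """Return True if the insert is rejected by the key constraint (hypothetical helper)."""
    try:
        conn.execute("INSERT INTO S VALUES (?, ?)", (sno, sname))
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_rejected = violates_key("S1", "Jones")     # same SNO as an existing row
distinct_accepted = not violates_key("S2", "Jones")  # new SNO, so no violation

# Implementation level: this product enforces the constraint with an index
# it created on its own initiative. The user did not have to know about it.
index_names = [row[1] for row in conn.execute("PRAGMA index_list(S)")]
print(duplicate_rejected, distinct_accepted, index_names)
```

A different product, or a later release of the same one, could enforce the same key by some other means without the declaration above having to change.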
In a nutshell, then:
The model (first meaning) is what the user has to know.
The implementation is what the user doesn’t have to know.
Please understand that I’m not saying here that users aren’t allowed to know about the implementation; I’m just saying they don’t have to. In other words, everything to do with implementation should be, at least potentially, hidden from the user.
Here are some important consequences of the foregoing definitions. First of all, observe that everything to do with performance is fundamentally an implementation issue, not a model issue. This point is widely misunderstood! For example, we often hear remarks to the effect that “joins are slow.” But such remarks simply make no sense. Join is part of the model, and the model as such can’t be said to be either fast or slow; only implementations can be said to possess any such quality. Thus, we might reasonably say that some specific product X has a faster or slower implementation of some specific join, on some specific data, than some other specific product Y does—but that’s about all.
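To make the point concrete, here is a minimal sketch in Python of two implementations of one and the same join. Everything in it is an assumption made for illustration: the function names join_nested_loop and join_hash, the representation of tuples as dicts, the sample data, and the simplification that the two relations have exactly one attribute in common. The result is the same relation in both cases; only the amount of work done to produce it differs.

```python
from collections import defaultdict


def join_nested_loop(r, s, attr):
    """Join by comparing every tuple of r with every tuple of s."""
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]


def join_hash(r, s, attr):
    """Join by first grouping the tuples of s on the common attribute."""
    buckets = defaultdict(list)
    for u in s:
        buckets[u[attr]].append(u)
    return [{**t, **u} for t in r for u in buckets.get(t[attr], [])]


def as_relation(rows):
    """View a list of dicts as a set of tuples, so that order is irrelevant."""
    return {frozenset(row.items()) for row in rows}


# Hypothetical sample data.
S = [{"SNO": "S1", "CITY": "London"},
     {"SNO": "S2", "CITY": "Paris"},
     {"SNO": "S3", "CITY": "Paris"}]
SP = [{"SNO": "S1", "PNO": "P1"},
      {"SNO": "S1", "PNO": "P2"},
      {"SNO": "S2", "PNO": "P1"}]

same_result = as_relation(join_nested_loop(S, SP, "SNO")) == as_relation(join_hash(S, SP, "SNO"))
print(same_result)
```

Asking whether "join" is fast or slow is meaningless here; the only sensible question is whether join_nested_loop or join_hash is faster on some particular data.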
Now, I don’t want to give the wrong impression here. It’s true that performance is fundamentally an implementation issue; however, that doesn’t mean a good implementation will perform well if you use the model badly. Indeed, that’s precisely one of the reasons why you need to know the model: so you won’t use it badly. If you write an expression such as S JOIN SP, you’re within your rights to expect the system to implement it efficiently; but if you insist on, in effect, hand coding the join yourself, perhaps like this (pseudocode)—
   do for all tuples in S ;
      fetch S tuple into TS , TN , TT , TC ;
      do for all tuples in SP with SNO = TS ;
         fetch SP tuple into TS , TP , TQ ;
         emit TS , TN , TT , TC , TP , TQ ;
      end ;
   end ;
—then there’s no way you’re going to get good performance. Recommendation: Don’t do this. Relational systems shouldn’t be used like simple access methods.
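The contrast can be sketched against a real SQL product, here SQLite via Python's standard sqlite3 module. The attribute names other than SNO (SNAME, STATUS, CITY, PNO, QTY), the sample rows, and the function names join_declarative and join_hand_coded are assumptions introduced for the illustration. Both functions produce the same tuples, but the second one forces the system to process one small request per supplier and leaves it no freedom to choose a better strategy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SNAME TEXT, STATUS INTEGER, CITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER, PRIMARY KEY (SNO, PNO));
    INSERT INTO S  VALUES ('S1', 'Smith', 20, 'London');
    INSERT INTO S  VALUES ('S2', 'Jones', 10, 'Paris');
    INSERT INTO S  VALUES ('S3', 'Blake', 30, 'Paris');
    INSERT INTO SP VALUES ('S1', 'P1', 300);
    INSERT INTO SP VALUES ('S1', 'P2', 200);
    INSERT INTO SP VALUES ('S2', 'P1', 300);
""")


def join_declarative():
    """State what is wanted; the system decides how to compute it."""
    sql = "SELECT SNO, SNAME, STATUS, CITY, PNO, QTY FROM S NATURAL JOIN SP"
    return conn.execute(sql).fetchall()


def join_hand_coded():
    """Use the system like a record-level access method (not recommended)."""
    result = []
    suppliers = conn.execute("SELECT SNO, SNAME, STATUS, CITY FROM S").fetchall()
    for ts, tn, tt, tc in suppliers:
        # One separate request to the system for every supplier tuple.
        shipments = conn.execute("SELECT PNO, QTY FROM SP WHERE SNO = ?", (ts,))
        for tp, tq in shipments:
            result.append((ts, tn, tt, tc, tp, tq))
    return result


print(sorted(join_declarative()) == sorted(join_hand_coded()))
```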
By the way, these remarks about performance apply to SQL too. Like the relational operators (join and the rest), SQL as such can’t be said to be fast or slow—only implementations can sensibly be described in such terms—but it’s also possible to use SQL in such a way as to guarantee bad performance. Although I’ll generally have little to say about performance in this book, therefore, I will occasionally point out certain performance implications of what I’m recommending.
Aside: I’d like to elaborate for a moment on this matter of performance. By and large, my recommendations in this book aren’t based on performance as a prime motivator; after all, it has always been an objective of the relational model to take performance concerns out of the hands of the user and put them into the hands of the system instead. However, it goes without saying that this objective hasn’t yet been fully achieved, and so (as I’ve already said) the goal of using SQL relationally must sometimes be compromised in the interest of achieving satisfactory performance. That’s another reason why, as I said earlier in this chapter, the overriding rule has to be: You can do what you like, so long as you know what you’re doing. End of aside.
Back to model vs. implementation, and points arising from that distinction: The second point is that, as you probably realize, it’s precisely the separation of model and implementation that allows us to achieve physical data independence. Physical data independence—not a great term, by the way, but we seem to be stuck with it—means we have the freedom to make changes in the way the data is physically stored and accessed without having to make corresponding changes in the way the data is perceived by the user. Now, the reason we might want to change those storage and access details is, typically, performance; and the fact that we can make such changes without having to change the way the data looks to the user means that existing programs, queries, and the like can all still work after the change. Very importantly, therefore, physical data independence means protecting investment in user training and applications (investment in logical database design also, I might add).
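Here is a small sketch of physical data independence in action, again using SQLite through Python's sqlite3 module. The table SP, its sample rows, the index name SP_PNO_IX, and the helper plan are hypothetical. The query text is written once and never changes; an access path is then added, as a tuning step quite separate from the query, and the query returns exactly the same result as before. Only the access strategy reported by the system is free to differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER)")
conn.executemany(
    "INSERT INTO SP VALUES (?, ?, ?)",
    [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300)],
)

# The user's query: written once, never modified below.
QUERY = "SELECT SNO, PNO, QTY FROM SP WHERE PNO = 'P1' ORDER BY SNO"


def plan(sql):
    """Return the system's own description of how it intends to run sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before_rows, before_plan = conn.execute(QUERY).fetchall(), plan(QUERY)

# A change to the physical level only: a new access path is added.
conn.execute("CREATE INDEX SP_PNO_IX ON SP (PNO)")

after_rows, after_plan = conn.execute(QUERY).fetchall(), plan(QUERY)

print(before_rows == after_rows)  # the data looks the same to the user
print(before_plan)
print(after_plan)
```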
It follows from all of the above that, as previously indicated, indexes, and indeed physical access paths of any kind, are properly part of the implementation, not the model; they belong under the covers and should be hidden from the user. (Note that access paths as such are nowhere mentioned in the relational model.) For the same reasons, they should be rigorously excluded from SQL also. Recommendation: Avoid the use of any SQL construct that violates this precept. (Actually there’s nothing in the standard that does, so far as I’m aware, but I know the same isn’t true of certain SQL products.)
Anyway, as you can see from the foregoing definitions, the distinction between model and implementation is really just a special case—a very important special case—of the familiar distinction between logical and physical considerations in general. Sadly, however, most of today’s SQL systems don’t make those distinctions as clearly as they should. As a direct consequence, they deliver far less physical data independence than they should, and far less than, in principle, relational systems are capable of. I’ll come back to this issue in the next section.
So much for the first meaning of the term data model. The second meaning is this:
Definition: A data model (second sense) is a model of the data—especially the persistent data—of some particular enterprise.
In other words, a data model in the second sense is just a (logical, and possibly somewhat abstract) database design. For example, we might speak of the data model for some bank, or some hospital, or some government department.
Having explained these two different meanings, I’d like to draw your attention to an analogy that I think nicely illuminates the relationship between them:
A data model in the first sense is like a programming language, whose constructs can be used to solve many specific problems but in and of themselves have no direct connection with any such specific problem.
A data model in the second sense is like a specific program written in that language—it uses the facilities provided by the model, in the first sense of that term, to solve some specific problem.
By the way, it follows from all of the above that if we’re talking about data models in the second sense, then we might reasonably speak of “relational models” in the plural, or “a” relational model (with an indefinite article). But if we’re talking about data models in the first sense, then there’s only one relational model, and it’s the relational model (with the definite article). I’ll have more to say on this latter point in Appendix A.
For the remainder of this book I’ll use the term data model, or more usually just model for short, exclusively in its first sense.
Note on the sentence “Relational systems shouldn’t be used like simple access methods”: More than one reviewer observed that it didn’t make sense (how can a system be used as a method?). Well, if you’re too young to be familiar with the term access method, then I envy you; but the fact is that the term, inappropriate though it certainly was (and is), was widely used for many years to mean a simple record-level I/O facility of one kind or another.