Chapter 4

No Duplicates, No Nulls

I haven’t even mentioned yet the way the silly notions

Discussed so far interreact and lead us into oceans

Of complication and despond and general distress.

Are two nulls equal (duplicates)? I fear, both NO and YES.

—Anon.: Where Bugs Go

In the previous chapter, I said the following (approximately):

  • Relations never contain duplicate tuples, because the body of a relation is a set (a set of tuples) and sets in mathematics don’t contain duplicate elements.

  • Relations never contain nulls, because the body of a relation is a set of tuples, and tuples in turn never contain nulls.

I also suggested that since there was so much to be said about these topics, it was better to devote a separate chapter to them. This is that chapter. Note: By definition, the topics in question are SQL topics, not relational ones; in this chapter, therefore, I’ll use the terminology of SQL rather than that of the relational model (for the most part, at any rate).

WHAT’S WRONG WITH DUPLICATES?

There are numerous practical arguments in support of the position that duplicate rows (“duplicates” for short) should be prohibited. Here I want to emphasize just one—but I think it’s a powerful one.1 However, it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

  1. I assume you know that relational DBMSs include a component called the optimizer,2 whose job is to try to figure out the best way to implement user queries and the like (where “best” basically means best performing).

  2. I assume you also know that one of the things optimizers do is what’s sometimes called query rewrite. Query rewrite is the process of transforming some relational expression exp1 (representing some user query, say) into another such expression exp2, such that exp1 and exp2 are guaranteed to produce the same result when evaluated but exp2 performs better than exp1 (at least, we hope so). Note: Be aware, however, that the term query rewrite is also used in certain commercial products with a different, typically more limited meaning.

Now I can present my argument. The fundamental point I want to make is that certain expression transformations, and hence certain optimizations, that would be valid if SQL were truly relational aren’t valid if duplicate rows are allowed. By way of example, consider the (nonrelational) database shown in Fig. 4.1. Note right away that the tables in that database have no keys (and hence no primary keys a fortiori, which is why there’s no double underlining in the figure). And by the way: If you’re thinking the database is totally unrealistic—and especially if you’re thinking that because of that fact you’re not going to be convinced by the arguments that follow—I politely request that you suspend judgment until you’ve seen the further discussion of this example at the beginning of the next section, “Duplicates: Further Issues.”

image

Fig. 4.1: A nonrelational database, with duplicates

Before going any further, perhaps I should ask the question: What does it mean to have three (P1,Screw) rows in table P and not two, or four, or seventeen? It must mean something, for if it means nothing, then why are the duplicates there in the first place? As I once heard Ted Codd say: If something is true, saying it twice doesn’t make it any more true.3

So I have to assume there’s some meaning attached to the duplication, even though that meaning, whatever it is, is hardly very explicit. Given that duplicates do have some meaning, therefore, there are presumably going to be business decisions made on the basis of the fact that, for example, there are three (P1,Screw) rows in table P and not two or four or seventeen. For if not, then (to repeat) why are the duplicates there in the first place?

Aside: In fact the foregoing paragraph touches on another point: namely, that duplicates violate one of the original objectives of the relational model. The objective in question is explicitness; that is, the meaning of the data in the database should be as explicit and obvious as possible (since databases are supposed to be suitable for sharing among a wide variety of disparate users and applications). As we’ve just seen, however, the presence of duplicates strongly suggests that part of the meaning of that data is not explicit but hidden. In fact, duplicates can be regarded as violating one of the most fundamental relational principles of all: viz., The Information Principle (to be discussed in Appendix A). End of aside.

Now consider the following query on the database of Fig. 4.1: “Get part numbers for parts that either are screws or are supplied by supplier S1, or both.” Here are some candidate SQL formulations for this query, together with the result produced in each case:

  1. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 3, P2 * 1.

  2. SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'
    OR     SP.PNO IN
         ( SELECT P.PNO
           FROM   P
           WHERE  P.PNAME = 'Screw' )

    Result: P1 * 2, P2 * 1.

  3. SELECT P.PNO
    FROM   P , SP
    WHERE  ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 9, P2 * 3.

  4. SELECT SP.PNO
    FROM   P , SP
    WHERE  ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 8, P2 * 4.

  5. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 5, P2 * 2.

  6. SELECT DISTINCT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 3, P2 * 2.

  7. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION  ALL
    SELECT DISTINCT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 4, P2 * 2.

  8. SELECT DISTINCT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 1, P2 * 1.

  9. SELECT DISTINCT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'
    OR     SP.PNO IN
         ( SELECT P.PNO
           FROM   P
           WHERE  P.PNAME = 'Screw' )

    Result: P1 * 1, P2 * 1.

  10. SELECT P.PNO
    FROM   P
    GROUP  BY P.PNO , P.PNAME
    HAVING P.PNAME = 'Screw'
    OR     P.PNO IN
         ( SELECT SP.PNO
           FROM   SP
           WHERE  SP.SNO = 'S1' )

    Result: P1 * 1, P2 * 1.

  11. SELECT P.PNO
    FROM   P , SP
    GROUP  BY P.PNO , P.PNAME , SP.SNO , SP.PNO
    HAVING ( SP.SNO = 'S1' AND
             SP.PNO = P.PNO )
    OR     P.PNAME = 'Screw'

    Result: P1 * 2, P2 * 2.

  12. SELECT P.PNO
    FROM   P
    WHERE  P.PNAME = 'Screw'
    UNION
    SELECT SP.PNO
    FROM   SP
    WHERE  SP.SNO = 'S1'

    Result: P1 * 1, P2 * 1.

Aside: Actually, certain of the foregoing formulations—which?—are a little suspect, because they effectively assume that every screw is supplied by at least one supplier. But this fact makes no material difference to the argument that follows. End of aside.

The first point to notice, then, is that the twelve different formulations produce nine different results: different, that is, with respect to their degree of duplication. (By the way, I make no claim that the twelve different formulations and the nine different results are the only ones possible; indeed, they aren’t, in general.) Thus, if the user really cares about duplicates, then he or she needs to be extremely careful in formulating the query in such a way as to obtain exactly the desired result.

Furthermore, analogous remarks apply to the system itself: Because different formulations can produce different results, the optimizer too has to be extremely careful in its task of expression transformation. For example, the optimizer isn’t free to transform, say, formulation 1 into formulation 12 or the other way around, even if it would like to. In other words, duplicate rows act as a significant optimization inhibitor. Here are some implications of this fact:

  • The optimizer code itself is harder to write, harder to maintain, and probably more buggy— all of which combine to make the product more expensive and less reliable, as well as later in delivery to the marketplace, than it might be.

  • System performance is likely to be worse than it might be.

  • Users are going to have to get involved in performance issues. To be more specific, they’re going to have to spend time and effort in figuring out how to formulate a given query in order to get the best performance—a state of affairs that (as noted in Chapter 1) the relational model was expressly intended to avoid.

The fact that duplicates serve as an optimization inhibitor is particularly frustrating in view of the fact that, in most cases, users probably don’t care how many duplicates appear in the result. In other words: Different formulations produce different results; however, the differences are probably irrelevant from the user’s point of view; but the optimizer is unaware of this latter fact and is therefore prevented, unnecessarily, from performing the transformations it might like to perform.

DUPLICATES: FURTHER ISSUES

There’s much, much more that could be said regarding duplicates and what’s wrong with them, but in this section I’ll limit myself to just three further points. The first has to do with the fact that in practice (as I mentioned earlier) base tables, at least, almost never do contain duplicate rows, and hence that the example in the previous section might reasonably be regarded as unrealistic. Well, all right; but the trouble is, SQL can generate duplicates in query results. Indeed, different formulations of “the same” query can produce results with different degrees of duplication, even if the input tables themselves have no duplicates at all. By way of illustration, let’s see what happens to that example from the previous section if we revise the database to make the base tables duplicate free, as in Fig. 4.2 (thanks to a reader of the previous edition, Ed Hynes, for drawing this example to my attention):

image

Fig. 4.2: A relational database, without duplicates

Now, in the previous section I showed twelve different formulations of the query “Get part numbers for parts that either are screws or are supplied by supplier S1, or both” against the database of Fig. 4.1. Well, here are the results produced by those same twelve formulations against the revised version of the database in Fig. 4.2 (Exercise: Check these!):

  1. P1 * 3, P2 * 1

  2. P1 * 2, P2 * 1

  3. P1 * 5, P2 * 3

  4. P1 * 6, P2 * 2

  5. P1 * 2, P2 * 2

  6. P1 * 2, P2 * 2

  7. P1 * 2, P2 * 2

  8. P1 * 1, P2 * 1

  9. P1 * 1, P2 * 1

  10. P1 * 3, P2 * 1

  11. P1 * 5, P2 * 3

  12. P1 * 1, P2 * 1

As you can see, the twelve formulations still produce several different results (results, that is, that differ with respect to their degree of duplication). As I claimed above, therefore, it’s clear that even if the input tables themselves don’t contain any duplicates, different formulations of the same query can produce results with different degrees of duplication, and optimization is thus still inhibited. So the message is: Making sure that base tables never contain any duplicate rows is necessary but not sufficient to avoid duplicate rows entirely.

At the risk of beating a dead horse, I’d like to pursue this point just a moment longer and consider a much simpler example (I didn’t lead with this example because it’s almost too simple, a fact that can make it easy to miss the real significance of what’s going on). Here are two possible formulations of the query “Get supplier numbers for suppliers who supply at least one part” on our usual suppliers-and-parts database (and note that this time the input tables most definitely don’t contain any duplicates):

SELECT SNO                        SELECT SNO
FROM   S                          FROM   S NATURAL JOIN SP
WHERE  SNO IN              
     ( SELECT SNO          
       FROM   SP )         

At least one of these expressions—which?—will produce a result with duplicates, in general. (Exercise: Given our usual sample data values, what results do the two expressions produce?)

So what do we conclude from examples like the ones above and the one discussed in the previous section? Well, what I’d like to conclude is that you should abide by the following suggestions (and if you do, you can then just forget about the duplicates problem entirely):

  • First, never allow duplicates in base tables (by always specifying at least one key—see Chapter 5).

  • Second, ensure that query results never contain duplicates (for example, by always specifying DISTINCT in your SQL queries).

Unfortunately, however, life is never quite as simple as we might like, and the second of these suggestions, at least, needs more discussion and explanation. But let me leave it at that for now; I’ll come back and revisit it in the next section (“Avoiding Duplicates in SQL”).

I turn now to my second point. The fact is, there’s another at least psychological argument against duplicates that I think is quite persuasive (thanks to Jonathan Gennick for this one): If, in accordance with the n-dimensional perspective on relations discussed in Chapter 3, you think of a table as a plot of points in some n-dimensional space, then duplicate rows clearly don’t add anything—they simply amount to plotting the same point twice.

My final point is this. Suppose table T does permit duplicates. Then we can’t tell the difference between “genuine” duplicates in T and duplicates that arise from errors in data entry on T! For example, suppose the person responsible for data entry unintentionally enters the very same row twice—e.g., by inadvertently hitting the return key twice (easily done, by the way). Then there’s no straightforward way to delete the “second” row without deleting the “first” as well. Note that we presumably do want to delete that “second” row, since it shouldn’t have been entered in the first place.

AVOIDING DUPLICATES IN SQL

The relational model prohibits duplicates; to use SQL relationally, therefore, steps must be taken to prevent them from occurring. Now, if every base table has at least one key (see Chapter 5), then duplicates will never occur in base tables as such. As we’ve seen, however, certain SQL expressions can still yield result tables with duplicates. Here are some of the cases in which such tables can be produced:

  • SELECT ALL

  • UNION ALL

  • VALUES (i.e., table value constructor invocations)

Regarding VALUES, see Chapter 3. Regarding ALL, note first that this keyword (and its alternative, DISTINCT) can be specified:

  • In a SELECT clause, immediately following the SELECT keyword

  • In a union, intersection, or difference, immediately following the applicable keyword (UNION, INTERSECT, and EXCEPT, respectively)

  • Inside the parentheses in an invocation of a “set function” such as SUM, immediately preceding the argument expression4

Note: DISTINCT is the default for UNION, INTERSECT, and EXCEPT; ALL is the default in the other cases.

Now, the “set function” case is special; you must specify ALL, at least implicitly, if you want the function to take duplicate values into account, which sometimes you do (see Chapter 7). But the other cases have to do with elimination of duplicate rows, which must always be done, at least in principle, if you want to use SQL relationally. Thus, the obvious recommendations in those cases are: Always specify DISTINCT; preferably do so explicitly; and never specify ALL. Then you can just forget about duplicate rows entirely.

In practice, however (and as previously noted), matters aren’t quite that simple. Why not? Well, I don’t think I can do better here than repeat the essence of what I wrote in this book’s predecessor (Database in Depth, O’Reilly Media Inc., 2005):

At this point in the original draft, I added that if you find the discipline of always specifying DISTINCT annoying, don’t complain to me—complain to the SQL vendors instead. But my reviewers reacted with almost unanimous horror to my suggestion that you should always specify DISTINCT. One wrote: “Those who really know SQL well will be shocked at the thought of coding SELECT DISTINCT by default.” Well, I’d like to suggest, politely, that (a) those who are “shocked at the thought” probably know the implementations well, not SQL, and (b) their shock is probably due to their recognition that those implementations do such a poor job of optimizing away unnecessary DISTINCTs.5 If I write SELECT DISTINCT SNO FROM S ..., that DISTINCT can safely be ignored. If I write either EXISTS (SELECT DISTINCT ...) or IN (SELECT DISTINCT ...), those DISTINCTs can safely be ignored. If I write SELECT DISTINCT SNO FROM SP ... GROUP BY SNO, that DISTINCT can safely be ignored. If I write SELECT DISTINCT ... UNION SELECT DISTINCT ..., those DISTINCTs can safely be ignored. And so on. Why should I, as a user, have to devote time and effort to figuring out whether some DISTINCT is going to be a performance hit and whether it’s logically safe to omit it?—and to remembering all of the details of SQL’s inconsistent rules for when duplicates are automatically eliminated and when they’re not?

Well, I could go on. However, I decided—against my own better judgment, but in the interest of maintaining good relations (with my reviewers, I mean)—not to follow my own advice elsewhere in this book but only to request duplicate elimination explicitly when it seemed to be logically necessary to do so. It wasn’t always easy to decide when that was, either. But at least now I can add my voice to those complaining to the vendors, I suppose.

So the recommendation (sadly) boils down to this: First, make sure you know when SQL eliminates duplicates without you asking it to. Second, in those cases where you do have to ask, make sure you know whether it matters if you don’t. Third, in those cases where it matters, specify DISTINCT (but, as Hugh Darwen once said, be annoyed about it). And never specify ALL!

WHAT’S WRONG WITH NULLS?

The opening paragraph from the section “What’s Wrong with Duplicates?” applies equally well here, with just one tiny text substitution, so I’ll basically just repeat it: There are numerous practical arguments in support of the position that nulls should be prohibited. Here I want to emphasize just one—but I think it’s a powerful one. But it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

  1. I assume you know that any comparison in which at least one of the comparands is null evaluates to the UNKNOWN truth value instead of TRUE or FALSE. The justification for this state of affairs is the intended interpretation of null as value unknown: If the value of A is unknown, then it’s also unknown whether, for example, A > B, regardless of the value of B (even—perhaps especially—if the value of B is unknown as well). Note: That same state of affairs is also the source of the term three-valued logic (3VL). That is, the notion of nulls, as understood in SQL, inevitably leads to a logic in which there are three truth values instead of the usual two. (The relational model, by contrast, is based on conventional two-valued logic, 2VL.)

  2. I assume you’re also familiar with the 3VL truth tables for the familiar logical operators— also known as connectives—NOT, AND, and OR (using T, U, and F to stand for TRUE, UNKNOWN, and FALSE, respectively):

    p NOT p       p q p AND q         p q p OR q
    ──┼───────     ─────┼─────────       ─────┼────────
    T    F         T T     T            T T    T
    U    U         T U     U            T U    T
    F    T         T F     F            T F    T
                    U T     U            U T    T
                    U U     U            U U    U
                    U F     F            U F    U
                    F T     F            F T    T
                    F U     F            F U    U
                    F F     F            F F    F

    Observe in particular that NOT returns UNKNOWN if its input is UNKNOWN; AND returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or TRUE; and OR returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or FALSE.

Now I can present my argument. The fundamental point I want to make is that certain boolean expressions—and therefore certain queries in particular—can produce results that are correct according to three-valued logic but not correct in the real world. By way of example, consider the (nonrelational) database shown in Fig. 4.3, in which “the CITY is null” for part P1. Note carefully that the shading in that figure, in the place where the CITY value for part P1 ought to be, stands for nothing at all; conceptually, there’s nothing at all—not even a string of blanks or an empty string—in that position (which means the “tuple” for part P1 isn’t really a tuple, a point I’ll come back to near the end of this section).

image

Fig. 4.3: A nonrelational database, with a null

Consider now the following (admittedly rather contrived) query on the database of Fig. 4.3: “Get (SNO,PNO) pairs where either the supplier and part cities are different or the part city isn’t Paris, or both.” Here’s the obvious SQL formulation of this query:

SELECT S.SNO , P.PNO
FROM   S , P
WHERE  S.CITY <> P.CITY
OR     P.CITY <> 'Paris'

Now I want to focus on the boolean expression in the WHERE clause:

( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

(I’ve added some parentheses for clarity.) For the only data we have, this expression evaluates to UNKNOWN OR UNKNOWN, which reduces to just UNKNOWN. Now, queries in SQL retrieve data for which the expression in the WHERE clause evaluates to TRUE, not to FALSE and not to UNKNOWN;6 in the example, therefore, nothing is retrieved at all.

But part P1 does have some corresponding city in the real world;7 in other words, “the null CITY” for part P1 does stand for some real value, say c. Now, either c is Paris or it isn’t. If it is, then the expression

( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

becomes (for the only data we have)

( 'London' <> 'Paris' ) OR ( 'Paris' <> 'Paris' )

which evaluates to TRUE, because the first term evaluates to TRUE. Alternatively, if c isn’t Paris, then the expression becomes (again, for the only data we have)

( 'London' <> c ) OR ( c <> 'Paris' )

which also evaluates to TRUE, because the second term evaluates to TRUE. Thus, the boolean expression is always true in the real world, and the query should therefore return the pair (S1,P1), regardless of what real world value the null stands for. In other words, the result that’s correct according to the logic (meaning, specifically, 3VL) and the result that’s correct in the real world are different!

By way of another example, consider the following query on that same table P from Fig. 4.3 (I didn’t lead with this example because it’s even more contrived than the previous one, but in some ways it makes the point with still more force):

SELECT PNO
FROM   P
WHERE  CITY = CITY

The real world answer here is surely the set of part numbers currently appearing in P (in other words, the set containing just part number P1, given the sample data shown in Fig. 4.3). SQL, however, will return no part numbers at all.

To sum up: If you have any nulls in your database, you’re getting wrong answers to certain of your queries. What’s more, you have no way of knowing, of course, just which queries you’re getting wrong answers to and which not; all results become suspect. You can never trust the answers you get from a database with nulls. In my opinion, this state of affairs is a complete showstopper.

Aside: To all of the above, I can’t resist adding that even though SQL does support 3VL, and even though it does support the keyword UNKNOWN, that keyword does not—unlike the keywords TRUE and FALSE—denote a value of type BOOLEAN, in SQL. (This is just one of the numerous flaws in SQL’s 3VL support; there are many, many others, but most of them are beyond the scope of this book.) To elaborate briefly: As with 2VL, the SQL type BOOLEAN contains just two values, TRUE and FALSE; “the third truth value” is represented, quite incorrectly, by null! Here are some consequences of this fact:

  • Assigning UNKNOWN to a variable B of type BOOLEAN actually sets B to null.

  • After such an assignment, the comparison B = UNKNOWN doesn’t give TRUE— instead, it gives null (meaning, to spell the point out, that SQL apparently believes, or claims, that it’s unknown whether B has the value UNKNOWN). Note, incidentally, that this state of affairs constitutes a clear violation of The Assignment Principle (see Exercise 2.22 in Chapter 2, also Chapter 5).

  • In fact, the comparison B = UNKNOWN always gives null (meaning UNKNOWN), regardless of the value of B, because it’s logically equivalent to the comparison “B = NULL” (not meant to be valid SQL syntax).

To understand the seriousness of such flaws, you might care to meditate on the analogy of a numeric type using null instead of zero to represent zero. End of aside.

As with the business of duplicates earlier, there’s a lot more that could be said on the whole issue of nulls, but I just want to close with a brief look at the formal argument against them. Recall that, by definition, a null isn’t a value. It follows that:

  • A “type” that contains a null isn’t a type (because types contain values).

  • A “tuple” that contains a null isn’t a tuple (because tuples contain values).

  • A “relation” that contains a null isn’t a relation (because relations contain tuples, and tuples don’t contain nulls).

  • In fact, nulls (like duplicates) violate one of the most fundamental relational principles of all—viz., The Information Principle. Once again, see Appendix A for further discussion of that principle.

The net of all this is that if nulls are present, then we’re certainly not talking about the relational model (I don’t know what we are talking about, but it’s not the relational model); the entire edifice crumbles, and all bets are off.

AVOIDING NULLS IN SQL

The relational model prohibits nulls; to use SQL relationally, therefore, steps must be taken to prevent them from occurring. First of all, a NOT NULL constraint should be specified, either explicitly or implicitly, for every column in every base table (see Chapter 5); then nulls will never occur in base tables as such. Unfortunately, however, certain SQL expressions can still yield result tables containing nulls. Here are some of the situations in which nulls can be produced:

  • The SQL “set functions” SUM, MAX, MIN (etc.) all return null if their argument is empty (except for COUNT and COUNT(*), which correctly return zero in such a situation).

  • If a scalar subquery evaluates to an empty table, that empty table is coerced to a null.

  • If a row subquery evaluates to an empty table, that empty table is coerced to a row of all nulls. Note: A row of all nulls and a null row aren’t the same thing at all, logically speaking (another logical difference here, in fact)—yet SQL does think they’re the same thing, at least some of the time. But it would take us much too far afield to get into the detailed implications of that state of affairs here.

  • Outer joins and “union joins” are expressly designed to produce nulls in their result.8

  • If the ELSE option is omitted from a CASE expression, an ELSE option of the form ELSE NULL is assumed.

  • The expression NULLIF(x,y) returns null if x = y evaluates to TRUE.

  • The “referential triggered actions” ON DELETE SET NULL and ON UPDATE SET NULL can both generate nulls (obviously enough).

Strong recommendations:

  • Do specify NOT NULL, at least implicitly, for every column in every base table.

  • Don’t use the keyword NULL in any other context whatsoever (i.e., anywhere other than a NOT NULL constraint or logical equivalent).

  • Don’t use the keyword UNKNOWN in any context whatsoever.

  • Don’t omit the ELSE option from a CASE expression unless you’re certain it would never have been reached anyway.

  • Don’t use NULLIF.

  • Don’t use outer join, and don’t use the keywords OUTER, FULL, LEFT, and RIGHT (except possibly as suggested in the section “A Remark on Outer Join” below).

  • Don’t use union join.

  • Don’t specify either PARTIAL or FULL on MATCH (they have meaning only when nulls are present). For similar reasons, don’t use the MATCH option on foreign key constraints, and don’t use IS [NOT] DISTINCT FROM. (If a and b are both nonnull, then a IS NOT DISTINCT FROM b reduces to a = b, and a IS DISTINCT FROM b reduces to a <> b.)

  • Don’t use IS TRUE, IS NOT TRUE, IS FALSE, or IS NOT FALSE. The reason is that, if bx is a boolean expression, then the following logical equivalences fail to hold only if nulls are present (the symbol “≡” means is equivalent to):

    bx IS TRUE       ≡ bx
    bx IS NOT TRUE   ≡ NOT bx
    bx IS FALSE      ≡ NOT bx
    bx IS NOT FALSE  ≡ bx

    In other words, IS TRUE and the rest are distractions at best, in the absence of nulls.

  • Finally, do use COALESCE on every scalar expression that might “evaluate to null” without it. (Apologies for the quotation marks here, but the fact is that the phrase “evaluates to null” is a solecism.)

Note: In case you’re not familiar with COALESCE, let me elaborate briefly on the last of these recommendations. Essentially, COALESCE is an operator that lets you replace a null by some nonnull value “as soon as it appears” (i.e., before it has a chance to do any significant damage). Here’s the definition: Let a, b, ..., c be scalar expressions. Then the expression COALESCE (a,b,...,c) returns null if its arguments are all null, or the value of its first nonnull argument otherwise. Of course, to use it sensibly, you do need to make sure at least one of a, b, ..., c is nonnull! Here’s a fairly realistic example:

SELECT S.SNO , ( SELECT COALESCE ( SUM ( ALL SP.QTY ) , 0 )
                 FROM   SP
                 WHERE  SP.SNO = S.SNO ) AS TOTQ
FROM   S

In this example, if the SUM invocation “evaluates to null”—which it will do in particular for any supplier that doesn’t have any matching shipments, like supplier S5 in our usual running example—then the COALESCE invocation will replace that null by a zero. (Incidentally, this example also illustrates a situation in which use of ALL instead of DISTINCT isn’t just acceptable but is logically required, though it might be implicit. See Chapter 7.) Given our usual sample data, the query thus produces the following result:

┌─────┬──────┐
SNO TOTQ
├═════┼──────┤
S1   1300
S2    700
S3    200
S4    900
S5      0
└─────┴──────┘

A REMARK ON OUTER JOIN

Outer join is expressly designed to produce nulls in its result and should therefore be avoided, in general. Relationally speaking, it’s a kind of shotgun marriage: It forces two tables into a kind of union—yes, I do mean union, not join—even when the tables in question fail to conform to the usual requirements for union (see Chapter 6). It does this, in effect, by padding one or both of the tables with nulls before doing the union, thereby making them conform to those usual requirements after all. But there’s no reason why that padding shouldn’t be done with proper values instead of nulls, as in this example:

SELECT SNO , PNO FROM SP
UNION
SELECT SNO , 'nil' AS PNO FROM S
WHERE  SNO NOT IN ( SELECT SNO FROM SP )

Result (note the row for supplier S5 in particular):

┌─────┬─────┐
SNO PNO
├═════┼═════┤
S1   P1  
S1   P2  
S1   P3  
S1   P4  
S1   P5  
S1   P6  
S2   P1  
S2   P2  
S3   P2  
S4   P2  
S4   P4  
S4   P5  
S5   nil
└─────┴─────┘

Alternatively, the same result could be obtained by using the explicit SQL outer join operator in conjunction with COALESCE, as here:

SELECT SNO , COALESCE ( PNO , 'nil' ) AS PNO
FROM   S NATURAL LEFT OUTER JOIN SP

Note: I said “there’s no reason” why the padding shouldn’t be done with proper values, as in the foregoing example, but that phraseology is really a little too glib. The example worked— sort of—because attribute PNO is of type CHAR, and so we could pad with a character string value ('nil' in the example). But what if it had been of some numeric type, say INTEGER? Or, worse, some user defined type? And even in the simple character string case, an argument could be made that the result misrepresents the semantics of the situation (does 'nil' truly represent a part number?). The truth is, padding with a real value instead of null just tends to hide the fact that outer join, no matter what code is used to achieve it, is simply not a respectable operation.9 Much better to avoid it altogether.

CONCLUDING REMARKS

There are a few final remarks I want to make regarding nulls and 3VL specifically. Nulls and 3VL are supposed to be a solution to the “missing information” problem—but I believe I’ve shown that, to the extent they can be considered a “solution” at all, they’re a disastrously bad one. Before I leave the topic, however, I’d like to raise, and respond to, an argument that’s often heard in this connection. That argument goes something like this:

All of those examples you give where nulls lead to wrong answers are very artificial. Real world queries aren’t like that! More generally, most of your criticisms seem very academic and theoretical—I bet you can’t show any real practical situations where nulls have given rise to the kinds of problems you worry about, and I bet you can’t prove such practical situations do occur.

Needless to say, I have several responses to this argument. The first is: How do we know nulls haven’t caused real practical problems, anyway? It seems to me that if some serious real world situation—an oil spill, a collapsed bridge, a wrong medical diagnosis—were found to be due to nulls, there might be valid reasons (nontechnical ones, I mean) why the information would never get out. We’ve all heard stories of embarrassing failures caused by software glitches of other kinds, even in the absence of nulls; in my opinion, nulls can only serve to make such failures much more likely.

Second, suppose someone—me, for example—were to go around claiming that some software product or application contained a serious logical error due to nulls. Can you imagine the lawsuits?

Third and most important, I think those of us who criticize nulls don’t need to be defensive, anyway; I think we should stand those counterarguments on their head, as it were. After all, it’s undeniable that nulls can lead to errors in certain cases. So it’s not up to us to prove that those “certain cases” might include practical, real world situations; rather, it’s up to those who want to defend nulls to prove that they don’t. And I venture to suggest that in practice it would be quite difficult, and very likely impossible, to prove any such thing.

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are a little too complex, in general, to be discussed in detail here. The SQL mechanism of (nonnull) default values can be used in simple cases; but for a more comprehensive approach to the problem—including in particular an explanation of how you can still get “don’t know” answers when you want them, even from a database without nulls—I refer you to Appendix C.

EXERCISES

4.1 “Duplicates in databases are a good idea in because duplicates occur naturally in the real world. For example, all pennies are duplicates of one another.” How would you respond to this argument?

4.2 Let r be a relation and let bx and by be boolean expressions. Then there’s a law (used in relational systems to help with optimization, among other things) that states that (r WHERE bx) UNION (r WHERE by) ≡ r WHERE bx OR by. If r isn’t a relation but an SQL table with duplicates, does this law still apply?

4.3 Let a, b, and c be sets. Then the distributive law of intersection over union (also used in relational systems to help with optimization among other things) states that a INTERSECT (b UNION c) ≡ (a INTERSECT b) UNION (a INTERSECT c). If a, b, and c are bags instead of sets, does this law still apply?

4.4 Part of the explanation of the FROM clause (in a SELECT - FROM - WHERE expression) in the 1992 version of the standard read as follows:

[The] result of the <from clause> is the ... cartesian product of the tables identified by [the specifications in that <from clause>]. The ... cartesian product, CP, is the multiset of all rows r such that r is the concatenation of a row from each of the identified tables ...

Note, therefore, that CP isn’t well defined!—notwithstanding the fact that the standard did go on to say that “The cardinality of CP is the product of the cardinalities of the identified tables.” For example, let tables T1 and T2 be as shown here:

T1              T2
┌────┐          ┌────┐
C1            C2
├────┤          ├────┤
  0             1
  0             2
└────┘          └────┘

Observe now that all of the following fit the foregoing definition for “the” cartesian product CP of T1 and T2 (that is, any of them could be “the” multiset referred to):

CP1                  CP2                  CP3
┌────┬────┐          ┌────┬────┐          ┌────┬────┐
C1 C2            C1 C2            C1 C2
├────┼────┤          ├────┼────┤          ├────┼────┤
  0   1             0   1             0   1
  0   1             0   1             0   2
  0   1             0   2             0   2
  0   2             0   2             0   2
└────┴────┘          └────┴────┘          └────┴────┘

Can you fix up the wording of the standard appropriately?

4.5 Consider the following SQL cursor definition:

DECLARE X CURSOR FOR SELECT SNO , QTY FROM SP ;

Note that (a) cursor X permits updates, (b) the table visible through cursor X permits duplicates, but (c) the underlying table SP doesn’t (permit duplicates, that is). Now suppose the operation DELETE ... WHERE CURRENT OF X is executed. Then there’s no way, in general, of saying which specific row of table SP is deleted by that operation. How would you fix this problem?

4.6 Please write out one googol times: There’s no such thing as a duplicate. Note: A googol is one followed by 100 zeros (i.e., 10 to the hundredth power). A googolplex is one followed by a googol zeros (i.e., 10 to the “googolth” power).

4.7 Do you think nulls occur naturally in the real world?

4.8 There’s a logical difference between null and the third truth value: True or false? (Perhaps I should ask: True, false, or unknown?)

4.9 In the body of the chapter, I gave truth tables for one monadic 3VL connective (NOT) and two dyadic 3VL connectives (AND and OR), but there are many other connectives as well (see Exercise 4.10 below). Another useful monadic connective is MAYBE,10 with truth table as follows:

P MAYBE p
──┼─────────
T     F
U     T
F     F

Does SQL support this connective?

4.10 Following on from the previous exercise, how many distinct connectives are there altogether in 2VL? What about 3VL? What do you conclude from your answers to these questions?

4.11 A logic is truth functionally complete if it supports, directly or indirectly, all possible connectives. Truth functional completeness is an extremely important property; a logic without it would be like an arithmetic without support for certain operations, say “+”. Is classical 2VL truth functionally complete? What about SQL’s 3VL?

4.12 Let bx be a boolean expression. Then bx OR NOT bx is also a boolean expression, and in 2VL it’s guaranteed to evaluate to TRUE (it’s an example of what logicians call a tautology). Is it a tautology in 3VL? If not, is there an analogous tautology in 3VL?

4.13 With bx as in the previous exercise, bx AND NOT bx is also a boolean expression, and in 2VL it’s guaranteed to evaluate to FALSE (it’s an example of what logicians call a contradiction). Is it a contradiction in 3VL? If not, is there an analogous contradiction in 3VL?

4.14 In 2VL, r JOIN r is equal to r and INTERSECT and TIMES are both special cases of JOIN (see Chapter 6). Are these observations still valid in 3VL?

4.15 The following is a legitimate SQL row value constructor invocation: ROW (1,NULL). Is the row it denotes null or nonnull?

4.16 Let bx be an SQL boolean expression. Then NOT bx and bx IS NOT TRUE are both SQL boolean expressions. Are they equivalent?

4.17 Let x be an SQL expression. Then x IS NOT NULL and NOT (x IS NULL) are both SQL boolean expressions. Are they equivalent?

4.18 Let DEPT and EMP be SQL tables; let DNO be a column in both; let ENO be a column in EMP; and consider the expression DEPT.DNO = EMP.DNO AND EMP.DNO = 'D1' (this expression might be part of the WHERE clause in some query, for example). Now, a “good” optimizer might very well transform this expression into DEPT.DNO = EMP.DNO AND EMP.DNO = 'D1’ AND DEPT.DNO = 'D1', on the grounds that a = b and b = c together imply that a = c (see Exercise 6.13 in Chapter 6). But is this transformation valid? If not, why not? What are the implications?

4.19 Suppose the suppliers-and-parts database permits nulls; in particular, suppose columns SP.SNO and SP.PNO permit nulls.11 Here then is a query on that database, expressed for reasons beyond the scope of this chapter not in SQL but in a kind of pidgin form of relational calculus (see Chapter 10):

S WHERE NOT EXISTS SP ( SP.SNO = S.SNO AND SP.PNO = 'P2' )

What does this query mean? Is the following formulation equivalent?

S WHERE NOT ( S.SNO IN ( SP.SNO WHERE SP.PNO = 'P2' ) )

4.20 Let k1 and k2 be values of the same type. In SQL, then, what exactly does each of the following mean?

  1. k1 and k2 are “the same” for the purposes of a comparison in, e.g., a WHERE clause.

  2. k1 and k2 are “the same” for the purposes of key uniqueness.

  3. k1 and k2 are “the same” for the purposes of duplicate elimination.

4.21 In the body of the chapter, I said UNION ALL can generate duplicates. But what about INTERSECT ALL and EXCEPT ALL?

4.22 Are the recommendations “Always specify DISTINCT” and “Never specify ALL” duplicates of each other?

4.23 If TABLE_DEE corresponds to TRUE (or yes) and TABLE_DUM to FALSE (or no), then what corresponds to UNKNOWN (or maybe)?

4.24 The following quotes are taken from the SQL standard:12

  • “The data type boolean comprises the distinct truth values True and False. Unless prohibited by a NOT NULL constraint, the boolean data type also supports the truth value Unknown as the null value. This [standard] does not make a distinction between the null value of the boolean data type and the truth value Unknown ... [They] may be used interchangeably to mean exactly the same thing.”

  • “All boolean values and SQL truth values are comparable ... The value True is greater than the value False, and any comparison involving the null value or an Unknown truth value will return an Unknown result.”

Do you have any comments on these quotes? In particular, which of the following (if any) do you think are legal SQL expressions? And what do they return, if they’re legal?

  1. TRUE OR FALSE

  2. TRUE OR UNKNOWN

  3. TRUE OR NULL

  4. TRUE > FALSE

  5. TRUE > UNKNOWN

  6. TRUE > NULL

4.25 In his book Using the New DB2 (Morgan Kaufmann, 1996), in a section titled “A Brief History of SQL,” Don Chamberlin—who is widely acknowledged to be “the father of SQL”— has the following to say (I’m quoting the text more or less verbatim, except that I’ve added some italics):

During the early development of SQL ... some decisions were made that were ultimately to generate a great deal [of] controversy ... Chief among these were the decisions to support null values [sic] and to permit duplicate rows ... I will [briefly examine] the reasons for these decisions ... My purpose here is historical rather than persuasive ... I recognize that nulls and duplicates are religious topics, and I do not expect anyone to have a conversion experience after reading this chapter.

Do you agree with Chamberlin that nulls and duplicates are “religious topics”?

ANSWERS

4.1 To deal with this argument properly would take more space than we have here, but it all boils down to what’s sometimes called The Principle of Identity of Indiscernibles (see Appendix A). Let a and b be any two entities—for example, two pennies. Well, if there’s absolutely no way whatsoever of distinguishing between a and b, then there aren’t two entities but only one!13 Now, it might be true for certain purposes that the two entities can be interchanged, but that fact isn’t sufficient to make them indiscernible. (Indeed, there’s a logical difference between interchangeability and indiscernibility, and arguments to the effect that “duplicates occur naturally in the real world” tend to be based on a muddle over this difference.) A detailed analysis of this whole issue can be found in the paper “Double Trouble, Double Trouble” (see Appendix G).

4.2 Before we can answer this question, we need to pin down exactly what WHERE and UNION mean in the presence of duplicates. The paper “The Theory of Bags: An Investigative Tutorial” (see Appendix G) goes into details on such matters; here let me just say that if we adopt the SQL definitions, then the law certainly doesn’t apply. In fact, it doesn’t apply to either UNION ALL or UNION DISTINCT! By way of example, let T be an SQL table with just one column—C, say—containing just two rows, each of them containing just the value v. Then the following expressions produce the indicated results:

SELECT C
FROM   T
WHERE  TRUE
OR     TRUE

Result: v * 2.

SELECT C
FROM   T
WHERE  TRUE
UNION  DISTINCT
SELECT C
FROM   T
WHERE  TRUE

Result: v * 1.

SELECT C
FROM   T
WHERE  TRUE
UNION  ALL
SELECT C
FROM   T
WHERE  TRUE

Result: v * 4.

Note: If the various (implicit or explicit) ALLs in the foregoing expressions were all replaced by DISTINCT, it would be a different story. What do you conclude?

4.3 Remarks similar to those in the answer to the previous exercise apply here also. Again I’ll skip the details; I’ll just say for the record that, first, the answer depends, of course, on what definitions we adopt for UNION and INTERSECT for bags as opposed to sets; second, with the SQL definitions, the law doesn’t apply. I’ll leave development of a counterexample to you.

4.4 As far as I can see, the only way to resolve the ambiguity is by effectively defining a mapping from each of the (multiset) argument tables to a proper set, and likewise defining a mapping of the (multiset) result table—i.e., the desired cartesian product—to a proper set. (The mappings involve attaching a unique identifier to each row.) It seems to me, in fact, that the standard’s failed attempt at a definition here serves only to emphasize the point that one of the most fundamental concepts in the entire SQL language—viz., the idea that tables should permit duplicate rows—is fundamentally flawed, and cannot be repaired without, in effect, dispensing with the concept altogether.

Note: As I’m sure you observed, the quoted text was taken from the 1992 version of the standard. Later versions use different wording, and they do manage to fix the problem—but they do so, in effect, by adopting my solution of attaching a unique identifier to each row. I rest my case.

4.5 I don’t think this problem can be fixed.

4.6 No answer provided!

4.7 The question was: Do you think nulls occur naturally in the real world? Only you can answer this question—but if your answer is yes, I think you should examine your reasoning very carefully. For example, consider the statement, concerning some specific employee Joe, “Joe’s salary is $50,000.” That statement is either true or false. Now, you might not know whether it’s true or false; but your not knowing has nothing to do with whether it actually is true or false. In particular, your not knowing is certainly not the same as saying “Joe’s salary is null”! “Joe’s salary is $50,000” is a statement about the real world. “Joe’s salary is null” is a statement about your knowledge (or lack of knowledge, rather) about the real world. We certainly shouldn’t keep a mixture of these two very different kinds of statements in the same relation, or in the same relvar.

Suppose you had to represent the fact that you don’t know Joe’s salary in some box on some paper form. Would you enter a null, as such, into that form? I don’t think so! Rather, you would leave the box blank, or put a question mark, or write “unknown,” or something along those lines. And that blank, or question mark, or “unknown”—or whatever—is a value, not a null (recall that the one thing we can be definite about regarding nulls is that they aren’t values). Speaking for myself, therefore, no, I don’t think nulls do “occur naturally in the real world.”

4.8 True (though not in SQL!). Null is a marker that represents the absence of information, while UNKNOWN is a value, just as TRUE and FALSE are values. So there’s a logical difference between the two, and to confuse them as SQL does is a logical mistake. (I’d like to say it’s a big logical mistake, but all logical mistakes are big mistakes by definition.)

4.9 Yes, it does; SQL’s analog of “MAYBE p” is “p IS UNKNOWN”.

4.10 In 2VL there are exactly 4 monadic connectives and exactly 16 dyadic connectives, corresponding to the 4 possible monadic truth tables and 16 possible dyadic truth tables. Here are those truth tables (I’ve indicated the ones that have common names, such as NOT, AND, and OR):14

                            NOT              
───┼───        ───┼───        ───┼───        ───┼──
 T T          T T          T F          T F
 F T          F F          F T          F F

    T F       IF T F      NAND T F           T F
───┼─────      ───┼─────      ───┼─────      ───┼─────
 T T T        T T F        T F T        T F F
 F T T        F T T        F T T        F T T

OR T F           T F       XOR T F           T F
───┼─────      ───┼─────      ───┼─────      ───┼─────
 T T T        T T F        T F T        T F F
 F T F        F T F        F T F        F T F

    T F       IFF T F           T F       NOR T F
───┼─────      ───┼─────      ───┼─────      ───┼─────
 T T T        T T F        T F T        T F F
 F F T        F F T        F F T        F F T

    T F       AND T F           T F           T F
───┼─────      ───┼─────      ───┼─────      ───┼─────
 T T T        T T F        T F T        T F F
 F F F        F F F        F F F        F F F

In 3VL, by contrast, there are 27 (3 to the power 3) monadic connectives and 19,683 (3 to the power 3²) dyadic connectives. (In general, in fact, nVL has n to the power n monadic connectives and n to the power n² dyadic connectives.) Many conclusions might be drawn from these facts; one of the most immediate is that 3VL is vastly more complex than 2VL (much more so, probably, than most people, including in particular those who think nulls are a good thing, realize, or at least are prepared to admit to).

4.11 Classical 2VL supports (among other things) NOT, AND, and OR and is thus truth functionally complete, because all possible 2VL connectives can be expressed in terms of NOT and either AND or OR (see the answer to Exercise 10.4 in Chapter 10 for further explanation). And it turns out that SQL’s 3VL—under an extremely charitable interpretation of that term!—is truth functionally complete as well. The paper “Is SQL’s Three-Valued Logic Truth Functionally Complete?” (see Appendix G) discusses this issue in detail.

4.12 It’s not a tautology in 3VL, because if bx evaluates to UNKNOWN, the whole expression also evaluates to UNKNOWN. But there does exist an analogous tautology in 3VL: viz., bx OR NOT bx OR MAYBE bx. Note: This state of affairs explains why, in SQL, if you execute the query “Get all suppliers in London” and then the query “Get all suppliers not in London,” you don’t necessarily get (in combination) all suppliers; you have to execute the query “Get all suppliers who may be in London” as well. Note the implications for query rewrite; note too the potential for serious mistakes (on the part of both users and the system, I might add—and there’s some history here). To spell the point out: It’s very natural to assume that expressions that are tautologies in 2VL are also tautologies in 3VL, but such is not necessarily the case.

4.13 It’s not a contradiction in 3VL, because if bx evaluates to UNKNOWN, the whole expression also evaluates to UNKNOWN. But there does exist an analogous (slightly tricky!) contradiction in 3VL: viz., bx AND NOT bx AND NOT MAYBE bx. Note: As you might expect, this state of affairs has implications similar to those noted in the answer to the previous exercise.

4.14 In 3VL (at least as realized in SQL), r JOIN r isn’t necessarily equal to r, and INTERSECT isn’t a special case of JOIN. Why so? Because in SQL, believe it or not, two nulls don’t “compare equal” for join but do “compare equal” for intersection. (I take this state of affairs to be just another of the vast—infinite?—number of absurdities that nulls inevitably seem to lead us into.) However, TIMES is still a special case of JOIN, as it is in 2VL.

4.15 Here are the rules: Let x be an SQL row. Suppose for definiteness and simplicity that x has just two components, x1 and x2 (in left to right order, of course!). Then x IS NULL is defined to be equivalent to x1 IS NULL AND x2 IS NULL, and x IS NOT NULL is defined to be equivalent to x1 IS NOT NULL AND x2 IS NOT NULL. For the given row, both of these expressions evaluate to FALSE, and it follows that the row in question is neither null nor nonnull ... What do you conclude from this state of affairs?

By the way: At least one reviewer commented at this point that he’d never thought of a row being null. But rows are values (just as tuples and relations are values), and hence the idea of some row being unknown makes exactly as much sense as, say, the idea of some salary being unknown. Thus, if the concept of representing an unknown value by a “null” makes any sense at all—which of course I don’t think it does—then it surely applies to rows (and tables, and any other kind of value you can think of) just as much as it does to scalars. And as this exercise demonstrates, SQL tries to support this position—at least for rows—but fails. (Of course, it ought logically to support it for tables, too, but in that case it doesn’t even try. I mean, there’s no such thing as a “null table” in SQL.)

4.16 No. Here are the truth tables:

NOT         IS NOT TRUE
───┼───      ───────────┼───
 T F            T      F
 U U            U      T
 F T            F      T

4.17 No. For definiteness, consider the case in which x is an SQL row. Suppose (as in the answer to Exercise 4.15 above) that x has just two components, x1 and x2. Then x IS NOT NULL is defined to be equivalent to x1 IS NOT NULL AND x2 IS NOT NULL, and NOT (x IS NULL) is defined to be equivalent to x1 IS NOT NULL OR x2 IS NOT NULL. What do you conclude from this state of affairs?

4.18 The transformation isn’t valid, as you can see by considering what happens if EMP.DNO is null (were you surprised?). The implications, once again, are that users and the system are both likely to make mistakes (and again there’s some history here).

4.19 The query means “Get suppliers who are known not to supply part P2” (note that phrase “known not,” and note also the subtle difference between that phrase and “not known”); it does not mean “Get suppliers who don’t supply part P2.” The two formulations aren’t equivalent (consider, e.g., the case where the only SP row for part number P2 in table SP has a null supplier number).

4.20 No two of a., b., c. are equivalent. Statement a. follows the rules of SQL’s 3VL; statement b. follows the definition of SQL’s UNIQUE operator; and statement c. follows SQL’s definition of duplicates. In particular, if k1 and k2 are both null, then a. gives UNKNOWN, b. gives FALSE, and c. gives TRUE (!). Here for the record are the rules in question:

  • In SQL’s 3VL, the comparison k1 = k2 gives TRUE if k1 and k2 are both nonnull and are equal, FALSE if k1 and k2 are both nonnull and are unequal, and UNKNOWN otherwise.

  • With SQL’s UNIQUE operator, the comparison k1 = k2 gives TRUE if and only if k1 and k2 are both nonnull and are equal, and FALSE otherwise (see Chapter 11 for further explanation).

  • In SQL, k1 and k2 are duplicates if and only if either (a) they’re both nonnull and equal or (b) they’re both null.

Note: Throughout the foregoing, “equal” refers to SQL’s own, somewhat idiosyncratic definition of the “=” operator (see Chapter 2). Subsidiary exercise: Do you think these rules are reasonable? Justify your answer.

4.21 The output from INTERSECT ALL and EXCEPT ALL can indeed contain duplicates, but only if duplicates are present in the input; unlike UNION ALL, therefore, these two operators never “generate” duplicates.

4.22 Yes! (We don’t want duplicates in the database, but that doesn’t mean we never want duplicates anywhere else. As I said in the body of the chapter, there’s a logical difference between logic and rhetoric.) By the way, here’s another nice illustration of essentially the same point that I came across only recently: Good food is not cheap. Cheap food is not good.

4.23 A very good question.

4.24 Well, I don’t know about you, but I have quite a few comments myself!

  • First of all, the phrase “the null value” would be better reduced to just “null” throughout.

  • Second, observe that (as noted in Chapter 4) although SQL supports three-valued logic, its BOOLEAN data type has just two values, TRUE and FALSE; “the third truth value” is represented not by a value at all but by null. This state of affairs explains (?) the distinction drawn in the second quote between “boolean values” and “SQL truth values” —as far as SQL is concerned, there are three truth values (TRUE, FALSE, and UNKNOWN) but only two boolean values (TRUE and FALSE).

  • Next: “This [standard] does not make a distinction between the null value of the boolean data type and the truth value Unknown ... [They] may be used interchangeably to mean exactly the same thing.” But, of course, null doesn’t always mean “the third truth value,” so null and “the truth value Unknown” certainly can’t be used “interchangeably” as claimed. In fact, the keyword NULL usually can’t be used in place of the keyword UNKNOWN even when UNKNOWN is the sense intended (see c. and f. in the answer to the last part of the exercise below).

  • The phrase “the null value of the boolean data type” is also rather strange, since there’s just a single null and that null, since it isn’t a value, actually has no type at all.

  • “Unless prohibited by a NOT NULL constraint, the boolean data type also supports the truth value Unknown ...”: NOT NULL doesn’t apply to data types, it applies to uses of data types (typically as part of a column definition).

  • Formal systems (like SQL) in which the truth values are ordered usually define that ordering to be total. In particular, for three-valued logic, the ordering would typically be such that the comparisons TRUE > UNKNOWN and UNKNOWN > FALSE both return TRUE. SQL, however, defines any comparison involving UNKNOWN (even UNKNOWN = UNKNOWN) to return UNKNOWN.

  • Following on from the previous point: TRUE > UNKNOWN and UNKNOWN > FALSE (etc.) are apparently legal SQL expressions—but they’re not, according to the standard, legal “boolean value expressions” (despite the fact that they do return a boolean value ... or perhaps I should say, despite the fact that they return “an SQL truth value”).

Finally, the six SQL expressions (or would-be expressions):

  1. Legal; returns TRUE.

  2. Legal; returns null (UNKNOWN).

  3. Illegal.

  4. Legal; returns TRUE.

  5. Legal; returns null (UNKNOWN).

  6. Illegal.

4.25 No answer provided.

Get SQL and Relational Theory, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.