AVOIDING DUPLICATES IN SQL

The relational model prohibits duplicates; to use SQL relationally, therefore, steps must be taken to prevent them from occurring. Now, if every base table has at least one key (see Chapter 5), then duplicates will never occur in base tables as such. As already mentioned, however, certain SQL expressions can still yield result tables with duplicates. Here are some of the cases in which such tables can be produced:

  • SELECT ALL

  • UNION ALL

  • VALUES (i.e., table value constructor invocations)

Regarding VALUES, see Chapter 3. Regarding ALL, note first that this keyword (and its alternative, DISTINCT) can be specified:

  • In a SELECT clause, immediately following the SELECT keyword

  • In a union, intersection, or difference, immediately following the applicable keyword (UNION, INTERSECT, and EXCEPT, respectively)

  • Inside the parentheses in an invocation of a “set function” such as SUM, immediately preceding the argument expression

Note: DISTINCT is the default for UNION, INTERSECT, and EXCEPT; ALL is the default in the other cases.

Now, the “set function” case is special; you must specify ALL, at least implicitly, if you want the function to take duplicate values into account, which sometimes you do (see Chapter 7). But the other cases have to do with elimination of duplicate rows, which must always be done, at least in principle, if you want to use SQL relationally. Thus, the obvious recommendations in those cases are: Always specify DISTINCT; preferably do so explicitly; and never specify ALL. Then you can just forget about duplicate rows entirely.

In practice, however, matters aren’t quite that simple. Why not? Well, I don’t think I can do better here than repeat the essence of what I wrote in this book’s predecessor (Database in Depth, O’Reilly Media Inc., 2005):

At this point in the original draft, I added that if you find the discipline of always specifying DISTINCT annoying, don’t complain to me—complain to the SQL vendors instead. But my reviewers reacted with almost unanimous horror to my suggestion that you should always specify DISTINCT. One wrote: “Those who really know SQL well will be shocked at the thought of coding SELECT DISTINCT by default.” Well, I’d like to suggest, politely, that (a) those who are “shocked at the thought” probably know the implementations well, not SQL, and (b) their shock is probably due to their recognition that those implementations do such a poor job of optimizing away unnecessary DISTINCTs.[54] If I write SELECT DISTINCT SNO FROM S ..., that DISTINCT can safely be ignored. If I write either EXISTS (SELECT DISTINCT ...) or IN (SELECT DISTINCT ...), those DISTINCTs can safely be ignored. If I write SELECT DISTINCT SNO FROM SP ... GROUP BY SNO, that DISTINCT can safely be ignored. If I write SELECT DISTINCT ... UNION SELECT DISTINCT ..., those DISTINCTs can safely be ignored. And so on. Why should I, as a user, have to devote time and effort to figuring out whether some DISTINCT is going to be a performance hit and whether it’s logically safe to omit it?—and to remembering all of the details of SQL’s inconsistent rules for when duplicates are automatically eliminated and when they’re not?

Well, I could go on. However, I decided—against my own better judgment, but in the interest of maintaining good relations (with my reviewers, I mean)—not to follow my own advice elsewhere in this book but only to request duplicate elimination explicitly when it seemed to be logically necessary to do so. It wasn’t always easy to decide when that was, either. But at least now I can add my voice to those complaining to the vendors, I suppose.

So the recommendation (sadly) boils down to this: First, make sure you know when SQL eliminates duplicates without you asking it to. Second, in those cases where you do have to ask, make sure you know whether it matters if you don’t. Third, in those cases where it matters, specify DISTINCT (but, as Hugh Darwen once said, be annoyed about it). And never specify ALL!



[54] The implication is that SELECT DISTINCT might take longer to execute than SELECT ALL, even if the DISTINCT is effectively a “no op.” Well, that might be so; I don’t want to labor the point; I’ll just observe that the reason those implementations typically can’t optimize away unnecessary DISTINCTs is that they don’t understand how key inference works (i.e., they can’t figure out the keys that apply to the result of an arbitrary table expression). This latter issue is explored in depth in a paper by Hugh Darwen, “The Role of Functional Dependence in Query Decomposition” (see Appendix G).

Get SQL and Relational Theory, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.