DUPLICATES: FURTHER ISSUES

There’s much, much more that could be said regarding duplicates and what’s wrong with them, but I’ll limit myself here to just three further points. First of all, you might reasonably object that in practice base tables, at least, never do include duplicates, and the foregoing example thus intuitively fails. True enough (probably); but the trouble is, SQL can generate duplicates in query results. Indeed, different formulations of the same query can produce results with different degrees of duplication, even if the input tables themselves have no duplicates at all. For example, here are two possible formulations of the query “Get supplier numbers for suppliers who supply at least one part” on our usual suppliers-and-parts database (and note here that the input tables certainly don’t contain any duplicates):

image with no caption

At least one of these expressions—which?—will produce a result with duplicates, in general. (Exercise: Given our usual sample data values, what results do the two expressions produce?) So if you don’t want to think of the tables in Figure 4-1 as base tables specifically, fine: Just take them to be the output from previous queries, and the rest of the analysis goes through unchanged.

Second, there’s another at least psychological argument against duplicates that I think is quite persuasive (thanks to Jonathan Gennick for this one): If, in accordance with the n-dimensional perspective on relations introduced in Chapter 3, you think of a table as a plot of points in some n-dimensional space, then duplicate rows clearly don’t add anything—they simply amount to plotting the same point twice.

My last point is this. Suppose table T does permit duplicates. Then we can’t tell the difference between “genuine” duplicates in T and duplicates that arise from errors in data entry on T! For example, suppose the person responsible for data entry unintentionally enters the very same row twice—e.g., by inadvertently hitting the return key twice (easily done, by the way). Then there’s no straightforward way to delete the “second” row without deleting the “first” as well. Note that we presumably do want to delete that “second” row, since it shouldn’t have been entered in the first place.

Get SQL and Relational Theory, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.