A very common modeling mistake is to associate large numbers of possible characteristics within a relation, which may result in a table with a large number of columns. Some scientific disciplines may require a very detailed characterization of objects under study, and thus require a large number of attributes, but this is rarely the case in business applications. In any case, a sure sign that a database design is flawed is when columns of some prominent tables mostly contain null values , and especially when two columns cannot possibly contain a value at the same time; if one is defined, the other must be null, and vice versa. This condition would undoubtedly indicate a violation of either 2NF or 3NF.
If we admit that a row in a table represents a statement about the characteristics of a given "thing," indicating that "we don't know" for most characteristics seriously downgrades the table as a source of reliable information. This may be a minor inconvenience if the data is stored for informative purpose only. It becomes a major issue if the unknown values are supposed to help us define a result set, and this state of affairs is indicative of a flawed model. All columns in a row should ultimately contain a value, even if business processes are such that various pieces of information are entered from more than one source and/or at different points in time. A stamp collector might likewise keep some room in an album for a series temporarily absent from the collection. But even so, there is a risk of wasting storage if it is actually reserved because one always tailors for the maximum size. There is also a risk of very serious performance problems if only placeholders are used and data goes to some remote overflow area when it is entered at last.
The existence of null values also raises an important point with
regard to relational modeling, which is the main foundation for the
query optimizer. The completeness of a relational model is founded on
the application of two-valued logic ; in which things are or they
aren't. Any in-between case, a null value, is
indeterminate; but in a
conditions cannot be indeterminate. They are true or they are false,
because you return a row or you don't; you cannot return a row with a
"maybe this one answers the question but I'm not really sure" qualifier.
The transition from the three-valued logic implied
by nulls (true, false, or indeterminate) to the two-valued logic of the
result set is perilous. This is why all SQL practitioners can recall
cases when what looked like a good SQL query failed to return the proper
result set because of an encounter with null values. For instance, if a
color contains the
BLACK, this condition:
where color not in ('BLUE', 'BLACK', null)
will result in no row being returned, because we don't know what
null is and the SQL engine will
consider that there is a possibility that it might be
where color in ('BLUE', 'BLACK', null)
will return all rows for which
BLACK, and nothing else (remember, we have no
BLUE in our table), since there is a
null would be
GREEN. As you can see, an SQL engine is even
more risk-averse than a banker. Finding an explicit
null inside an
) list is, of course, unusual; but such a situation may occur
if, instead of an explicit list, we have a subquery and fail to ensure
that no null value is returned by that subquery.
A representation of customers can provide a very good example of the difficulties inherent to dealing with missing information. Each customer has an address, which is normally the address that will appear on an invoice to that customer. But what if the address to which we must ship our goods is different? Must we consider the shipping address to be a characteristic of the order? It can make sense if we sell once, only to never see customers again. If we are not a funeral parlor, however, and especially if we repeatedly ship goods to the same address, it makes no sense at all from a business point of view. Entering the same data over and over again, besides being a waste of time, also increases the risk of a mistake—hence goods get sent to the wrong address, creating a dissatisfied customer or perhaps an ex-customer. The shipping address is, obviously, a characteristic of the customer, not of the order. This situation ought to have been resolved in the analysis of dependencies during the original design of the model.
It is also possible to have the accounting department at a
location different from the official, customer delivery address if the
customer is a company. So, for one customer, we may have one "official"
address, a billing address, and also a shipping address. It is quite
common to see
customer tables with
three sets of columns (each set describing one address) for this
However, if we can have all these addresses, what is likely to be the most common case? Well, it is quite possible that in 90% of the cases we shall have only one useful address, the official address. So, what must we do with all our other columns? Two possibilities come to mind:
This is not a very sound strategy, because this will require our programs to use implicit rules , such as "if the billing address is undefined, then send the invoice to the corporate address." The logic of such programs will become much more complicated, with an increased risk of bugs entering the code.
This approach will require special processing during data entry, by a trigger perhaps. In such a case the overhead may not matter much, but in another case the overhead might matter a lot. Moreover, we must also take care of replicating changes--each update of the corporate address must be replicated to those of the other addresses that are identical, for fear of inconsistency.
Both of these scenarios betray a critical lack of understanding on the part of the original modelers. Using null values and implicit rules is a classic fudge to accommodate three-valued logic. The use of nulls inevitably introduces three-valued logic, which immediately introduces semantic inconsistency ; no amount of clever programming can remove semantic issues. Replicating data illustrates what happens when dependencies have not been properly analyzed.
One solution to our address conundrum might be to get the address
information out of the
table. One design we may contemplate is to store each address in an
address table, together with a
customer identifier and some column (a bit mask, perhaps) indicating the
role of the address. But this is not necessarily
the best solution, because issues such as the true meaning of addresses
often appear after programs have been rushed into production and an
attempt to remodel the original data as part of a later release can
introduce insuperable problems.
We have so far assumed that we have one
shipping address for each customer, which may or may not be identical to
the corporate, registered address. What if we send our invoices to a
single place but must ship our goods to many different branches, with
several distinct shipments belonging to the same invoice? This is not
necessarily unusual! It is no longer workable for our design to have a
single (mostly null) "shipping address" (represented by several columns)
customer table. We are,
ironically, back to the "shipping address is a characteristic of the
order" situation. This means that if we want to refer (especially
repeatedly) to addresses in orders, we must associate some kind of
purpose-built identifier to our addresses, which will spare us repeating
the whole shipping address in each order (normalization in action). Or
perhaps we should begin to contemplate the introduction of a
There is no such thing as the totally perfect design for the customers/addresses conundrum. I have just wandered through likely problems and tried to sketch some of the possible solutions. But there will be one solution that works best in your case, and many other solutions that will lead to the risks of inconsistencies. With an inappropriate solution, code will be at best more complicated than necessary with very high odds of being underperforming as well.
The question of null values is probably the thorniest issue of the relational theory. Dr. E.F. Codd, the father of the relational model, introduced null values early, and explicitly asked in the 3rd of the 12 rules that he published in 1985 for a systematic treatment of null values. (The 12 rules were a concise definition of the required properties of a relational database.) However, the battle is still raging among theorists. The problem is that "not known" may encompass quite a number of different cases. Let's consider a list of famous writers, each with a birth date and a death date. A null birth date would unambiguously mean "unknown." But what does a null death date mean? Alive? We don't know when this author died? We don't know whether this author is alive or not?
I cannot resist the pleasure of quoting the immortal words of the then-U.S. Secretary of Defense, Mr. Donald Rumsfeld, at a February 2002 news briefing of his department:
As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.
I don't find it unusual to have null values for, to put it in Rumsfeldese, "known unknowns," attributes that are known to exist and have some value we don't know at one point in time, for various reasons. For the rest, speculating leads nowhere. Strangely, some of the most interesting usages of null values may perfectly involve nothing but tables where all columns of all rows contain values: null values can be generated through outer joins. Some efficient techniques for checking the absence of particular values that I discuss in Chapter 6 are precisely based on outer joins and tests on null values.