You have seen in the first part of this chapter exactly how SQL includes both relational and non-relational characteristics. You have also seen how this affects the efficient (and not-so-efficient) workings of the database optimizer. From this point forward, and bearing in mind the lessons of the first part of this chapter, we can concentrate on the key factors that must be considered when using SQL. In my view, there are five main factors:
The volume of data we need to read is probably the most
important factor to take into account; an execution plan that is
perfectly suitable for a fourteen-row
emp table and a four-row
dept table may be entirely inappropriate for
dealing with a 15 million-row
financial_flows table against which we have
to join a 5 million-row
table. Note that even a 15 million-row table will not be considered
particularly large by the standards of many companies. As a
consequence, it is hard to pronounce on the efficiency of a query
before having run it against the target volume of data.
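The point can be sketched with the classic toy tables. The following is a minimal illustration using Python's sqlite3 module as a stand-in DBMS; the emp/dept schema and the row contents are the traditional textbook ones, not taken from any particular system:

```python
import sqlite3

# A minimal sketch: the classic emp/dept join at toy volume.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table dept (deptno integer primary key, dname text);
    create table emp  (empno integer primary key, ename text,
                       deptno integer references dept);
""")
con.executemany("insert into dept values (?, ?)",
                [(10, 'ACCOUNTING'), (20, 'RESEARCH'),
                 (30, 'SALES'), (40, 'OPERATIONS')])
con.executemany("insert into emp values (?, ?, ?)",
                [(i, f'EMP{i}', 10 + 10 * (i % 4)) for i in range(1, 15)])

# At 14 x 4 rows, any plan is instantaneous; the very same SQL against
# multi-million-row tables might call for a completely different plan.
plan = con.execute("""
    explain query plan
    select e.ename, d.dname
    from emp e join dept d on d.deptno = e.deptno
""").fetchall()
for row in plan:
    print(row)
```

Whatever plan the optimizer reports here, its choice tells us nothing reliable about how the same statement would be executed, or should be executed, at production volume.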
When we write an SQL statement, in most cases it will involve
filtering conditions located in where
clauses, and we may have several
where clauses—a major one as well as minor
ones—in subqueries or views (regular views or in-line views). A
filtering condition may be efficient or inefficient. However, the
significance of efficient or
inefficient is strongly affected by other
factors, such as physical implementation (as discussed in Chapter 5) and once again, by how much
data we have to wade through.
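How strongly physical implementation bears on the efficiency of one and the same filtering condition can be sketched as follows (again using sqlite3 as a stand-in; the financial_flows table and its columns are invented for the illustration):

```python
import sqlite3

# Sketch: the "same" filter can be cheap or costly depending on the
# physical implementation, here with and without an index.
con = sqlite3.connect(":memory:")
con.execute("create table financial_flows (id integer primary key, "
            "flow_date text, amount real)")
con.executemany("insert into financial_flows values (?, ?, ?)",
                [(i, f"2024-{1 + i % 12:02d}-01", i * 1.5)
                 for i in range(1, 10001)])

query = "select count(*) from financial_flows where flow_date = '2024-06-01'"

# Without an index, the condition forces a full scan of the table.
scan_plan = con.execute("explain query plan " + query).fetchone()[-1]

con.execute("create index flows_date_idx on financial_flows (flow_date)")
# With the index, the same condition typically becomes an index search.
index_plan = con.execute("explain query plan " + query).fetchone()[-1]

print(scan_plan)
print(index_plan)
```

The condition itself has not changed between the two executions; only the physical layer underneath it has, and with it the cost of wading through the data.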
We need to approach the subject of defining the result in several parts, by considering filtering, the central SQL statements, and the impact of large data volumes on our queries. But this is a particularly complex area that needs to be treated in some depth, so I'll reserve this discussion until later in this chapter, in the major section entitled "Filtering."
An important and often overlooked factor is how much data a query returns (or how much data a statement changes). This often depends on the size of the tables and the details of the filtering, but not always. Typically, several selection criteria that are each of little selective value on their own may result in highly efficient filtering when used in combination with one another. For example, retrieving students' names based on whether they received a science degree or an arts degree will give a large result set in either case, but if both criteria are applied together (students who studied under both disciplines), the result set collapses to a tiny number.
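The students example can be sketched concretely (sqlite3 again; the degrees table and the 501-row populations are invented to make the collapse visible):

```python
import sqlite3

# Sketch: either criterion alone selects many rows, but their
# combination collapses the result set.
con = sqlite3.connect(":memory:")
con.execute("create table degrees (student text, discipline text)")
rows = [(f"science_{i}", "SCIENCE") for i in range(500)]
rows += [(f"arts_{i}", "ARTS") for i in range(500)]
rows += [("dual_0", "SCIENCE"), ("dual_0", "ARTS")]  # one dual-degree student
con.executemany("insert into degrees values (?, ?)", rows)

science = {r[0] for r in con.execute(
    "select student from degrees where discipline = 'SCIENCE'")}
arts = {r[0] for r in con.execute(
    "select student from degrees where discipline = 'ARTS'")}

# Each criterion alone returns hundreds of rows...
print(len(science), len(arts))   # 501 501
# ...but students holding BOTH degrees are a tiny intersection.
both = con.execute("""
    select student from degrees where discipline = 'SCIENCE'
    intersect
    select student from degrees where discipline = 'ARTS'
""").fetchall()
print(both)   # [('dual_0',)]
```

Neither criterion is selective by itself; their intersection is extremely selective, which is precisely what makes the combination efficient as a filter.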
In the case of queries in particular, the size of the result set
matters not so much from a technical standpoint, but mostly because of
the end user's perception. To a very large extent, end users adjust
their patience to the number of rows they expect: when they ask for
one needle, they pay little attention to the size of the haystack. The
extreme case is a query that returns nothing, and a good developer
should always try to write queries that return few or no rows as fast
as possible. There are few experiences more frustrating than waiting
for several minutes before finally seeing a "no data found" message.
This is especially annoying if you have mistyped something, realized
your error just after hitting Enter, and then have been unable to
abort the query. End users are willing to wait to get a lot of data,
but not to get an empty result. If we consider that each of our
filtering criteria defines a particular result set and the final
result set is either the intersection (when conditions are
anded) or the union (when conditions are
ored together) of all the
intermediate result sets, a zero result is most likely to come from the
intersection of small intermediate result sets. In other words, the
(relatively) most precise criteria are usually the primary reason for
a zero result set. Whenever there is the slightest possibility that a
query might return no data, the most likely condition that would
result in a null return should be checked first—especially if it can
be done quickly. Needless to say, the order of evaluation of criteria
is extremely context-sensitive as you shall see later under "Filtering."
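The fail-fast idea can be sketched at the application level, under the assumption that one criterion is both cheap to probe and the most likely to return nothing (sqlite3 as a stand-in DBMS; the tables, the orders_for helper, and the customer names are all invented for this illustration):

```python
import sqlite3

# Sketch: when a query may return nothing, probe the most selective
# criterion first if it can be checked cheaply, instead of paying for
# the full query before discovering the result is empty.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table orders (order_id integer primary key, customer_id integer);
    create table customers (customer_id integer primary key, name text);
""")
con.executemany("insert into customers values (?, ?)",
                [(i, f"cust{i}") for i in range(1, 101)])
con.executemany("insert into orders values (?, ?)",
                [(i, 1 + i % 100) for i in range(1, 1001)])

def orders_for(con, name):
    # Cheap, selective probe first: does the customer exist at all?
    row = con.execute("select customer_id from customers where name = ?",
                      (name,)).fetchone()
    if row is None:
        return []          # fail fast: no point running the larger query
    return con.execute(
        "select order_id from orders where customer_id = ?",
        (row[0],)).fetchall()

print(len(orders_for(con, "cust1")))      # 10
print(orders_for(con, "no such name"))    # []
```

A mistyped name is detected by a single primary-key probe rather than after the full retrieval has run its course, which is exactly the behavior an impatient end user expects.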
The number of tables involved in a query will naturally have some influence on performance. This is not because a DBMS engine performs joins badly—on the contrary, modern systems are able to join large numbers of tables very efficiently.
The perception of poor join performance is another enduring myth associated with relational databases. Folklore has it that one should not join too many tables, with five often suggested as the limit. In fact, you can quite easily have 15 table joins perform extremely well. But there are additional problems associated with joining a large number of tables, of which the following are examples:
When you routinely need to join, say, 15 tables, you can legitimately question the correctness of the design; keep in mind what I said in Chapter 1—that a row in a table states some kind of truth and can be compared to a mathematical axiom. By joining tables, we derive other truths. But there is a point at which we must decide whether something is an obvious truth that we can call an axiom, or whether it is a less obvious truth that we must derive. If we spend much of our time deriving our truths, perhaps our axioms are poorly chosen in the first place.
For the optimizer, the complexity increases exponentially as the number of tables increases. Once again, the excellent work usually performed by a statistical optimizer may comprise a significant part of the total response time for a query, particularly when the query is run for the first time. With large numbers of tables, it is quite impractical for the optimizer to explore all possible query paths. Unless a query is written in a way that eases the work of the optimizer, the more complex the query, the greater the chance that the optimizer will bet on the wrong horse.
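To get a feel for why exhaustive exploration is impractical, consider just one dimension of the search space: the number of possible join orders grows factorially with the number of tables (and this ignores the further choice of join methods and access paths at each step):

```python
import math

# Possible join orders (permutations of the tables) for n-table joins.
for n in (4, 8, 15):
    print(n, math.factorial(n))
```

Four tables give a trivial 24 orders; 15 tables give more than a trillion, which is why an optimizer must prune aggressively and why a badly written complex query leaves it more room to bet on the wrong horse.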
When we write a complex query involving many tables, and
when joins can be written in several fairly distinct ways, the
odds are high that we'll pick the wrong construct. If we join
tables A to B to C to D, the optimizer may not have all the
information present to know that A can be very efficiently
joined directly to D, particularly if that join happens to be a
special case. A sloppy developer trying to fix duplicate rows
with distinct can also
easily overlook a missing join condition.
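The distinct trap can be made concrete with a deliberately broken two-table join (sqlite3 as a stand-in; the emp/dept contents are the usual textbook ones):

```python
import sqlite3

# Sketch: a missing join condition silently produces a Cartesian
# product; slapping DISTINCT on top masks the symptom instead of
# fixing the join.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table emp (empno integer primary key, ename text,
                      deptno integer);
    create table dept (deptno integer primary key, dname text);
""")
con.executemany("insert into emp values (?, ?, ?)",
                [(1, 'SMITH', 10), (2, 'JONES', 20)])
con.executemany("insert into dept values (?, ?)",
                [(10, 'ACCOUNTING'), (20, 'RESEARCH')])

# Join condition forgotten: 2 x 2 = 4 rows instead of 2.
bad = con.execute("select e.ename, d.dname from emp e, dept d").fetchall()
print(len(bad))     # 4

# A sloppy "fix": DISTINCT makes the row count look right again, while
# each employee has silently been matched with every department.
sloppy = con.execute(
    "select distinct e.ename from emp e, dept d").fetchall()
print(len(sloppy))  # 2

# Restoring the join condition fixes the query, not the symptom.
good = con.execute("""
    select e.ename, d.dname
    from emp e join dept d on d.deptno = e.deptno
""").fetchall()
print(len(good))    # 2
```

At two rows per table the Cartesian product is easy to spot; at 15 tables the same mistake can hide behind a distinct for years, at considerable cost.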
Be aware that the apparent number of tables involved in a query can be deceptive; some of the tables may actually be views, and sometimes pretty complex ones, too. Just as with queries, views can also have varying degrees of complexity. They can be used to mask columns, rows, or even a combination of rows and columns to all but a few privileged users. They can also be used as an alternate perspective on the data, building relations that are derived from the existing relations stored as tables. In cases such as these, a view can be considered shorthand for a query, and this is probably one of the most common usages of views. With increasingly complex queries, there is a temptation to break a query down into a succession of individual views, each representing a component of the greater query.
As with most extreme positions, it would be absurd to banish views altogether. Many of them are rather harmless animals. However, when a view is itself used in a rather complex query, in most cases we are only interested in a small fraction of the data returned by the view—possibly in a couple of columns, out of a score or more. The optimizer may attempt to recombine a simple view into a larger query statement. However, once a query reaches a relatively modest level of complexity, this approach may become too complex in itself to enable efficient processing.
In some cases a view may be written in a way that effectively
prevents the optimizer from combining it into the larger statement.
I have already mentioned
those virtual columns used in Oracle to indicate the order in which
rows are initially found. When
rownums are used inside a view, a further
level of complexity is introduced. Any attempt to combine a view
that references a
rownum into a
larger statement would be almost guaranteed to change the subsequent
rownum order, and therefore the
optimizer doesn't permit a query rewrite in those circumstances. In
a complicated query, such a view will necessarily be executed in
isolation. In quite a number of cases then, the DBMS optimizer will
push a view as is into a statement,[*] running it as a step in the statement execution, and
using only those elements that are required from the result of the view.
Frequently, many of the operations executed in a view
(typically joins to return a description associated with codes) will
be irrelevant in the context of a larger query, or a query may have
special search criteria that would have been particularly selective
when applied to the tables underlying the view. For instance, a
union may prove to be
totally unnecessary because the view is the
union of several tables representing
subtypes, and the larger query filters on only one of the subtypes.
There is also the danger of joining a view with a table that itself
appears in the same view, thus forcing multiple passes over this
table and probably hitting the same rows several times when one pass
would have been quite sufficient.
When a view returns much more data than required in the context of a query that references that view, dramatic performance gains can often be obtained by eliminating the view (or using a simpler version of the view). Begin by replacing the view reference in the main query with the underlying SQL query used to define the view. With the components of the view in full sight, it becomes easy to remove everything that is not strictly necessary. More often than not, it's precisely what isn't necessary that prevents the view from being merged by the optimizer, and a simpler, cut-down view may give excellent results. When the query is correctly reduced to its most basic components, it runs much faster.
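The unwrapping exercise can be sketched on a small scale (sqlite3 as a stand-in; the flows and currencies tables and the flow_details view are invented to stand for the typical "codes plus descriptions" view):

```python
import sqlite3

# Sketch: a view that joins a lookup table to decorate codes with
# labels, referenced by a query that needs none of the decoration.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table flows (flow_id integer primary key, currency_code text,
                        amount real);
    create table currencies (currency_code text primary key, label text);
    -- The "kitchen sink" view: joins in descriptions nobody may need.
    create view flow_details as
        select f.flow_id, f.amount, f.currency_code, c.label
        from flows f join currencies c
          on c.currency_code = f.currency_code;
""")
con.executemany("insert into currencies values (?, ?)",
                [('EUR', 'Euro'), ('USD', 'US Dollar')])
con.executemany("insert into flows values (?, ?, ?)",
                [(i, 'EUR' if i % 2 else 'USD', 10.0 * i)
                 for i in range(1, 7)])

# Through the view: drags the join to currencies along for the ride.
total_view = con.execute(
    "select sum(amount) from flow_details where currency_code = 'EUR'"
).fetchone()[0]

# View unwrapped and cut down to its necessary components: same answer,
# one table, no join.
total_direct = con.execute(
    "select sum(amount) from flows where currency_code = 'EUR'"
).fetchone()[0]
print(total_view, total_direct)   # 90.0 90.0
```

Once the body of the view is in full sight, the join to currencies is obviously dead weight for this particular query and can be cut away without changing the result.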
Many developers may hesitate to push the code for a very complex view into an already complex query, not least because it can make a complex situation even more complicated. The exercise of developing and factoring a complex SQL expression may indeed appear to be daunting. It is, however, an exercise quite similar to the development of mathematical expressions, as practiced in high school. It is, in my view, a very formative exercise and well worth the effort of mastering. It is a discipline that provides a very sound understanding of the inner workings of a query for developers anxious to improve their skills, and in most cases the results can be highly rewarding.
Finally, concurrency is a factor that you must carefully take into account when designing your SQL code. Concurrency is usually a concern when writing to the database, where block-access contention, locking, and latching (the locking of internal DBMS resources) are the more obvious problem areas; even read consistency can in some cases lead to some degree of contention. Any server, no matter how impressive its specification, will always have a finite capacity. The ideal plan for a query running on a machine with little to no concurrency is not necessarily the same as the ideal plan for the same query running on the same machine with a high level of concurrency. Sorts may no longer find the memory they need and may instead resort to writing to disk, thus creating a new source of contention. Some CPU-intensive operations—for example, the computation of complicated functions, repetitive scanning of index blocks, and so forth—may cause the computer to overload. I have seen cases in which more physical I/Os resulted in a significantly better time to perform a given task. In those cases, there was a high level of concurrency for CPU-intensive operations, and when some processes had to wait for I/Os, the overworked CPUs were relieved and could run other processes, thus ensuring a better overlap. We must often think in terms of global throughput of one particular business task, rather than in terms of individual user response time.
Chapter 9 examines concurrency in greater detail.
[*] The optimizer may also sometimes push criteria down into the view.