Chapter 1. Introduction

Database Design

As mentioned in the Preface, one purpose of this book is to explain the basic concepts of modern relational database theory and show how these concepts are realized in Microsoft Access. Allow me to amplify on this rather lofty goal.

To take a very simple view, which will do nicely for the purposes of this introductory discussion, a database is just a collection of related data. A database management system , or DBMS, is a system that is designed for two main purposes:

  • To add, delete, and update the data in the database

  • To provide various ways to view (on screen or in print) the data in the database

If the data are simple, and there is not very much data, then a database can consist of a single table. In fact, a simple database can easily be maintained even with a word processor!

To illustrate, suppose you want to set up a database for the books in a library. Purely for the sake of illustration, suppose the library contains 14 books. The same discussion would apply to a library of perhaps a few hundred books. Table 1.1 shows the LIBRARY_FLAT database in the form of a single table.

Table 1-1. The LIBRARY_FLAT Sample Database

ISBN

Title

AuID

AuName

AuPhone

PubID

PubName

PubPhone

Price

1-1111-1111-1

C++

4

Roman

444-444-4444

1

Big House

123-456-7890

$29.95

0-99-999999-9

Emma

1

Austen

111-111-1111

1

Big House

123-456-7890

$20.00

0-91-335678-7

Faerie Queene

7

Spenser

777-777-7777

1

Big House

123-456-7890

$15.00

0-91-045678-5

Hamlet

5

Shakespeare

555-555-5555

2

Alpha Press

999-999-9999

$20.00

0-103-45678-9

Iliad

3

Homer

333-333-3333

1

Big House

123-456-7890

$25.00

0-12-345678-9

Jane Eyre

1

Bronte

111-111-1111

3

Small House

714-000-0000

$49.00

0-99-777777-7

King Lear

5

Shakespeare

555-555-5555

2

Alpha Press

999-999-9999

$49.00

0-555-55555-9

Macbeth

5

Shakespeare

555-555-5555

2

Alpha Press

999-999-9999

$12.00

0-11-345678-9

Moby Dick

2

Melville

222-222-2222

3

Small House

714-000-0000

$49.00

0-12-333433-3

On Liberty

8

Mill

888-888-8888

1

Big House

123-456-7890

$25.00

0-321-32132-1

Balloon

13

Sleepy

321-321-1111

3

Small House

714-000-0000

$34.00

0-321-32132-1

Balloon

11

Snoopy

321-321-2222

3

Small House

714-000-0000

$34.00

0-321-32132-1

Balloon

12

Grumpy

321-321-0000

3

Small House

714-000-0000

$34.00

0-55-123456-9

Main Street

10

Jones

123-333-3333

3

Small House

714-000-0000

$22.95

0-55-123456-9

Main Street

9

Smith

123-222-2222

3

Small House

714-000-0000

$22.95

0-123-45678-0

Ulysses

6

Joyce

666-666-6666

2

Alpha Press

999-999-9999

$34.00

1-22-233700-0

Visual Basic

4

Roman

444-444-4444

1

Big House

123-456-7890

$25.00

(Columns labeled AuID and PubID are included for indentification purposes, i.e., to uniquely identify an author or a publisher. In any case, their presence or absence will not affect the current discussion.)

LIBRARY_FLAT (Table 1.1) was created using Microsoft Word. For such a simple database, Word has enough power to fulfill the two goals mentioned earlier. Certainly, adding, deleting, and editing the table presents no particular problems (provided we know how to manage tables in Word). In addition, if we want to sort the data by author, for example, we can just select the table and choose Sort from the Table menu in Microsoft Word. Extracting a portion of the data in the DELETE_ME table (i.e., creating a view) can be done by making a copy of the table and then deleting appropriate rows and/or columns.

Why Use a Relational Database Design?

Thus, maintaining a simple, so-called flat database consisting of a single table does not require much knowledge of database theory. On the other hand, most databases worth maintaining are quite a bit more complicated than that. Real-life databases often have hundreds of thousands or even millions of records, with data that are very intricately related. This is where using a full-fledged relational database program becomes essential. Consider, for example, the Library of Congress, which has over 16 million books in its collection. For reasons that will become apparent soon, a single table simply will not do for this database!

Redundancy

The main problems associated with using a single table to maintain a database stem from the issue of unnecessary repetition of data, that is, redundancy . Some repetition of data is always necessary, as we will see, but the idea is to remove as much unnecessary repetition as possible.

The redundancy in the LIBRARY_FLAT table (Table 1.1) is obvious. For instance, the name and phone number of Big House publishers is repeated six times in the table, and Shakespeare’s phone number is repeated thrice.

In an effort to remove as much redundancy as possible from a database, a database designer must split the data into multiple tables. Here is one possibility for the LIBRARY_FLAT example, which splits the original database into four separate tables.

  • A BOOKS table, shown in Table 1.2, in which each book has its own record

  • An AUTHORS table, shown in Table 1.3, in which each author has his or her own record

  • A PUBLISHERS table, shown in Table 1.4, in which each publisher has its own record

  • BOOK/AUTHOR table, shown in Table 1.5, the purpose of which we will explain a bit later

Table 1-2. The BOOKS Table from the LIBRARY_FLAT Database

ISBN

Title

PubID

Price

0-555-55555-9

Macbeth

2

$12.00

0-91-335678-7

Faerie Queene

1

$15.00

0-99-999999-9

Emma

1

$20.00

0-91-045678-5

Hamlet

2

$20.00

0-55-123456-9

Main Street

3

$22.95

1-22-233700-0

Visual Basic

1

$25.00

0-12-333433-3

On Liberty

1

$25.00

0-103-45678-9

Iliad

1

$25.00

1-1111-1111-1

C++

1

$29.95

0-321-32132-1

Balloon

3

$34.00

0-123-45678-0

Ulysses

2

$34.00

0-99-777777-7

King Lear

2

$49.00

0-12-345678-9

Jane Eyre

3

$49.00

0-11-345678-9

Moby Dick

3

$49.00

Table 1-3. The AUTHORS Table from the LIBRARY_FLAT Database

AuID

AuName

AuPhone

1

Austen

111-111-1111

12

Grumpy

321-321-0000

3

Homer

333-333-3333

10

Jones

123-333-3333

6

Joyce

666-666-6666

2

Melville

222-222-2222

8

Mill

888-888-8888

4

Roman

444-444-4444

5

Shakespeare

555-555-5555

13

Sleepy

321-321-1111

9

Smith

123-222-2222

11

Snoopy

321-321-2222

7

Spenser

777-777-7777

Table 1-4. The PUBLISHERS Table from the LIBRARY_FLAT Database

PubID

PubName

PubPhone

1

Big House

123-456-7890

2

Alpha Press

999-999-9999

3

Small House

714-000-0000

Table 1-5. The BOOK/AUTHOR Table from the LIBRARY_FLAT Database

ISBN

AuID

0-103-45678-9

3

0-11-345678-9

2

0-12-333433-3

8

0-12-345678-9

1

0-123-45678-0

6

0-321-32132-1

11

0-321-32132-1

12

0-321-32132-1

13

0-55-123456-9

9

0-55-123456-9

10

0-555-55555-9

5

0-91-045678-5

5

0-91-335678-7

7

0-99-777777-7

5

0-99-999999-9

1

1-1111-1111-1

4

1-22-233700-0

4

Note that now the name and phone number of Big House appears only once in the database (in the PUBLISHERS table), as does Shakespeare’s phone number (in the AUTHORS table).

Of course, there are still some duplicated data in the database. For instance, the PubID information appears in more than one place in these tables. As mentioned earlier, we cannot eliminate all duplicate data and still maintain the relationships between the data.

To get a feel for the reduction in duplicate data achieved by the four-table approach, imagine (as is reasonable) that the database also includes the address of each publisher. Then Table 1.1 would need a new column containing 14 addresses—many of which are duplicates. On the other hand, the four-table database needs only one new column in the PUBLISHERS table, adding a total of three distinct addresses.

To drive the difference home, consider the 16-million-book database of the Library of Congress. Suppose the database contains books from 10,000 different publishers. A publisher’s address column in a flat database design would contain 16 million addresses, whereas a multitable approach would require only 10,000 addresses. Now, if the average address is 50 characters long, then the multitable approach would save

(16,000,000 – 10,000) 50 = 799 million characters

Assuming that each character takes 2 bytes (in the Unicode that is used internally by Microsoft Access), the single-table approach wastes about 1.6 gigabytes of space, just for the address field!

Indeed, the issue of redundancy alone is quite enough to convince a database designer to avoid the flat database approach. However, there are several other problems with flat databases, which we now discuss.

Multiple-value problems

It is clear that some books in our database are authored by multiple authors. This leaves us with three choices in a single-table flat database:

  • We can accommodate multiple authors with multiple rows—one for each author, as in the LIBRARY_FLAT table (Table 1.1) for the books Balloon and Main Street.

  • We can accommodate multiple authors with multiple columns in a single row—one for each author.

  • We can include all authors’ names in one column of the table.

The problem with the multiple-row choice is that all of the data about a book must be repeated as many times as there are authors of the book—an obvious case of redundancy. The multiple column approach presents the problem of guessing how many Author columns we will ever need, and creates a lot of wasted space (empty fields) for books with only one author. It also creates major programming headaches.

The third choice is to include all authors’ names in one cell, which can lead to trouble of its own. For example, it becomes more difficult to search the database for a single author. Worse yet, how can we create an alphabetical list of the authors in the table?

Update anomalies

In order to update, say, a publisher’s phone number in the LIBRARY_FLAT database (Table 1.1), it is necessary to make changes in every row containing that number. If we miss a row, we have produced a so-called update anomaly , resulting in an unreliable table.

Insertion anomalies

Difficulties will arise if we wish to insert a new publisher in the LIBRARY_FLAT database (Table 1.1), but we do not yet have information about any of that publisher’s books. We could add a new row to the existing table and place NULL values in all but the three publisher-related columns, but this may lead to trouble. (A NULL is a value intended to indicate a missing or unknown value for a field.) For instance, adding several such publishers means that the ISBN column, which should contain unique data, will contain several NULL values. This general problem is referred to as an insertion anomaly.

Deletion anomalies

In contrast to the preceding problem, if we delete all book entries for a given publisher, for instance, then we will also lose all information about that publisher. This is a deletion anomaly .

This list of potential problems should be enough to convince us that the idea of using a single-table database is generally not smart. Good database design dictates that the data be divided into several tables, and that relationships be established between these tables. Because a table describes a “relation,” such a database is called a relational database. On the other hand, relational databases do have their complications. Here are a few examples.

Avoiding data loss

One complication in designing a relational database is figuring out how to split the data into multiple tables so as not to lose any information. For instance, if we had left out the BOOK/AUTHOR table (Table 1.5) in our previous example, there would be no way to determine the authors of each book. In fact, the sole purpose of the BOOK/AUTHOR table is so that we do not lose the book/author relationship!

Maintaining relational integrity

We must be careful to maintain the integrity of the various relationships between tables when changes are made. For instance, if we decide to remove a publisher from the database, it is not enough just to remove that publisher from the PUBLISHERS table, for this would leave dangling references to that publisher in the BOOKS table.

Creating views

When the data are spread throughout several tables, it becomes more difficult to create various views of the data. For instance, we might want to see a list of all publishers that publish books priced under $10.00. This requires gathering data from more than one table. The point is that, by breaking data into separate tables, we must often go to the trouble of piecing the data back together in order to get a comprehensive view of those data!

Summary

In summary, it is clear that, to avoid redundancy problems and various unpleasant anomalies, a database needs to contain multiple tables, with relationships defined between these tables. On the other hand, this raises some issues, such as how to design the tables in the database without losing any data, and how to piece together the data from multiple tables to create various views of that data. The main goal of the first part of this book is to explore these fundamental issues.

Get Access Database Design and Programming, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.