The term cache has French roots and means, literally, to hide. As a data processing term, caching refers to the storage of recently retrieved computer information for future reference. The stored information may or may not be used again, so caches are beneficial only when the cost of storing the information is less than the cost of retrieving or computing it again.
The concept of caching has found its way into almost every aspect of computing and networking systems. Computer processors have both data and instruction caches. Computer operating systems have buffer caches for disk drives and filesystems. Distributed (networked) filesystems such as NFS and AFS rely heavily on caching for good performance. Internet routers cache recently used routes. Domain Name System (DNS) servers cache hostname-to-address and other lookups.
Caches work well because of a principle known as locality of reference. There are two flavors of locality: temporal and spatial. Temporal locality means that some pieces of data are more popular than others. CNN’s home page is more popular than mine. Within a given period of time, somebody is more likely to request the CNN page than my page. Spatial locality means that requests for certain pieces of data are likely to occur together. A request for the CNN home page is usually followed by requests for all of the page’s embedded graphics. Caches use locality of reference to predict future accesses based on previous ones. When the prediction is correct, there is a significant performance improvement. In practice, this technique works so well that we would find computer systems unbearably slow without memory and disk caches. Almost all data processing tasks exhibit locality of reference and therefore benefit from caching.
When requested data is found in the cache, we call it a hit. Conversely, referenced data that is not cached is known as a miss. The performance improvement that a cache provides is based mostly on the difference in service times for cache hits compared to misses. The percentage of all requests that are hits is called the hit ratio.
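To make the benefit concrete, here is a small sketch of how the hit ratio and the two service times combine into an expected service time. The service times below are invented for illustration:

```python
def mean_service_time(hit_ratio, hit_time, miss_time):
    """Expected service time for a cache with the given hit ratio.

    Hits are served in hit_time, misses in miss_time (same units).
    """
    return hit_ratio * hit_time + (1 - hit_ratio) * miss_time

# Hypothetical numbers: a hit takes 20 ms, a miss 500 ms.
with_cache = mean_service_time(0.40, 20.0, 500.0)  # 40% hit ratio
no_cache = mean_service_time(0.0, 20.0, 500.0)     # every request misses
```

With these hypothetical numbers, a 40% hit ratio cuts the mean service time from 500 ms to 308 ms.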
Any system that utilizes caching must have mechanisms for maintaining cache consistency. This is the process by which cached copies are kept up-to-date with the originals. We say that cached data is either fresh or stale. Caches can reuse fresh copies immediately, but stale data usually requires validation. The algorithms used to maintain consistency may be either weak or strong. Weak consistency means that the cache sometimes returns outdated information. Strong consistency, on the other hand, means that cached data is always validated before it is used. CPU and filesystem caches require strong consistency. However, some types of caches, such as those in routers and DNS resolvers, are effective even if they return stale information.
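One simple weak-consistency scheme gives each cached copy a freshness lifetime: while the lifetime lasts, the copy is reused without question; afterward, it is considered stale and should be validated. The sketch below is illustrative only, not any particular cache's implementation:

```python
import time

class CacheEntry:
    """A cached object with a freshness lifetime (a TTL), a common
    weak-consistency scheme: entries may become stale before they
    are revalidated against the original."""

    def __init__(self, data, ttl_seconds, now=None):
        self.data = data
        self.expires = (now if now is not None else time.time()) + ttl_seconds

    def is_fresh(self, now=None):
        """Fresh entries can be reused immediately; stale entries
        should be validated with the origin before use."""
        return (now if now is not None else time.time()) < self.expires

# Created at (hypothetical) time 1000.0 with a 60-second lifetime.
entry = CacheEntry(b"<html>...</html>", ttl_seconds=60, now=1000.0)
```

Because the cache may serve the entry right up until the lifetime expires, even if the original changed a moment after it was stored, this scheme is weakly consistent.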
We know that caching plays an important role in modern computer memory and disk systems. Can it be applied to the Web with equal success? Ask different people and you’re likely to get different answers. For some, caching is critical to making the Web usable. Others view caching as a necessary evil. A fraction probably consider it just plain evil [Tewksbury, 1998].
In this book, I’ll talk about applying caching techniques to the World Wide Web and try to convince you that web caching is a worthwhile endeavor. We’ll see how web caches work, how they interact with clients and servers, and the role that HTTP plays. You’ll learn about a number of protocols that are used to build cache clusters and hierarchies. In addition to talking about the technical aspects, I also spend a lot of time on the issues and politics. The Web presents some interesting problems due to its highly distributed nature.
After you’ve read this book, you should be able to design and evaluate a caching proxy solution for your organization. Perhaps you’ll install a single caching proxy on your firewall, or maybe you need many caches located throughout your network. Furthermore, you should be well prepared to understand and diagnose any problems that may arise from the operation or failure of your caches. If you’re a content provider, then I hope I’ll have convinced you to increase the cachability of the information you serve.
Before we can talk more about caching, we need to agree on some terminology. Whenever possible, I use words and meanings taken from Internet standards documents. Unfortunately, colloquial usage of web caching terminology is often just different enough to be confusing.
The fundamental building blocks of the Web (and indeed most distributed systems) are clients and servers. A web server manages and provides access to a set of resources. The resources might be simple text files and images, or something more complex, such as a relational database. Clients, also known as user agents, initiate a transaction by sending a request to a server. The server then processes the request and sends a response back to the client.
On the Web, most transactions are download operations; the client downloads some information from the server. In these cases, the request itself is quite small (about 200 bytes) and contains the name of the resource, plus a small amount of additional information from the client. The information being downloaded is usually an image or text file with an average size of about 10,000 bytes. This characteristic of the Web makes cable- and satellite-based Internet services viable. The data rates for receiving are much higher than the data rates for sending because web users mostly receive information.
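To see how small a typical download request is, consider a hypothetical GET request. The hostname and header values below are invented, and the whole message comes to a little over a hundred bytes:

```python
# A hypothetical HTTP-style GET request; every value here is invented.
request = (
    "GET /images/logo.gif HTTP/1.0\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Accept: image/gif, image/jpeg, */*\r\n"
    "\r\n"
)
size = len(request.encode("ascii"))  # a little over 100 bytes
```

The response carrying a 10,000-byte image is roughly a hundred times larger than this request, which is why asymmetric links work well for web browsing.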
A small percentage of web transactions are more correctly characterized as upload operations. In these cases, requests are relatively large and responses are very small. Examples of uploads include sending an email message and transferring an image file from your computer to a server.
The most common web clients are called browsers. These are applications such as Netscape Navigator and Microsoft Internet Explorer. The purpose of a browser is to render web content for us to view and interact with. Because of the myriad features they support, web browsers are very large and complicated programs. In addition to the GUI-based clients, there are a few simple command-line client programs, such as Lynx and Wget.
A number of different servers are in widespread use on the Web. The Apache HTTP server is a popular choice and freely available. Netscape, Microsoft, and other companies also have server products. Many content providers are concerned with the performance of their servers. The most popular sites on the Net can receive ten million requests per day with peak request rates of 1000 per second. At this scale, both the hardware and software must be very carefully designed to cope with the load. Many sites run multiple servers in parallel to handle their high request rates and for redundancy.
Recently, there has been a lot of excitement surrounding peer-to-peer applications, such as Napster. In these systems, clients share files and other resources (e.g., CPU cycles) directly with each other. Napster, which enables people to share MP3 files, does not store the files on its servers. Rather, it acts as a directory and returns pointers to files so that two clients can communicate directly. In the peer-to-peer realm, there are no centralized servers; every client is a server.
The peer-to-peer movement is relatively young but already very popular. It’s likely that a significant percentage of Internet traffic today is due to Napster alone. However, I won’t discuss peer-to-peer clients in this book. One reason for this is that Napster uses its own transfer protocol, whereas here we’ll focus on HTTP.
Much of this book is about proxies. A proxy is an intermediary in a web transaction. It is an application that sits somewhere between the client and the origin server. Proxies are often used on firewalls to provide security. They allow (and record) requests from the internal network to the outside Internet.
A proxy behaves like both a client and a server. It acts like a server to clients, and like a client to servers. A proxy receives and processes requests from clients, and then it forwards those requests to origin servers. Some people refer to proxies as “application layer gateways.” This name reflects the fact that the proxy lives at the application layer of the OSI reference model, just like clients and servers. An important characteristic of an application layer gateway is that it uses two TCP connections: one to the client and one to the server. This has important ramifications for some of the topics we’ll discuss later.
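One small piece of the proxy's dual role can be sketched in code. A client talking to a proxy puts the full URL in its request line; the proxy extracts the origin server's hostname and builds the ordinary request it will send over its second TCP connection. This is only an illustrative fragment, not a complete proxy:

```python
from urllib.parse import urlsplit

def rewrite_request_line(request_line):
    """Split a proxy-style request line ("GET http://host/path HTTP/1.0")
    into the origin server's hostname and the request line the proxy
    forwards over its second TCP connection ("GET /path HTTP/1.0")."""
    method, url, version = request_line.split(" ")
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, "%s %s %s" % (method, path, version)

host, line = rewrite_request_line(
    "GET http://www.example.com/index.html HTTP/1.0")
```

The two-connection structure is exactly what this split implies: the proxy terminates the client's connection, then opens its own connection to `host` and sends the rewritten line.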
Proxies are used for a number of different things, including logging, access controls, filtering, translation, virus checking, and caching. We’ll talk more about these and the issues they create in Chapter 3.
I use the term object to refer to the entity exchanged between a client and a server. Some people may use document or page, but these terms are misleading because they imply textual information or a collection of text and images. “Object” is generic and better describes the different types of content returned from servers, such as audio files, ZIP files, and C programs. The standards documents (RFCs) that describe web components and protocols prefer the terms entity, resource, and response. My use of object corresponds to their use of entity, where an object (entity) is a particular response generated from a particular resource. Web objects have a number of important characteristics, including size (number of bytes), type (HTML, image, audio, etc.), time of creation, and time of last modification.
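These characteristics are what a cache typically records about each stored object. A minimal sketch, with field names of my own invention:

```python
from dataclasses import dataclass

@dataclass
class WebObject:
    """A record of an object's cache-relevant characteristics.
    The field names here are illustrative, not from any standard."""
    url: str              # the resource's identifier, used as the cache key
    size: int             # number of bytes in the body
    content_type: str     # e.g., "text/html" or "image/gif"
    created: float        # time of creation (Unix timestamp)
    last_modified: float  # time of last modification (Unix timestamp)

obj = WebObject("http://www.example.com/", 10000, "text/html",
                created=9.0e8, last_modified=9.5e8)
```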
In broad terms, web resources can be considered either dynamic or static. Responses for dynamic resources are generated on the fly when the request is made. Static responses are pregenerated, independent of client requests. When people think of dynamic responses, often what comes to mind are stock quotes, live camera images, and web page counters. Digitized photographs, magazine articles, and software distributions are all static information. The distinction between dynamic and static content is not necessarily so clearly defined. Many web resources are updated at various intervals (perhaps daily) but not uniquely generated on a per-request basis. The distinction between dynamic and static resources is important because it has serious consequences for cache consistency.
Resource identifiers are a fundamental piece of the architecture of the Web. These are the names and addresses for web objects, analogous to street addresses and telephone numbers. Officially, they are called Uniform Resource Identifiers, or URIs. They are used by people and computers alike. Caches use them to identify and index the stored objects. According to the design specification, RFC 2396, URIs must be extensible, printable, and able to encode all current and future naming schemes. Because of these requirements, only certain characters may appear in URIs, and some characters have special meanings.
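Characters outside the allowed set are percent-encoded: each becomes a "%" followed by two hexadecimal digits. Python's standard library shows the idea; the example path here is invented:

```python
from urllib.parse import quote, unquote

# A space and a literal "?" are not allowed to appear as data in a URI
# path, so they are percent-encoded. "/" is kept as a delimiter.
encoded = quote("/path/a file?.html", safe="/")
decoded = unquote(encoded)
```

Here `encoded` is `/path/a%20file%3F.html`; decoding reverses the transformation exactly, which is what lets URIs carry arbitrary names through the restricted character set.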
URLs have a very important characteristic worth mentioning here. Every URL includes a network host address—either a hostname or an IP address. Thus, a URL is bound to a specific server, called the origin server. This characteristic has some negative side effects for caching. Occasionally, the same resource exists on two or more servers, as occurs with mirror sites. When a resource has more than one name, it can get cached under different names. This wastes storage space and bandwidth.
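A quick sketch shows why mirrored resources waste cache space: caches typically index objects by the full URL, hostname included, so identical files on two mirrors get distinct keys. The hash choice and the hostnames below are illustrative:

```python
import hashlib

def cache_key(url):
    """A typical cache indexes stored objects by a digest of the
    full URL, hostname included (MD5 is just an example choice)."""
    return hashlib.md5(url.encode("ascii")).hexdigest()

# The same file offered by two mirror sites gets two different keys,
# so a cache downloads and stores the object twice.
k1 = cache_key("http://mirror1.example.com/pub/file.tar.gz")
k2 = cache_key("http://mirror2.example.com/pub/file.tar.gz")
```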
Uniform Resource Names (URNs) are similar to URLs, but they refer to resources in a location-independent manner. RFC 2141 describes URNs, which are also sometimes called persistent names. Resources named with URNs can be moved from one server (location) to another without causing problems. URNs follow the syntax urn:<namespace>:<identifier>, so some sample (hypothetical) URNs might look like urn:isbn:0-123-45678-9 or urn:example:weather:chicago.
In 1995, the World Wide Web Project left its birthplace at CERN in Geneva, Switzerland, and became the World Wide Web Consortium. In conjunction with this move, their web site location changed from info.cern.ch to www.w3.org. Everyone who used a URL with the old location received a page with a link to the new location and a reminder to “update your links and hotlist.” Had URNs been implemented and in use back then, such a problem could have been avoided.
Another advantage of URNs is that a single name can refer to a resource replicated at many locations. When an application processes such a URN request, it must select one of the locations (presumably the closest or fastest) from which to retrieve the object. RFC 2168 describes methods for resolving URNs.
Unfortunately, URNs have been very slow to catch on. Very few applications are able to handle URNs, while everyone and everything knows about URLs. Through the remainder of this book, I’ll use both URI and URL somewhat interchangeably. I won’t say much more about URNs, but keep in mind that URI is a generic term that refers to both URLs and URNs.