Web Transport Protocols

Clients and servers use a number of different transport protocols to exchange information. These protocols, built on top of TCP/IP, comprise the majority of all Internet traffic today. The Hypertext Transfer Protocol (HTTP) is the most common because it was designed specifically for the Web. A number of legacy protocols, such as the File Transfer Protocol (FTP) and Gopher, are still in use today. According to Merit’s measurements from the NSFNet, HTTP replaced FTP as the dominant protocol in April of 1995.[2] Some newer protocols, such as Secure Sockets Layer (SSL) and the Real-time Transport Protocol (RTP), are increasing in use.

HTTP

Tim Berners-Lee and others originally designed HTTP to be a simple and lightweight transfer protocol. Since its inception, HTTP has undergone three major revisions. The very first version, retroactively named HTTP/0.9, is extremely simple and almost trivial to implement. At the same time, however, it lacks any real features. The second version, HTTP/1.0 [Berners-Lee, Fielding and Frystyk, 1996], defines a small set of features and still maintains the original goals of being simple and lightweight. However, at a time when the Web was experiencing phenomenal growth, many developers found that HTTP/1.0 did not provide all the functionality they required for new services.

The HTTP Working Group of the Internet Engineering Task Force (IETF) has worked long and hard on the protocol specification for HTTP/1.1. New features in this version include persistent connections, range requests, content negotiation, and improved cache controls. RFC 2616 is the latest standards track document describing HTTP/1.1. Unlike the earlier versions, HTTP/1.1 is a very complicated protocol.

HTTP transactions use a well-defined message structure. A message, which can be either a request or a response, has two parts: the headers and the body. Headers are always present, but the body is optional. Headers are represented as ASCII strings terminated by carriage return and linefeed characters. An empty line indicates the end of headers and the start of the body. Message bodies are treated as binary data. The headers are where we find information and directives relevant to caching.

An HTTP header consists of a name followed by a colon and then one or more values separated by commas. Multiword names are separated with dashes. Header names and reserved words are case-insensitive. For example, these are all HTTP headers:

Host: www.slashdot.org
Content-type: text/html
Date: Sat, 03 Mar 2001 13:41:06 GMT
Cache-control: no-cache,private,no-store
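To make the header syntax concrete, here is a minimal Python sketch that splits a header line into a case-insensitive name and its values. The helper name parse_header and the LIST_VALUED set are my own assumptions for illustration. Note that not every comma is a separator: the Date header above contains one, so the sketch splits on commas only for headers it has been told are lists.

```python
# Hypothetical helper, not part of any HTTP library.
# LIST_VALUED is an illustrative assumption: headers whose value is a comma-separated list.
LIST_VALUED = {"cache-control", "accept"}

def parse_header(line):
    """Split 'Name: value' into a lowercased name and a list of values."""
    name, _, value = line.partition(":")   # split at the first colon only
    name = name.strip().lower()            # header names are case-insensitive
    value = value.strip()
    if name in LIST_VALUED:
        values = [v.strip() for v in value.split(",") if v.strip()]
    else:
        values = [value]                   # e.g. Date: the comma is part of the value
    return name, values

name, values = parse_header("Cache-control: no-cache,private,no-store")
```

Lowercasing the name is one simple way to honor case-insensitivity, so that Content-type and content-Type end up under the same key.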

HTTP defines four categories of headers: entity, request, response, and general. Entity headers describe something about the data in the message body. For example, Content-length is an entity header. It describes the length of the message body. Request headers should appear only in HTTP requests and are meaningless for responses. Host and If-modified-since are request headers. Response headers, obviously, apply only to HTTP responses. Age is a response header. Finally, general headers are dual-purpose: they can be found in both requests and responses. Cache-control is a general header, one that we’ll talk about often.

The first line of an HTTP message is special. For requests, it’s called the request line and contains the request method, a URI, and an HTTP version number. For responses, the first line is called the status line, and it includes an HTTP version number, a status code that indicates the success or failure of the request, and a short reason phrase. Note that most request messages do not have a body, but most response messages do.

Here’s a simple GET request:

GET /index.html HTTP/1.1
Host: www.web-cache.com
Accept: */*

And here’s a simple POST request with a message body:

POST /cgi-bin/query.pl HTTP/1.1
Host: www.web-cache.com
Accept: */*
Content-Length: 19

args=foo+bar&max=10

Here’s a successful response:

HTTP/1.1 200 OK
Date: Wed, 21 Feb 2001 09:57:56 GMT
Last-Modified: Mon, 19 Feb 2001 20:45:26 GMT
Server: Apache/1.2.5
Content-Length: 13
Content-Type: text/plain

Hello, world.

And here’s an error response:

HTTP/1.0 404 Not Found
Date: Fri, 23 Feb 2001 00:46:54 GMT
Server: Apache/1.2.5
content-Type: text/html

<HTML><HEAD>
<TITLE>404 File Not Found</TITLE>
</HEAD><BODY>
<H1>File Not Found</H1>
The requested URL /foo.bar was not found on this server.<P>
</BODY></HTML>
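All four messages above follow the same layout: a first line, some headers, an empty line, and an optional body. The following Python sketch pulls those pieces apart. The function names split_message and parse_start_line are hypothetical, and the code assumes a complete, well-formed message held in memory; a real implementation would also have to cope with repeated headers, partial reads, and malformed input.

```python
def split_message(raw):
    """Split a raw HTTP message (bytes) into (start_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")    # the empty line ends the headers
    lines = head.decode("ascii").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # a repeated header overwrites the earlier one
    return lines[0], headers, body                # the body stays as binary data

def parse_start_line(start_line):
    """Interpret the first line as either a request line or a status line."""
    parts = start_line.split(" ", 2)
    if parts[0].startswith("HTTP/"):
        # Status line: version, status code, reason phrase.
        reason = parts[2] if len(parts) > 2 else ""
        return {"kind": "response", "version": parts[0],
                "status": int(parts[1]), "reason": reason}
    # Request line: method, URI, version (raises ValueError if malformed).
    method, uri, version = parts
    return {"kind": "request", "method": method, "uri": uri, "version": version}
```

For the POST request shown earlier, the body returned by split_message is 19 bytes long, which matches its Content-Length header.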

RFC 2616 defines the request methods listed in Table 1-1. Other RFCs, such as 2518, define additional methods for HTTP. Applications may even make up their own extension methods, although proxies are not required to support them. A proxy that receives a request with an unknown or unsupported method should respond with a 405 (Method Not Allowed) message. The descriptions in Table 1-1 are necessarily brief. Refer to Section 9 of RFC 2616 for full details.

Table 1-1. HTTP Request Methods Defined by RFC 2616

Method     Description
GET        A request for the information identified by the request URI.
HEAD       Identical to GET, except the response does not include a message body.
POST       A request for the server to process the data present in the message body.
PUT        A request to store the enclosed body in the named URI.
TRACE      A “loopback” method that essentially echoes a request back to the client. It is also useful for discovering and testing proxies between the client and the server.
DELETE     A request to remove the named URI from the origin server.
OPTIONS    A request for information about a server’s capabilities or support for optional features.
CONNECT    Used to tunnel certain protocols, such as SSL, through a proxy.

For our purposes, GET, HEAD, and POST are the only interesting request methods. I won’t say much about the others in this book. We’ll talk more about HTTP in Chapter 2.
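As a rough illustration of how a proxy might turn away a method it does not support, here is a small Python sketch that follows the 405 behavior described above. The function name reject_unknown_method and the choice to support exactly the Table 1-1 methods are assumptions made for the example. The Allow header, which a 405 response carries, lists the methods the proxy does accept.

```python
# Methods from Table 1-1; treating exactly this set as "supported" is an assumption.
RFC2616_METHODS = frozenset(
    {"GET", "HEAD", "POST", "PUT", "TRACE", "DELETE", "OPTIONS", "CONNECT"}
)

def reject_unknown_method(method, supported=RFC2616_METHODS):
    """Return None if the method is supported, else a 405 response as text."""
    if method in supported:        # method names are case-sensitive
        return None
    allow = ", ".join(sorted(supported))
    return ("HTTP/1.1 405 Method Not Allowed\r\n"
            f"Allow: {allow}\r\n"
            "Content-Length: 0\r\n"
            "\r\n")

# PROPFIND is one of the extension methods defined by RFC 2518.
response = reject_unknown_method("PROPFIND")
```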

FTP

The File Transfer Protocol (FTP) has been in use since the early years of the Internet (1971). The current standard document, RFC 959, by Postel and Reynolds, is very different from the original specification, RFC 172. FTP consumed more Internet backbone bandwidth than any other protocol until about March of 1995.

An FTP session is a bit more complicated than an HTTP transaction. FTP uses a control channel for commands and responses and a separate data channel for actual data transfer. Before data transfer can occur, approximately six command and reply exchanges take place on the control channel. FTP clients must “log in” to a server with a username and password. Many servers allow anonymous access to their publicly available files. Because FTP is primarily intended to give access to remote filesystems, the protocol supports commands such as CWD (change working directory) and LIST (directory listing). These differences make FTP somewhat awkward to implement in web clients. Regardless, FTP remains a popular way of making certain types of information available to Internet and web users.
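To give a feel for the control-channel chatter that precedes a transfer, the Python sketch below lists one plausible command sequence for an anonymous download. The function name control_commands, the placeholder password, and the exact choice of commands are assumptions; real clients vary (some send PORT instead of PASV, for example), and each command is answered by a server reply that is not shown here.

```python
import posixpath

def control_commands(path, user="anonymous", password="guest@example.com"):
    """List the control-channel commands a client might send to fetch one file."""
    directory, filename = posixpath.split(path)
    return [
        f"USER {user}",                # log in: username
        f"PASS {password}",            # log in: password (placeholder address)
        f"CWD {directory or '/'}",     # change working directory
        "TYPE I",                      # binary ("image") transfer mode
        "PASV",                        # ask the server to open a data port
        f"RETR {filename}",            # start the transfer on the data channel
    ]

commands = control_commands("/pub/README")
```

Compare this with HTTP, where the single GET request shown earlier is enough to start the transfer.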

SSL/TLS

Netscape invented the Secure Sockets Layer (SSL) protocol in 1994 to foster electronic commerce applications on the Internet. SSL provides secure, end-to-end encryption between clients and servers. Before SSL, people were justifiably afraid to conduct business online due to the relative ease of sniffing network traffic. The development and standardization of SSL have moved into the IETF, where the protocol is now called Transport Layer Security (TLS) and documented in RFC 2246.

The TLS protocol is not restricted to HTTP and the Web. It can be used for other applications, such as email (SMTP) and newsgroups (NNTP). When talking about HTTP and TLS, the correct terminology is “HTTP over TLS,” the particulars of which are described in RFC 2818. Some people refer to it as HTTPS because HTTP/TLS URLs use “https” as the protocol identifier:

https://secure.shopping.com/basket

Proxies interact with HTTP/TLS traffic in one of two ways: either as a connection endpoint or as a device in the middle. If a proxy is an endpoint, it encrypts and decrypts the HTTP traffic. In this case, the proxy may be able to store and reuse responses. If, on the other hand, the proxy is in the middle, it can only tunnel the traffic between the two endpoints. Since the communication is encrypted, the responses cannot be cached and reused.
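In the tunneling case, a client typically asks the proxy to open the tunnel with the CONNECT method from Table 1-1. The Python sketch below builds such a request from an https URL. The function name connect_request is hypothetical, and the sketch assumes the default port of 443 when the URL does not name one.

```python
from urllib.parse import urlsplit

def connect_request(url):
    """Build the CONNECT request a client might send to a proxy for an https URL."""
    parts = urlsplit(url)
    port = parts.port or 443                 # assumed default port for https
    target = f"{parts.hostname}:{port}"
    # Only the host and port are revealed to the proxy; the path stays inside the tunnel.
    return f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n\r\n"

request = connect_request("https://secure.shopping.com/basket")
```

Notice that the request never mentions /basket. Once the tunnel is established, the proxy sees only encrypted bytes, which is why it has nothing it could cache.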

Gopher

The Gopher protocol is all but extinct on the Web. In principle, Gopher is very similar to HTTP/0.9. The client sends a single request line, to which the server replies with some content. The client knows a priori what type of content to expect because each request includes an encoded, Gopher-specific content-type parameter. The extra features offered by HTTP and HTML made Gopher obsolete.
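To show just how small a Gopher request is, here is a Python sketch that derives the request line from a gopher URL. It assumes the usual gopher URL convention, in which the first character of the path is a one-character type code (0 for a text file, 1 for a directory, and so on) and the rest is the selector string sent to the server. The function name gopher_request and the example hostname are made up for illustration.

```python
from urllib.parse import urlsplit, unquote

def gopher_request(url):
    """Return (item_type, request_bytes) for a gopher URL."""
    path = unquote(urlsplit(url).path)
    item_type = path[1] if len(path) > 1 else "1"   # no type given: assume a directory
    selector = path[2:]
    # The whole request is the selector followed by CRLF.
    return item_type, (selector + "\r\n").encode("ascii")

item_type, request = gopher_request("gopher://gopher.example.com/0/docs/readme.txt")
```

Because the type code is known before the request is sent, the client needs no response headers to decide how to handle the content that comes back.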



[2] The source of this data is ftp://ftp.merit.edu/nsfnet/statistics/.
