RESTful Web Services by Sam Ruby, Leonard Richardson

Chapter 9. The Building Blocks of Services

Throughout this book I’ve said that web services are based on three fundamental technologies: HTTP, URIs, and XML. But there are also lots of technologies that build on top of these. You can usually save yourself some work and broaden your audience by adopting these extra technologies: perhaps a domain-specific XML vocabulary, or a standard set of rules for exposing resources through HTTP’s uniform interface. In this chapter I’ll show you several technologies that can improve your web services. Some you’re already familiar with and some will probably be new to you, but they’re all interesting and powerful.

Representation Formats

What representation formats should your service actually send and receive? This is the question of how data should be represented, and it’s an epic question. I have a few suggestions, which I present here in a rough order of precedence. My goal is to help you pick a format that says something about the semantics of your data, so you don’t find yourself devising yet another one-off XML vocabulary that no one else will use.

I assume your clients can accept whatever representation format you serve. The known needs of your clients take priority over anything I can say here. If you know your data is being fed directly into Microsoft Excel, you ought to serve representations in Excel format or a compatible CSV format. My advice also does not extend to document formats that can only be understood by humans. If you’re serving audio files, I’ve got nothing to say about which audio format you should choose. To a first approximation, a programmed client finds all audio files equally unintelligible.


XHTML

Media type: application/xhtml+xml

The common text/html media type is deprecated for XHTML. It’s also the only media type that Internet Explorer handles as HTML. If your service might be serving XHTML data directly to web browsers, you might want to serve it as text/html.

My number-one representation recommendation is the format I’ve been using in my own services throughout this book, and the one you’re probably most familiar with. HTML drives the human web, and XHTML can drive the programmable web. The XHTML standard (http://www.w3.org/TR/xhtml1/) relies on the HTML standard to do most of the heavy lifting (http://www.w3.org/TR/html401/).

XHTML is HTML under a few restrictions that make every XHTML document also valid XML. If you know HTML, you know most of what there is to know about XHTML, but there are some syntactic differences, like how to present self-closing tags. The tag names and attributes are the same: XHTML is expressive in the same ways as HTML. Since the XHTML standard just points to the HTML standard and then adds some restrictions to it, I tend to refer to “HTML tags” and the like except where there really is a difference between XHTML and HTML.

I don’t actually recommend HTML as a representation format, because it can’t be reliably parsed with an XML parser. There are many excellent and liberal HTML parsers, though (I mentioned a few in Chapter 2), so your clients have options if you can’t or don’t want to serve XHTML. Right now, XHTML is a better choice if you expect a wide variety of clients to handle your data.
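The difference is easy to see with a strict XML parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module; the two snippets are made up for illustration, and they differ only in how the br tag is written.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: the empty br element is self-closed.
XHTML_SNIPPET = "<p>First line<br />Second line</p>"

# Legal HTML 4, but not well-formed XML: the br tag is never closed.
HTML_SNIPPET = "<p>First line<br>Second line</p>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(XHTML_SNIPPET))  # True
print(parses_as_xml(HTML_SNIPPET))   # False
```

A client that only has an XML parser can consume the first snippet but needs a liberal HTML parser for the second.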

HTML can represent many common types of data: nested lists (tags like ul and li), key-value pairs (the dl tag and its children), and tabular data (the table tag and its children). It supports many different kinds of hypermedia. HTML does have its shortcomings: its hypermedia forms are limited, and won’t fully support HTTP’s uniform interface until HTML 5 is released.

HTML is also poor in semantic content. Its tag vocabulary is very computer-centric. It has special tags for representing computer code and output, but nothing for the other structured fruits of human endeavor, like poetry. One resource can link to another resource, and there are standard HTML attributes (rel and rev) for expressing the relationship between the linker and the linkee. But the HTML standard defines only 15 possible relationships between resources, including “alternate,” “stylesheet,” “next,” “prev,” and “glossary.” See http://www.w3.org/TR/html401/types.html#type-links for a complete list.

Since HTML pages are representations of resources, and resources can be anything, these 15 relationships barely scratch the surface. HTML might be called upon to represent the relationship between any two things. Of course, I can come up with my own values for rel and rev to supplement the official 15, but if everyone does that confusion will reign: we’ll all pick different values to represent the same relationships. If I link my web page to my wife’s web page, should I specify my relationship to her as husband, spouse, or sweetheart? To a human it doesn’t matter much, but to a computer program (the real client on the programmable web) it matters a lot. Similarly, HTML can easily represent a list, and there’s a standard HTML attribute (class) for expressing what kind of list it is. But HTML doesn’t say what kinds of lists there are.

This isn’t HTML’s fault, of course. HTML is supposed to be used by people who work in any field. But once you’ve chosen a field, everyone who works in that field should be able to agree on what kinds of lists there are, or what kinds of relationships can exist between resources. This is why people have started getting together and adding standard semantics to XHTML with microformats.

XHTML with Microformats

Media type: application/xhtml+xml

Microformats are lightweight standards that extend XHTML to give domain-specific semantics to HTML tags. Instead of reinventing data storage techniques like lists, microformats use existing HTML tags like ol, span, and abbr. The semantic content usually lives in custom values for the attributes of the tags, such as class, rel, and rev. Example 9-1 shows an example: someone’s home telephone number represented in the microformat known as hCard.

Example 9-1. A telephone number represented in the hCard microformat

<span class="tel">
 <span class="type">home</span>:
 <span class="value">+1.415.555.1212</span>
</span>

Microformat adoption is growing, especially as more special-purpose devices get on the web. Any microformat document can be embedded in an XHTML page, because it is XHTML. A web service can serve an XHTML representation that contains microformat documents, along with links to other resources and forms for creating new ones. This document can be automatically parsed for its microformat data, or rendered for human consumption with a standard web browser.
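As a rough sketch of what that automatic parsing might look like, the following Python code pulls telephone numbers out of markup like Example 9-1 using only the standard library. The surrounding div with class "vcard" and the helper names are assumptions added for the example; a real hCard parser handles many more properties and edge cases.

```python
import xml.etree.ElementTree as ET

# Example 9-1, wrapped in a hypothetical parent element so it forms a document.
HCARD = """
<div class="vcard">
 <span class="tel">
  <span class="type">home</span>:
  <span class="value">+1.415.555.1212</span>
 </span>
</div>
"""


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def telephone_numbers(xhtml):
    """Return a (type, value) pair for every element marked class="tel"."""
    root = ET.fromstring(xhtml)
    numbers = []
    for tel in root.iter("span"):
        if not has_class(tel, "tel"):
            continue
        kind = value = None
        for child in tel.iter("span"):
            if has_class(child, "type"):
                kind = (child.text or "").strip()
            elif has_class(child, "value"):
                value = (child.text or "").strip()
        numbers.append((kind, value))
    return numbers


print(telephone_numbers(HCARD))  # [('home', '+1.415.555.1212')]
```

The same document still renders as ordinary text in a web browser, which ignores the class values it has no style rules for.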

As of the time of writing there were nine microformat specifications. The best-known is probably rel-nofollow, a standard value for the rel attribute invented by engineers at Google as a way of fighting comment spam on weblogs. Here’s a complete list of official microformats:


hCalendar

A way of representing events on a calendar or planner. Based on the IETF iCalendar format.

hCard

A way of representing contact information for people and organizations. Based on the vCard standard defined in RFC 2426.

rel-license

A new value for the rel attribute, used when linking to the license terms for an XHTML document. For example:

<a href="http://creativecommons.org/licenses/by-nd/" rel="license">
 Made available under a Creative Commons Attribution-NoDerivs license.
</a>

That’s standard XHTML. The only thing the microformat does is define a meaning for the string license when it shows up in the rel attribute.

rel-nofollow

A new value for the rel attribute, used when linking to URIs without necessarily endorsing them.

rel-tag

A new value for the rel attribute, used to label a web page according to some external classification system.

VoteLinks

A new value for the rev attribute, an extension of the idea behind rel-nofollow. VoteLinks lets you say how you feel about the resource you’re linking to by casting a “vote.” For instance:

<a rev="vote-for" href="http://www.example.com">The best webpage ever.</a>
<a rev="vote-against" href="http://example.com/">
A shameless ripoff of www.example.com</a>

XFN

Stands for XHTML Friends Network. A new set of values for the rel attribute, for capturing the relationships between people. An XFN value for the rel attribute captures the relationship between this “person” resource and another such resource. To bring back the “Alice” and “Bob” resources from “Relationships Between Resources” in Chapter 8, an XHTML representation of Alice might include this link:

<a rel="spouse" href="Bob">Bob</a>

XMDP

Stands for XHTML Meta Data Profiles. A way of describing your custom values for XHTML attributes, using the XHTML tags for definition lists: dl, dd, and dt. This is a kind of meta-microformat: a microformat like rel-tag could itself be described with an XMDP document.

XOXO

Stands (sort of) for Extensible Open XHTML Outlines. Uses XHTML’s list tags to represent outlines. There’s nothing in XOXO that’s not already in the XHTML standard, but declaring a document (or a list in a document) to be XOXO signals that a list is an outline, not just a random list.

Those are the official microformat standards; they should give you an idea of what microformats are for. As of the time of writing there were also about 10 microformat drafts and more than 50 discussions about possible new microformats. Here are some of the more interesting drafts:


geo

A way of marking up latitude and longitude on Earth. This would be useful in the mapping application I designed in Chapter 5. I didn’t use it there because there’s still a debate about how to represent latitude and longitude on other planetary bodies: extend geo or define different microformats for each body?

hAtom

A way of representing in XHTML the data Atom represents in XML.

hResume

A way of representing résumés.

hReview

A way of representing reviews, such as product reviews or restaurant reviews.

xFolk

A way of representing bookmarks. This would make an excellent representation format for the social bookmarking application in Chapter 7. I chose to use Atom instead because it was less code to show you.

You get the idea. The power of microformats is that they’re based on HTML, the most widely deployed markup format in existence. Because they’re HTML, they can be embedded in web pages. Because they’re also XML, they can be embedded in XML documents. They can be understood at various levels by human beings, specialized microformat processors, dumb HTML processors, and even dumber XML processors.

Even if the microformats wiki shows no microformat standard or draft for your problem space, you might find an open discussion on the topic that helps you clarify your data structures. You can also create your own microformat (see “Ad Hoc XHTML” later in this chapter).


Atom

Media type: application/atom+xml

Atom is an XML vocabulary for describing lists of timestamped entries. The entries can be anything, but they usually contain pieces of human-authored text like you’d see on a weblog or a news site. Why should you use an Atom list instead of a regular XHTML list? Because Atom provides special tags for conveying the semantics of publishing: authors, contributors, languages, copyright information, titles, categories, and so on. (Of course, as I mentioned earlier, there’s a microformat called hAtom that brings all of these semantics into XHTML.) Atom is a useful XML vocabulary because so many web services are, in the broad sense, ways of publishing information. What’s more, there are a lot of web service clients that understand the semantics of Atom documents. If your web service is addressable and your resources expose Atom representations, you’ve immediately got a huge audience.

Atom lists are called feeds, and the items in the lists are called entries.


Some feeds are written in some version of RSS, a different XML vocabulary with similar semantics. All versions of RSS have the same basic structure as Atom: a feed that contains a number of entries. There are a number of variants of RSS, but you shouldn’t have to worry about them at all. Today, every major tool for consuming feeds understands Atom.

These days, most weblogs and news sites expose a special resource whose representation is an Atom feed. The entries in the feed describe and link to other resources: weblog entries or news stories published on the site. You, the client, can consume these resources with a feed reader or some other external program. In Chapter 7, I represented lists of bookmarks as Atom feeds. Example 9-2 shows a simple Atom feed document.

Example 9-2. A simple Atom feed containing one entry

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RESTful News</title>
  <link rel="alternate" href="http://example.com/RestfulNews" />
  <author><name>Leonard Richardson</name></author>
  <contributor><name>Sam Ruby</name></contributor>

  <entry>
    <title>New Resource Will Respond to PUT, City Says</title>
    <link rel="edit" href="http://example.com/RestfulNews/104" />

    <summary>
      After long negotiations, city officials say the new resource
      being built in the town square will respond to PUT. Earlier
      criticism of the proposal focused on the city's plan to modify
      the resource through overloaded POST.
    </summary>
    <category scheme="http://www.example.com/categories/RestfulNews"
              term="local" label="Local news" />
  </entry>
</feed>

In that example you can see some of the tags that convey the semantics of publishing: author, title, link, summary, updated, and so on. The feed as a whole is a joint project: it has an author tag and a contributor tag. It’s also got a link tag that points to an alternate URI for the underlying “feed” resource: the news site. The single entry has no author tag, so it inherits author information from the feed. The entry does have its own link tag, which points to http://example.com/RestfulNews/104. That URI identifies the entry as a resource in its own right. The entry also has a textual summary of the story. To get the remainder, the client must presumably GET the entry’s URI.
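To make the inheritance rule concrete, here is a small sketch of a client that reads a cut-down version of the feed with Python's standard XML parser. The function name and the trimmed feed are assumptions made for the example; a production client would more likely use a dedicated feed-parsing library.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A trimmed-down version of the feed in Example 9-2.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RESTful News</title>
  <author><name>Leonard Richardson</name></author>
  <entry>
    <title>New Resource Will Respond to PUT, City Says</title>
    <link rel="edit" href="http://example.com/RestfulNews/104" />
  </entry>
</feed>
"""


def entry_summaries(feed_xml):
    """Return (title, author, href) for each entry in an Atom feed.

    An entry with no author of its own inherits the feed's author.
    """
    feed = ET.fromstring(feed_xml)
    feed_author = feed.findtext(f"{ATOM}author/{ATOM}name")
    results = []
    for entry in feed.findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title")
        author = entry.findtext(f"{ATOM}author/{ATOM}name") or feed_author
        link = entry.find(f"{ATOM}link")
        href = link.get("href") if link is not None else None
        results.append((title, author, href))
    return results


for title, author, href in entry_summaries(FEED):
    print(title, "|", author, "|", href)
```

The href it returns is the URI a client would GET to retrieve the rest of the story.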

An Atom document is basically a directory of published resources. You can use Atom to represent photo galleries, albums of music (maybe a link to the cover art plus one to each track on the album), or lists of search results. Or you can omit the link tags and use Atom as a container for original content like status reports or incoming emails. Remember: the two reasons to use Atom are that it represents the semantics of publishing, and that a lot of existing clients can consume it.

If your application almost fits in with the Atom schema, but needs an extra tag or two, there’s no problem. You can embed XML tags from other namespaces in an Atom feed. You can even define a custom namespace and embed its tags in your Atom feeds. This is the Atom equivalent of XHTML microformats: your Atom feeds can use conventions not defined in Atom, without becoming invalid. Clients that don’t understand your tag will see a normal Atom feed with some extra mysterious data in it.


OpenSearch

OpenSearch is one XML vocabulary that’s commonly embedded in Atom documents. It’s designed for representing lists of search results. The idea is that a service returns the results of a query as an Atom feed, with the individual results represented as Atom entries. But some aspects of a list of search results can’t be represented in a stock Atom feed: the total number of results, for instance. So OpenSearch defines three new elements, in the opensearch namespace:[28]


totalResults

The total number of results that matched the query.

itemsPerPage

How many items are returned in a single “page” of search results.

startIndex

If all the search results are numbered from zero to totalResults, then the first result in this feed document is entry number startIndex. When combined with itemsPerPage you can use this to figure out what “page” of results you’re on.
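The page arithmetic can be sketched in a few lines of Python. The feed below is invented for illustration, the namespace URI is the one used by OpenSearch 1.1, and the zero-based numbering follows the description above; check which starting index a given service actually uses before relying on it.

```python
import xml.etree.ElementTree as ET

OPENSEARCH = "{http://a9.com/-/spec/opensearch/1.1/}"

# A hypothetical search result feed with no entries, only the paging elements.
RESULTS = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>Search results</title>
  <opensearch:totalResults>95</opensearch:totalResults>
  <opensearch:startIndex>40</opensearch:startIndex>
  <opensearch:itemsPerPage>20</opensearch:itemsPerPage>
</feed>
"""


def paging(feed_xml):
    """Return (current page, number of pages), counting pages from zero."""
    feed = ET.fromstring(feed_xml)
    total = int(feed.findtext(f"{OPENSEARCH}totalResults"))
    start = int(feed.findtext(f"{OPENSEARCH}startIndex"))
    per_page = int(feed.findtext(f"{OPENSEARCH}itemsPerPage"))
    page = start // per_page        # which page this document is
    pages = -(-total // per_page)   # ceiling division
    return page, pages


print(paging(RESULTS))  # (2, 5)
```

A client that doesn't know about OpenSearch still sees a normal Atom feed and can ignore the three extra elements.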


SVG

Media type: image/svg+xml

Most graphic formats are just ways of laying pixels out on the screen. The underlying content is opaque to a computer: it takes a skilled human to modify a graphic or reuse part of one in another. Scalable Vector Graphics is an XML vocabulary that makes it possible for programs to understand and manipulate graphics. It describes graphics in terms of primitives like shapes, text, colors, and effects.

It would be a waste of time to represent a photograph in SVG, but using it to represent a graph, a diagram, or a set of relationships gives a lot of power to the client. SVG images can be scaled to arbitrary size without losing any detail. SVG diagrams can be edited or rearranged, and bits of them can be seamlessly snipped out and incorporated into other graphics. In short, SVG makes graphic documents work like other sorts of documents. Web browsers are starting to get support for SVG: newer versions of Firefox support it natively.
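Because an SVG image is an XML document, a program can edit it with ordinary XML tools. The following Python sketch recolors every circle in a tiny diagram; the diagram and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG elements without a namespace prefix.
ET.register_namespace("", SVG_NS)

# A hypothetical one-node diagram.
DIAGRAM = f"""<svg xmlns="{SVG_NS}" width="100" height="100">
  <circle cx="50" cy="50" r="10" fill="blue" />
  <text x="50" y="80">A node</text>
</svg>
"""


def recolor_circles(svg_xml, color):
    """Return the SVG document with every circle's fill changed."""
    root = ET.fromstring(svg_xml)
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        circle.set("fill", color)
    return ET.tostring(root, encoding="unicode")


print(recolor_circles(DIAGRAM, "red"))
```

Doing the same thing to a bitmap image would mean recognizing the circle in a grid of pixels first.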

Form-Encoded Key-Value Pairs

Media type: application/x-www-form-urlencoded

I covered this simple format in Chapter 6. This format is mainly used in representations the client sends to the server. A filled-out HTML form is represented in this format by default, and it’s an easy format for an Ajax application to construct. But a service can also use this format in the representations it sends. If you’re thinking of serving comma-separated values or RFC 822-style key-value pairs, try form-encoded values instead. Form-encoding takes care of the tricky cases, and your clients are more likely to have a library that can decode the document.
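As an illustration of those tricky cases, here is a short Python sketch using the standard urllib.parse module. The record is invented; its values contain an ampersand, an equals sign, a comma, and a non-ASCII character, all of which would need ad hoc escaping in a homemade format.

```python
from urllib.parse import parse_qs, urlencode

# A hypothetical record whose values contain awkward characters.
record = {"name": "Tom & Jerry", "note": "a=b, c", "city": "São Paulo"}

# Serialize the record as a form-encoded document.
encoded = urlencode(record)

# parse_qs returns a list of values per key; each key occurs once here.
decoded = {key: values[0] for key, values in parse_qs(encoded).items()}

print(encoded)
print(decoded == record)  # True
```

Both directions are handled by the library, so neither the service nor the client has to write escaping rules.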


JSON

Media type: application/json

JavaScript Object Notation is a serialization format for general data structures. It’s much more lightweight and readable than an equivalent XML document, so I recommend it for most cases when you’re transporting a serialized data structure rather than a hypermedia document.

I introduced JSON in “JSON Parsers: Handling Serialized Data” in Chapter 2, and showed a simple JSON document in Example 2-11. Example 9-3 shows a more complex JSON document: a hash of lists.

Example 9-3. A complex data type in JSON format

{"a":["b","c"], "1":[2,3]}

As I show in Chapter 11, JSON has special advantages when it comes to Ajax applications. It’s useful for any kind of application, though. If your data structures are more complex than key-value pairs, or you’re thinking of defining an ad hoc XML format, you might find it easier to define a JSON structure of nested hashes and arrays.
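For instance, the document in Example 9-3 turns into native data structures with one call to Python's standard json module. This is only a sketch of the round trip; the variable names are arbitrary.

```python
import json

# The document from Example 9-3.
document = '{"a":["b","c"], "1":[2,3]}'

# Parse the document into a dictionary of lists.
data = json.loads(document)
print(data["a"])  # ['b', 'c']
print(data["1"])  # [2, 3]

# Serialize the structure again and check that nothing was lost.
print(json.loads(json.dumps(data)) == data)  # True
```

Note that the key "1" stays a string after parsing: JSON object keys are always strings, even when they look like numbers.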

RDF and RDFa

The Resource Description Framework is a way of representing knowledge about resources. Resource here means the same thing as in the Resource-Oriented Architecture: a resource is anything important enough to have a URI. In RDF, though, the URIs might not be http: URIs. Abstract URI schemes like isbn: (for books) and urn: (for just about anything) are common. Example 9-4 is a simple RDF assertion, which claims that the title of this book is RESTful Web Services.

Example 9-4. An RDF assertion

<span about="isbn:9780596529260" property="dc:title">
 RESTful Web Services
</span>

There are three parts to an RDF assertion, or triple, as they’re called. There’s the subject, a resource identifier: in this case, isbn:9780596529260. There’s the predicate, which identifies a property of the resource: in this case, dc:title. Finally there’s the object, which is the value of the property: in this case, “RESTful Web Services.” The assertion as a whole reads: “The book with ISBN 9780596529260 has a title of ‘RESTful Web Services.’”
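The three parts can be picked out of the markup mechanically. The Python sketch below handles only the simple case shown in Example 9-4, where the about and property attributes sit on the same element; it is not a real RDFa processor, which would also resolve prefixes like dc: and track subjects inherited from parent elements.

```python
import xml.etree.ElementTree as ET

# The assertion from Example 9-4.
RDFA = """<span about="isbn:9780596529260" property="dc:title">
 RESTful Web Services
</span>"""


def triples(xhtml):
    """Return (subject, predicate, object) for each element that carries
    both an about attribute and a property attribute."""
    root = ET.fromstring(xhtml)
    found = []
    for element in root.iter():
        subject = element.get("about")
        predicate = element.get("property")
        if subject and predicate:
            found.append((subject, predicate, (element.text or "").strip()))
    return found


print(triples(RDFA))
# [('isbn:9780596529260', 'dc:title', 'RESTful Web Services')]
```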

I didn’t make up the isbn: URI space: it’s a standard way of addressing books as resources. I didn’t make up the dc:title predicate, either. That comes from the Dublin Core Metadata Initiative. DCMI defines a set of useful predicates that apply to published works like books and weblogs. An automated client that understands the Dublin Core can scan RDF documents that use those terms, evaluate the assertions they contain, and even make logical deductions about the data.

Example 9-4 looks a lot like an XHTML snippet, because that’s what it is. There are a couple of ways of representing RDF assertions, and I’ve chosen to show you RDFa, a microformat-like standard for embedding RDF in XHTML. RDF/XML is a more popular RDF representation format, but I think it makes RDF look more complicated than it is, and it’s difficult to integrate RDF/XML documents into the web. RDFa documents can go into XHTML files, just like microformat documents. However, since RDFa takes some ideas from the unreleased XHTML 2 standard, a document that includes it won’t be valid XHTML for a while. A third way of representing RDF assertions is eRDF, which results in valid XHTML.

RDF in its generic form is the basis for the W3C’s Semantic Web project. On the human web, there are no standards for how we talk about the resources we link to. We describe resources in human language that’s difficult or impossible for machines to understand. RDF is a way of constraining human speech so that we talk about resources using a standard vocabulary—not one that machines “understand” natively, but one they can be programmed to understand. A computer program doesn’t understand the Dublin Core’s “dc:title” any more than it understands “title.” But if everyone agrees to use “dc:title,” we can program standard clients to reason about the Dublin Core in consistent ways.

Here’s the thing: I think microformats do a good job of adding semantics to the web we already have, and they add less complexity than RDF’s general subject-predicate-object form. I recommend using RDF only when you want interoperability with existing RDF processors, or are treating RDF as a general-purpose microformat for representing assertions about resources.

One very popular use of RDF is FOAF, a way of representing information about human beings and the relationships between them.

Framework-Specific Serialization Formats

Media type: application/xml

I’m talking here about informal XML vocabularies used by frameworks like Ruby’s ActiveRecord and Python’s Django to serialize database objects as XML. I gave an example back in Example 7-4. It’s a simple data structure: a hash or a list of hashes.

These representation formats are very convenient if you happen to be writing a service with a framework that gives you access to one. In Rails, you can just call to_xml on an ActiveRecord object or a list of such objects. The Rails serialization format is also useful if you’re not using Rails, but you want your service to be usable by ActiveResource clients. Otherwise, I don’t really recommend these formats, unless you’re just trying to get something up and running quickly (as I am in Chapters 7 and 12). The major downside of these formats is that they look like documents, but they’re really just serialized data structures. They never contain hypermedia links or forms.


Ad Hoc XHTML

Media type: application/xhtml+xml

If none of the work that’s already been done fits your problem space... well, first, think again. Just as you should think again before deciding you can’t fit your resources into HTTP’s uniform interface. If you think your resources can’t be represented by stock HTML or Atom or RDF or JSON, there’s a good chance you haven’t looked at the problem in the right way.

But it’s quite possible that your resources won’t fit any of the representation formats I’ve mentioned so far. Or maybe you can represent most of your resource state with XHTML plus some well-chosen microformats, but there’s still something missing. The next step is to consider creating your own microformat.

The high-impact way of creating a microformat is to go through the microformat process, hammer it out with other microformat enthusiasts, and get it published as an official microformat. This is most appropriate when lots of people are trying to represent the same kind of data. Ideally, you’re in a situation where the human web is littered with ad hoc HTML representations of the data, and where there are already a couple of big standards that can serve as a model for a more agile microformat. This is how the hCard and hCalendar microformats were developed. There were many people trying to put contact information and upcoming events on the human web, and preexisting standards (vCard and iCalendar) to steal ideas from. The representation of “places on a map” that I devised in Chapter 5 might be a starting point for an official microformat. There are lots of mapping sites on the human web, and lots of heavyweight standards for representing GIS data. If I wanted to build a microformat, I’d have a lot of ideas to work from.

The low-impact way of creating a microformat is to add semantic content to the XHTML you were going to write anyway. This is suitable for representation formats that no one else is likely to use, or as a starting point so you can get a real web service running while you’re going through the microformat process. The representation of the list of planets from Chapter 5 works better as an ad hoc set of semantics than as an official microformat. All it’s doing is saying that one particular list is a list of planets.

The microformat design patterns and naming principles give a set of sensible general rules for adding semantics to HTML. Their advice is useful even if you’re not trying to create an official microformat. The semantics you choose for your “micromicroformat” won’t be standardized, but you can present them in a standard way: the way microformats do it. Here are some of the more useful patterns.

  • If there’s an HTML tag that conveys the semantics you want, use it. To represent a set of key-value pairs, use the dl tag. To represent a list, use one of the list tags. If nothing fits, use the span or div tag.

  • Give a tag additional semantics by specifying its class attribute. This is especially important for span and div, which have no real meaning on their own.

  • Use the rel attribute in a link to specify another resource’s relationship to this one. Use the rev attribute to specify this page’s relationship to another one. If the relationship is symmetric, use rel. See “Hypermedia Technologies” later in this chapter for more on this.

  • Consider providing an XMDP file that describes your custom values for class, rel, and rev.
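A client can take advantage of these patterns with very little code. The Python sketch below reads a made-up XHTML page that marks one list with the class "planets" and its links with the custom rel value "map"; both values, the URIs, and the function names are hypothetical, chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

# A hypothetical ad hoc representation: two lists, told apart by class.
PAGE = """<div>
 <ul class="planets">
  <li><a rel="map" href="/Earth">Earth</a></li>
  <li><a rel="map" href="/Mars">Mars</a></li>
 </ul>
 <ul class="languages">
  <li>English</li>
 </ul>
</div>"""


def classed_items(xhtml, list_class):
    """Return the text of each item in every ul carrying the given class."""
    root = ET.fromstring(xhtml)
    items = []
    for ul in root.iter("ul"):
        if list_class in ul.get("class", "").split():
            for li in ul.findall("li"):
                items.append("".join(li.itertext()).strip())
    return items


def links_with_rel(xhtml, rel):
    """Return the href of every link whose rel attribute includes rel."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter("a")
            if rel in a.get("rel", "").split()]


print(classed_items(PAGE, "planets"))  # ['Earth', 'Mars']
print(links_with_rel(PAGE, "map"))     # ['/Earth', '/Mars']
```

Without the class attribute, the client would have no way to tell the list of planets from the list of languages.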

Other XML Standards and Ad Hoc Vocabularies

Media type: application/xml

In addition to XHTML, Atom, and SVG, there are a lot of specialized XML vocabularies I haven’t covered: MathML, OpenDocument, Chemical Markup Language, and so on. There are also specialized vocabularies you can use in RDF assertions, like Dublin Core and FOAF. A web service might serve any of these vocabularies as standalone representations, embed them into Atom feeds, or even wrap them in SOAP envelopes. If none of these work for you, you can define a custom XML vocabulary to represent your resource state, or maybe the parts that Atom doesn’t cover.

Although I’ve presented this as the last resort, that’s certainly not the common view. People come up with custom XML vocabularies all the time: that’s how there got to be so many of them. Almost every real web service mentioned in this book exposes its representations in a custom XML vocabulary. Amazon S3, Yahoo!’s search APIs, and the del.icio.us API all serve representations that use custom XML vocabularies, even though they could easily serve Atom or XHTML and reuse an existing vocabulary.

Part of this is tech culture. The microformats idea is fairly new, and a custom XML vocabulary still looks more “official.” But this is an illusion. Unless you provide a schema definition for your vocabulary, your custom tags have exactly the same status as a custom value for the HTML “class” attribute. Even a definition does nothing but codify the vocabulary you made up: it doesn’t confer any legitimacy. Legitimacy can only come “from the consent of the governed”: from other people adopting your vocabulary.

That said, there is a space for custom XML vocabularies. It’s usually easy to use XHTML instead of creating your own XML tags, but it’s not so easy when you need tags with a lot of custom attributes. In that situation, a custom XML vocabulary makes sense. All I ask is that you seriously think about whether you really need to define a new XML vocabulary for a given problem. It’s possible that in the future, people will err in the opposite direction, and create ad hoc microformats when they shouldn’t. Then I’ll urge caution before creating a microformat. But right now, the problem is too many ad hoc XML vocabularies.

Encoding Issues

It’s a global world (I actually heard someone say that once), and any service you expose must deal with the products of people who speak different languages from you and use different writing systems. You don’t have to understand all of these languages, but to handle multilingual data without mangling it, you do need to know something about character encodings: the conventions that let us represent human-readable text as strings of bytes.

Every text file you’ve ever created has some character encoding, even though you probably never made a decision about which encoding to use (it’s usually a system property). In the United States the encoding is usually UTF-8, US-ASCII, or Windows-1252. In western Europe it might also be ISO 8859-1. The default for HTML on the web is ISO 8859-1, which is almost but not quite the same as Windows-1252. Japanese documents are commonly encoded with EUC-JP, Shift_JIS, or UTF-8. If you’re curious about what character encodings are used in different places, most web browsers list the encodings they understand. My web browser supports five different encodings for simplified Chinese, five for Hebrew, nine for the Cyrillic alphabet, and so on. Most of these encodings are mutually incompatible, even when they encode the same language. It’s insane!

Fortunately there is a way out of this confusion. We as a species have come up with Unicode, a way of representing every human writing system. Unicode isn’t a character encoding, but there are two good encodings for it: UTF-8 (more efficient for alphabetic languages like English) and UTF-16 (more efficient for logographic languages like Japanese). Either of these encodings can handle text written in any combination of human languages. The best single decision you can make when handling multilingual data is to keep all of your data in one of these encodings: probably UTF-8 unless you live or do a lot of business in east Asia, then maybe UTF-16 with a byte-order mark.
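The efficiency difference is easy to measure. This Python sketch encodes two arbitrary sample strings both ways and counts the bytes; Python's "utf-16" codec prepends a two-byte byte-order mark, which is included in the totals.

```python
def encoded_sizes(text):
    """Return (bytes as UTF-8, bytes as UTF-16 with a byte-order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16"))


# Sample strings chosen for illustration.
print(encoded_sizes("web services"))      # (12, 26)
print(encoded_sizes("日本語のテキスト"))  # (24, 18)
```

The English sample takes one byte per character in UTF-8 and two in UTF-16. The Japanese sample takes three bytes per character in UTF-8 and two in UTF-16.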

This might be as simple as making a decision when you start the project, or you may have to convert an existing database. You might have to install an encoding converter to work on incoming data, or write encoding detection code. (The Universal Encoding Detector is an excellent autodetection library for Python. It’s got a Ruby port, available as the chardet gem.) It might be easy or difficult. But once you’re keeping all of this data in one of the Unicode encodings, most of your problems will be over. When your clients send you data in a weird encoding, you’ll be able to convert it to your chosen UTF-* encoding. If they send data that specifies no encoding at all, you’ll be able to guess its encoding and convert it, or reject it as unintelligible.
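The convert-or-reject step might look something like the following Python sketch. The function name and the short list of fallback encodings are assumptions; trying a fixed list in order is a crude stand-in for real detection by a library like the Universal Encoding Detector.

```python
def to_utf8(raw, declared_encoding=None, fallbacks=("utf-8", "windows-1252")):
    """Decode incoming bytes and re-encode them as UTF-8.

    Tries the encoding the client declared (if any), then each fallback
    in order. Raises ValueError if none of them decodes the data.
    """
    candidates = [declared_encoding] if declared_encoding else []
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python doesn't know.
            continue
    raise ValueError("could not determine the encoding of the incoming data")


incoming = "café".encode("windows-1252")   # b'caf\xe9'
print(to_utf8(incoming, "windows-1252"))   # b'caf\xc3\xa9'
print(to_utf8(incoming))                   # b'caf\xc3\xa9'
```

In the second call no encoding is declared: the bytes are not valid UTF-8, so the function falls through to Windows-1252.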

The other half of the equation is communicating with your clients: how do you tell them which encoding you’re using in your outgoing representations? Well, XML lets you specify a character encoding on the very first line:

<?xml version="1.0" encoding="UTF-8"?>

All but one of my recommended representation formats is based on XML, so that solves most of the problem. But there is an encoding problem with that one outlier, and there’s a further problem in the relationship between XML and HTTP.

XML and HTTP: Battle of the encodings

An XML document can and should define a character encoding in its first line, so that the client will know how to interpret the document. An HTTP response can and should specify a value for the Content-Type response header, so that the client knows it’s being given an XML document and not some other kind. But Content-Type can also specify a document character encoding with “charset,” and this encoding might conflict with the one the document itself declares.

Content-Type: application/xml; charset="ebcdic-fr-297+euro"

<?xml version="1.0" encoding="UTF-8"?>

Who wins? Surprisingly, HTTP’s character encoding takes precedence over the encoding in the document itself.[29] If the document says “UTF-8” and Content-Type says “ebcdic-fr-297+euro,” then extended French EBCDIC it is. Almost no one expects this kind of surprise, and most programmers write code first and check the RFCs later. The result is that the character encoding, as specified in Content-Type, tends to be unreliable. Some servers claim everything they serve is UTF-8, even though the actual documents say otherwise.
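The precedence rule can be sketched in a few lines of Python. The function names here are hypothetical, and the regular expression that reads the XML declaration is a simplification: it only works when the first line is in an ASCII-compatible encoding, so treat this as an illustration of the rule rather than a complete implementation of RFC 3023.

```python
import re
from email.message import Message
from typing import Optional

# Matches the encoding pseudo-attribute in an ASCII-compatible XML declaration.
_XML_DECL = re.compile(
    rb"^<\?xml[^>]*?\bencoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']")


def http_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def declared_xml_encoding(body: bytes) -> Optional[str]:
    """Return the encoding named in the document's first line, if any."""
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def rfc3023_encoding(content_type: str, body: bytes) -> Optional[str]:
    """The HTTP charset wins; the document's own claim is only a fallback."""
    return http_charset(content_type) or declared_xml_encoding(body)
```

Given the conflicting header and declaration shown above, rfc3023_encoding returns the EBCDIC charset from the header. Only when the header carries no charset does the document’s own declaration get a say.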

When serving XML documents, I don’t recommend going out of your way to send a character encoding as part of Content-Type. You can do it if you’re absolutely sure you’ve got the right encoding, but it won’t do much good. What’s really important is that you specify a document encoding. (Technically you can do without a document encoding if you’re using UTF-8, or UTF-16 with a byte-order mark. But if you have that much control over the data, you should be able to specify a document encoding.) If you’re writing a web service client, be aware that any character encoding specified in Content-Type may be incorrect. Use common sense to decide which encoding declaration to believe, rather than relying on a counterintuitive rule from an RFC a lot of people haven’t read.

Another note: when you serve XML documents, you should serve them with a media type of application/xml, not text/xml. If you serve a document as text/xml with no charset, the correct client behavior is to totally ignore the encoding specified in the XML document and interpret the XML document as US-ASCII.[30] Avoid these complications altogether by always serving XML as application/xml, and always specifying an encoding in the first line of the XML documents you generate.
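On the server side, that advice might look something like the following Python sketch. The xml_response helper and the sample element are made up for illustration, and the xml_declaration argument to ElementTree’s tostring requires Python 3.8 or later. How the headers actually get sent depends on your web framework, which is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def xml_response(root: ET.Element) -> Tuple[Dict[str, str], bytes]:
    """Serialize an element as UTF-8 with an explicit XML declaration."""
    # The document states its own encoding in its first line.
    body = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
    # application/xml, not text/xml, and no charset parameter to get wrong.
    headers = {"Content-Type": "application/xml"}
    return headers, body


greeting = ET.Element("greeting")
greeting.text = "café"
headers, body = xml_response(greeting)
```

The only encoding declaration in this response is the one inside the document, so there is nothing for a client to reconcile.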

The character encoding of a JSON document

I didn’t mention plain text in my list of recommended representation formats, mostly because plain text is not a structured format, but also because the lack of structure means there’s no way to specify the character encoding of “plain text.” JSON is a way of structuring plain text, but it doesn’t solve the character encoding problem. Fortunately, you don’t have to solve it yourself: just follow the standard convention.

RFC 4627 states that a JSON file must contain Unicode characters, encoded in one of the UTF-* encodings. Practically, this means either UTF-8, or UTF-16 with a byte-order mark. Plain US-ASCII will also work, since ASCII text happens to be valid UTF-8. Given this restriction, a client can determine the character encoding of a JSON document by looking at the first four bytes (the details are in RFC 4627), and there’s no need to specify an explicit encoding. You should follow this convention whenever you serve plain text, not just JSON.
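For the curious, the four-byte check can be sketched in Python as below. It follows the table in section 3 of RFC 4627, which looks at where the zero bytes fall. This works because an RFC 4627 document starts with two ASCII characters. The function names are my own, and the handling of inputs shorter than four bytes is an assumption the RFC doesn’t spell out.

```python
import json
from typing import Any


def detect_json_encoding(data: bytes) -> str:
    """Guess a JSON document's UTF-* encoding from its first four bytes."""
    nulls = [byte == 0 for byte in data[:4]]
    if len(nulls) == 4:
        if nulls == [True, True, True, False]:   # 00 00 00 xx
            return "utf-32-be"
        if nulls == [False, True, True, True]:   # xx 00 00 00
            return "utf-32-le"
    if len(nulls) >= 2:
        if nulls[:2] == [True, False]:           # 00 xx 00 xx
            return "utf-16-be"
        if nulls[:2] == [False, True]:           # xx 00 xx 00
            return "utf-16-le"
    return "utf-8"                               # xx xx xx xx


def parse_json_bytes(data: bytes) -> Any:
    """Decode with the detected encoding, then parse."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

In practice you may never need to write this yourself: Python’s own json.loads accepts bytes and performs a similar detection.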

[28] OpenSearch also defines a simple control flow: a special kind of resource called a “description document.” I’m not covering OpenSearch description documents in this book, mainly for space reasons.

[29] This is specified, and argued for, in RFC 3023.

[30] Again, according to RFC 3023, which few developers have read. For a lucid explanation of these problems, see Mark Pilgrim’s article “XML on the Web Has Failed” (http://www.xml.com/pub/a/2004/07/21/dive.html).
