Character-Set Metadata
Some environments keep track of which encodings particular documents are written in. For instance, web servers that transmit XML documents precede them with an HTTP header that looks something like this:
HTTP/1.1 200 OK
Date: Sun, 28 Oct 2001 11:05:42 GMT
Server: Apache/1.3.19 (Unix) mod_jk mod_perl/1.25 mod_fastcgi/2.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/xml; charset=iso-8859-1
The Content-Type field of the HTTP header provides the MIME media type of the document. This may, as shown here, specify which character set the document is written in. An XML parser reading this document from a web server should use this information to determine the document’s character encoding.
Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the US-ASCII encoding. If the MIME media type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.
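The fallback rules just described can be sketched in a few lines of Python. This is only an illustration, not how any particular XML parser does it: the function name charset_from_content_type is hypothetical, and a return value of None simply signals that the caller would have to inspect the first few bytes of the document itself.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return (media_type, encoding) for a Content-Type header value.

    Hypothetical helper. An encoding of None means no charset was
    declared and the parser must guess from the document's first bytes.
    """
    # Reuse the standard library's MIME header parsing.
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # e.g. "text/xml"
    charset = msg.get_param("charset")    # None if the parameter is absent

    if charset is not None:
        # An explicit charset parameter takes precedence.
        return media_type, charset.lower()
    if media_type == "text/xml":
        # text/xml without a charset is assumed to be US-ASCII.
        return media_type, "us-ascii"
    # application/xml (and anything else): leave the guess to the parser.
    return media_type, None


print(charset_from_content_type("text/xml; charset=iso-8859-1"))
print(charset_from_content_type("text/xml"))
print(charset_from_content_type("application/xml"))
```

Applied to the header shown earlier, the function reports iso-8859-1. With the charset parameter removed it falls back to us-ascii for text/xml, and returns None for application/xml.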
Tip
Since ASCII is almost never an appropriate character set for an XML document, application/xml is much preferred over text/xml. Unfortunately, most web servers, including Apache 2.0.36 and earlier, are configured to use text/xml by default. If you’re running such a version, you should probably upgrade before serving XML files.[1]
We’ve focused on MIME types in HTTP headers because that’s the most common place where character set metadata is ...