Chapter 27. Character Sets

By default, an XML parser assumes that a document is written in the UTF-8 encoding of Unicode. However, a document may instead be written in any character set the XML processor understands, provided that the character set is identified either by external metadata such as an HTTP header or by internal metadata such as a byte-order mark or an encoding declaration. For example, a document written in the Latin-5 character set would need this XML declaration:

<?xml version="1.0" encoding="ISO-8859-9"?>
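As a minimal sketch of how the declaration is used, the following Python example serializes a small document to Latin-5 bytes and hands those bytes to the standard library's `xml.etree.ElementTree` parser, which reads the encoding declaration itself. The `city` element, its text, and the `decodes_as_utf8` helper are hypothetical and exist only for illustration; other processors may support a different set of encodings.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element name and text are illustrative only.
DOCUMENT = (
    '<?xml version="1.0" encoding="ISO-8859-9"?>\n'
    "<city>İstanbul, Iğdır, Şanlıurfa</city>"
)

# Serialize to Latin-5 bytes, as the document would be stored or transmitted.
raw = DOCUMENT.encode("iso-8859-9")

# Passing bytes (not str) lets the parser apply the encoding declaration.
root = ET.fromstring(raw)
print(root.text)


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same bytes are not valid UTF-8, which is why the declaration is needed.
print(decodes_as_utf8(raw))
```

Because Latin-5 is a single-byte character set, `raw` has exactly one byte per character of `DOCUMENT`. Without the declaration, a parser that falls back to the UTF-8 default would be expected to reject these bytes.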

Most good XML processors understand many common character sets. The XML specification recommends the character set names shown in Table 27-1, and you should use these names when using any of these character sets. Of these, only UTF-8 and UTF-16 must be supported by all XML processors, although many XML processors support every character set listed here, and many support additional character sets besides. When using a character set not listed here, you should use the name specified in the IANA character sets registry at http://www.iana.org/assignments/character-sets.
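Whether a given name is understood depends on the processor. As a rough illustration only, the sketch below asks Python's codec registry whether it recognizes a name; this is a stand-in for a processor's own list, not a check against the XML specification or the IANA registry. The function name `is_known_encoding` and the sample names are hypothetical.

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Return True if Python's codec registry recognizes the name.

    Assumption: the codec registry approximates what a Python-based
    XML processor can decode; it is not the IANA registry.
    """
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for candidate in ("UTF-8", "UTF-16", "ISO-8859-9", "no-such-charset"):
    print(candidate, is_known_encoding(candidate))
```

A check like this can flag a misspelled encoding name before a document is published, but a name that passes here may still be unsupported by a different processor.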

Table 27-1. Character set names defined by the XML specification

Name     Character set

UTF-8    The default encoding used in XML documents, unless an encoding declaration, byte-order mark, or external metadata specifies otherwise; a variable-width encoding of Unicode that uses one to four bytes per character. UTF-8 is designed such that all ASCII documents are legal UTF-8 documents, which is not ...
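The two properties of UTF-8 mentioned in this entry, variable width and compatibility with ASCII, can be observed directly. The following is a small sketch using only Python's built-in string encoding; the sample characters are arbitrary choices made for illustration.

```python
# Arbitrary sample characters, one for each UTF-8 width.
samples = [
    "A",           # U+0041, in the ASCII range
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

# Number of bytes each character occupies when encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, width in widths.items():
    print(f"U+{ord(ch):04X} uses {width} byte(s)")

# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# so an ASCII document can be read as UTF-8 without conversion.
ascii_text = "plain ASCII text"
ascii_bytes = ascii_text.encode("ascii")
same_bytes = ascii_bytes == ascii_text.encode("utf-8")
round_trip = ascii_bytes.decode("utf-8")
print(same_bytes, round_trip == ascii_text)
```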

