Name

Characters

Synopsis

XML documents are inherently text documents, which are composed of characters. To ensure that documents are portable across disparate computer systems and can contain content in as many written human languages as possible, XML parsers are required to implement the Unicode standard. This does not mean that all XML documents must be saved and edited in Unicode, but it does mean that the XML parser must be able to convert your document from its native character encoding to Unicode. All XML parsers are required to support, at a minimum, both UTF-8 and UTF-16 as input encoding formats; support for any other encoding is optional. For more information on encoding formats and Unicode, see Chapter 27.
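The conversion to Unicode described above can be observed with a short sketch. It assumes Python's standard `xml.etree.ElementTree` module (not something this entry prescribes) and a made-up `<word>` element; ISO-8859-1 is included only as an example of an optional encoding that this particular parser happens to support.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content: "café", which contains U+00E9.
text = "caf\u00e9"

decoded = []
for encoding in ("utf-8", "utf-16", "iso-8859-1"):
    # Build the same document and serialize it to bytes in each encoding.
    source = f'<?xml version="1.0" encoding="{encoding}"?><word>{text}</word>'
    data = source.encode(encoding)

    # The parser reads the bytes and hands back Unicode text,
    # whatever the on-disk encoding was.
    root = ET.fromstring(data)
    decoded.append(root.text)

# The byte sequences differ, but the parsed character data is identical.
all_equal = all(value == text for value in decoded)
```

The point of the sketch is that the application only ever sees Unicode characters; the encoding declaration matters to the parser, not to the code that consumes the parsed text.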

Tip

One of the primary differences between XML 1.0 and XML 1.1 is the definition of which Unicode characters are valid within an XML document. In XML 1.0, many of the ASCII control characters (such as BEL and NAK) were explicitly disallowed within XML documents. XML 1.1 permits these control characters (except for null, 0x0000) as long as they’re escaped with numeric character references. XML 1.1 also requires that the C1 controls between 0x0080 and 0x009F (other than NEL, 0x0085) be escaped with numeric character references, which XML 1.0 does not require.
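The difference between the two versions can be made concrete with a small sketch. The ranges below are transcribed from the `Char` productions of the XML 1.0 and XML 1.1 recommendations and the `RestrictedChar` production of XML 1.1; the function names are hypothetical and chosen only for illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point may appear in an XML 1.0 document."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if the code point may appear in an XML 1.1 document at all."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """True if XML 1.1 allows the code point only as a character reference."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


BEL, NAK, NEL = 0x07, 0x15, 0x85

# BEL and NAK: illegal in XML 1.0, legal in XML 1.1 only when escaped.
bel_status = (is_xml10_char(BEL), is_xml11_char(BEL), is_xml11_restricted(BEL))
nak_status = (is_xml10_char(NAK), is_xml11_char(NAK), is_xml11_restricted(NAK))

# A C1 control such as 0x80: literal in XML 1.0, escape required in XML 1.1.
c1_status = (is_xml10_char(0x80), is_xml11_restricted(0x80))

# Total number of code points that XML 1.1 requires to be escaped.
restricted_count = sum(is_xml11_restricted(cp) for cp in range(0x100))
```

In an XML 1.1 document, a restricted character such as BEL would be written as `&#x7;` rather than as a literal byte; null remains forbidden in both versions, escaped or not.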
