Encode XML Documents

Character encoding is quite important, especially as XML documents cross international boundaries. This hack will help you understand and use character encoding in XML.

To understand XML, you need to understand the characters that can make up XML documents. XML 1.0 supports the UCS standard, officially ISO/IEC 10646-1:1993 Information technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multilingual Plane, and its seven amendments (search for 10646 on http://www.iso.ch). Since the time that XML became a recommendation at the W3C, UCS has advanced to ISO/IEC 10646-1:2000. In addition, Unicode is a parallel standard to UCS (see http://www.unicode.org). XML 1.0 supports Unicode Version 2.0, but Unicode has advanced to Version 4.0 at this time, so there are differences in what XML 1.0 supports and in what the latest versions of UCS and Unicode support.

Both ISO/IEC 10646-1 UCS and Unicode assign the same values and descriptions for each character; however, Unicode defines some semantics for the characters that ISO/IEC 10646-1 does not.

Tip

Mike Brown’s XML tutorial at http://www.skew.org/xml/tutorial is good background reading on Unicode and character sets. To look up general character charts, see Kosta Kostis’s charts at http://www.kostis.net/charsets/. For Unicode character charts, go to http://www.unicode.org/charts/.

Each character in Unicode is represented by a unique, hexadecimal (base-16) number. The first 128 characters ...

Get XML Hacks now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.