Character References

Unicode contains more than 96,000 different characters covering almost all of the world’s written languages. Predefining entity references for each of these characters, most of which will never be used in any one document, would impose an excessive burden on XML parsers. Rather than pick and choose which characters are worthy of being encoded as entities, XML goes to the other extreme. It predefines entity references only for characters that have special meaning as markup in an XML document: <, >, &, ", and '. All these are ASCII characters that are easy to type in any text editor.
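
To see the five predefined references resolved, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element name sample is hypothetical and chosen only for illustration; any conforming XML parser should behave the same way.

```python
import xml.etree.ElementTree as ET

# One element whose content uses all five predefined entity references.
# The element name "sample" is a made-up name for this sketch.
doc = "<sample>&lt; &gt; &amp; &quot; &apos;</sample>"

# The parser replaces each reference with the character it stands for.
resolved = ET.fromstring(doc).text
print(resolved)
```

No declaration is needed for these five references; a reference to any other entity name would have to be declared in a DTD first.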

For other characters that may not be accessible from an ASCII text editor, XML lets you use character references. A character reference gives the number of the particular Unicode character it stands for, in either decimal or hexadecimal. Decimal character references look like &#1114;; hexadecimal character references have an extra x after the &#, so they look like &#x45A;. Both of these references refer to the same character, њ, the Cyrillic small letter “nje” used in Serbian and Macedonian.

For example, suppose you want to include the Greek maxim “σοφός έαυτόν γιγνώσκει” (“The wise man knows himself”) in your XML document. However, you only have an ASCII text editor at your disposal. You can replace each Greek letter with the correct character reference, like this:

<maxim> &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2; &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD; &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9; ...
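
Typing references like these by hand is error-prone, so the conversion is often scripted. The following is a rough sketch in Python, not a method prescribed by XML itself. The helper name to_char_refs is hypothetical, and it assumes the input contains no markup characters such as < or &, which would still need the predefined entity references.

```python
import xml.etree.ElementTree as ET

def to_char_refs(text, hexadecimal=True):
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration; ASCII characters pass through
    unchanged, so markup characters such as < and & are NOT escaped here.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                 # plain ASCII, keep as typed
        elif hexadecimal:
            out.append(f"&#x{code:X};")    # hexadecimal form, e.g. &#x45A;
        else:
            out.append(f"&#{code};")       # decimal form, e.g. &#1114;
    return "".join(out)

# The Cyrillic letter nje (U+045A) in both notations.
nje_decimal = to_char_refs("\u045A", hexadecimal=False)
nje_hex = to_char_refs("\u045A")

# A parser resolves both notations to the same character.
parsed = ET.fromstring(f"<c>{nje_decimal}{nje_hex}</c>").text

# The first word of the maxim, written with escapes so this source stays ASCII.
first_word = to_char_refs("\u03C3\u03BF\u03C6\u03CC\u03C2")
print(nje_decimal, nje_hex, first_word)
```

Choosing hexadecimal by default is only a convenience: Unicode code charts list code points in hexadecimal, so the output is easier to check against them.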

