Not all characters are available on the keyboard! This hack shows you how to represent such characters in an XML document by using decimal and hexadecimal character references, and how to represent entities by using entity references.
In XML, character and entity references are formed by surrounding a numerical value or a name with & and ; (a character reference also takes a # before the number). For example, &#169; is a decimal character reference and &copy; is an entity reference. This hack shows you how to use both.
According to the third edition of the XML 1.0 specification (http://www.w3.org/TR/REC-xml/), XML processors must accept more than 1,000,000 Unicode characters, whose code points are usually written in hexadecimal (http://www.w3.org/TR/REC-xml/#charsets). You won't find all of those characters on your keyboard! Don't worry. You can use character references instead.
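The figure of more than 1,000,000 can be checked against the Char production in the specification, which allows three control characters plus three ranges of code points. The following Python sketch simply adds them up. The names CHAR_RANGES, count_xml_chars, and is_xml_char are invented for this illustration and do not come from any library.

```python
# Inclusive code point ranges allowed by the Char production of XML 1.0:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def count_xml_chars(ranges=CHAR_RANGES):
    """Return how many code points the inclusive ranges cover."""
    return sum(high - low + 1 for low, high in ranges)


def is_xml_char(code_point, ranges=CHAR_RANGES):
    """Return True if the code point falls inside one of the ranges."""
    return any(low <= code_point <= high for low, high in ranges)


total = count_xml_chars()
print(f"{total:,}")  # prints 1,112,033
```

The surrogate block (#xD800 to #xDFFF) and the code points #xFFFE and #xFFFF fall in the gaps between the ranges, so is_xml_char(0xD800) returns False.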
You can look up the semantics of individual Unicode characters at http://www.unicode.org/charts/.
You can reference characters using either decimal or hexadecimal numbers. Which one you use is a matter of style. The document Namen.xml uses both (Example 1-5); it contains some German names enclosed in German language tags.
Example 1-5. Namen.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="Namen.css" type="text/css"?>

<Namen xml:lang="de">
 <Name>
  <Vorname>Marie</Vorname>
  <Nachname>M&#252;ller</Nachname>
  <Geschlecht>&#9792;</Geschlecht>
 </Name>
 <Name>
  <Vorname>Klaus</Vorname>
  <Nachname>M&#xFC;ller</Nachname>
  <Geschlecht>&#x2642;</Geschlecht>
 </Name>
</Namen>
Lines 7 and 8 contain the decimal character references &#252; and &#9792;, respectively. The first one refers to the letter u with an umlaut (ü) and the second one is a female sign (♀). Lines 12 and 13 use the hexadecimal character references &#xFC; (ü) and &#x2642; (male sign, ♂), respectively. You can see how these character references are rendered in Opera in Figure 1-6.
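To see that the decimal and hexadecimal forms are interchangeable, here is a minimal Python sketch that builds both kinds of reference for a character and then parses a small fragment containing each form. It assumes the standard library module xml.etree.ElementTree; the helper names decimal_ref and hex_ref are made up for the example.

```python
import xml.etree.ElementTree as ET


def decimal_ref(ch):
    """Build a decimal character reference such as &#252; for one character."""
    return f"&#{ord(ch)};"


def hex_ref(ch):
    """Build a hexadecimal character reference such as &#xFC; for one character."""
    return f"&#x{ord(ch):X};"


# Both spellings of the surname from Namen.xml, one per reference style.
fragment = (
    "<Namen>"
    "<Nachname>M&#252;ller</Nachname>"
    "<Nachname>M&#xFC;ller</Nachname>"
    "</Namen>"
)

# The parser resolves each reference to the same character, U+00FC.
names = [element.text for element in ET.fromstring(fragment).iter("Nachname")]

print(decimal_ref("\u00fc"), hex_ref("\u00fc"))
print(names[0] == names[1])
```

Because the parser replaces each reference with the character it names, an application reading the document cannot tell which style was used in the source.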
The xml:lang attribute on line 4 is a special language identification attribute in XML 1.0 (http://www.w3.org/TR/REC-xml/#sec-lang-tag). Its value, de, is a language identifier as defined by RFC 3066 (http://www.ietf.org/rfc/rfc3066.txt) and ISO 639 (search http://www.iso.ch). Another example of a language identifier is fr (French).
XML has five predefined entities, listed in Table 1-1. These predefined entities can be used where the equivalent literal character is forbidden. For example, an attribute value cannot contain a less-than sign (<), because it looks too much like the beginning of a tag to an XML parser. No problem: you can use &lt; instead. Likewise, you cannot use an ampersand in parsed character data, the text content of an element. Why? Again, it looks like the beginning of a character or entity reference to an XML parser. Again, no problem: you can use &amp; instead.
Table 1-1. XML predefined entities
Entity reference    Description
&lt;                Less-than sign or open angle bracket (<)
&gt;                Greater-than sign or close angle bracket (>)
&amp;               Ampersand (&)
&apos;              Apostrophe or single quote (')
&quot;              Quote or double quote (")
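When XML is generated by a program, the predefined entities are usually inserted by an escaping routine rather than by hand. As a sketch, the following Python example uses escape and unescape from the standard library module xml.sax.saxutils. By default escape handles only &, <, and >, so the mapping QUOTE_ENTITIES (a name chosen for this example) is passed in to cover the two quote characters as well.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements for the two quote characters, which escape()
# leaves alone unless a mapping like this one is supplied.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}

# Only the ampersand needs escaping in this element content.
content = escape("O'Reilly & Associates")

# All five predefined entities appear once the quote mapping is added.
everything = escape("<a href=\"x\">'&'</a>", QUOTE_ENTITIES)

# unescape() reverses the default replacements.
original = unescape(content)

print(content)
print(everything)
print(original)
```

The first printed line is O'Reilly &amp; Associates, which is the same text used inside the copyright element of copy.xml.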
The following document, copy.xml in Example 1-6, uses a predefined entity and also declares and references a new entity.
Example 1-6. copy.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="copy.css" type="text/css"?>
<!DOCTYPE time [<!ENTITY copy "&#169;">]>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
 <copyright>&copy; O'Reilly &amp; Associates</copyright>
</time>
The entity copy is declared in the document type declaration on line 3. The <!ENTITY keyword is followed by the entity name, copy, and this is followed by the value or content of the entity in quotes, "&#169;". (This entity comes standard in HTML and XHTML.) Line 12 of this document references the entity declared on line 3 (&copy;) and also references the XML 1.0 predefined entity for an ampersand (&amp;). Open this document in Firefox (it is styled by the CSS stylesheet copy.css) and it will appear like Figure 1-7.
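A browser is not the only way to check how the references resolve. The Python sketch below parses a trimmed-down copy of the document with the standard library module xml.etree.ElementTree, relying on its expat-based parser to expand entities declared in the internal DTD subset. The constant COPY_XML is an abbreviated stand-in for copy.xml, not the full file.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of copy.xml: the entity declaration plus the
# one element that references it.
COPY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time [<!ENTITY copy "&#169;">]>
<time timezone="PST">
 <copyright>&copy; O'Reilly &amp; Associates</copyright>
</time>"""

root = ET.fromstring(COPY_XML)

# Both the declared entity and the predefined one are already expanded
# by the time the text is handed to the application.
copyright_text = root.findtext("copyright")

print(copyright_text)
print(hex(ord(copyright_text[0])))  # code point of the first character
```

If the DOCTYPE line is removed, the same call to ET.fromstring raises a ParseError for the undefined entity, because copy is not one of the five predefined entities.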
Character references provide a convenient means to access a very large number of characters. Entities [Hack #25] are also a convenient means to store information and access it elsewhere, even multiple times if necessary.