Combining Diacritical Marks

The Combining Diacritical Marks block contains characters that are not used on their own, such as the accent grave and circumflex. Instead, they are merged with the preceding character to form a single glyph. For example, to write the character Ñ, you could type the ASCII letter N followed by the combining tilde character (0x303), like this: N&#x303;. When rendered, this combination would produce the single glyph Ñ. In Table 27-11, the character to which the diacritical mark is attached is a dotted circle (Unicode code point 0x25CC), but of course it could be any normal character.
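The following minimal Python sketch illustrates how a base letter and a combining mark remain two separate code points even though they display as one glyph. It relies only on the standard unicodedata module; the variable names are illustrative and not taken from the book.

```python
import unicodedata

base = "N"                   # ordinary ASCII letter
combining_tilde = "\u0303"   # combining tilde from the Combining Diacritical Marks block
dotted_circle = "\u25CC"     # placeholder base character used in code charts

# The sequence displays as one glyph but is stored as two code points.
decomposed = base + combining_tilde
print(len(decomposed))

# A nonzero canonical combining class marks a character as a combining mark;
# ordinary base characters have class 0.
print(unicodedata.name(combining_tilde), unicodedata.combining(combining_tilde))
print(unicodedata.name(dotted_circle), unicodedata.combining(dotted_circle))

# The same mark can be attached to any base character, including the dotted circle.
print(dotted_circle + combining_tilde)
```

Whether the sequence actually appears as a single glyph depends on the font and rendering engine in use.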

For compatibility with legacy character sets, there is often more than one way to write accented characters. For example, the letter é, e with accent acute, can be written either as the single character 0xE9 or as the letter e (0x65) followed by a combining accent acute (0x301). This can be a problem for naïve algorithms for searching, sorting, indexing, and performing other operations on text. It’s also an issue for XML. For instance, the <resumé> start-tag cannot be matched with a </resumé> end-tag if one uses character 0xE9 and the other uses 0x65 followed by 0x301. Where such multiple ways of writing the same character exist, the W3C strongly recommends using the precomposed form; that is, you should use the single character instead of the base character followed by a combining diacritical mark. In XML, these marks are primarily intended for forming characters that do not have ...
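The mismatch between the two spellings of resumé can be sketched in Python as follows. This is only an illustration of the comparison problem, assuming Unicode Normalization Form C (NFC) as the way to obtain the precomposed form; the helper name same_after_nfc is hypothetical, and an XML parser will not perform this normalization for you.

```python
import unicodedata

precomposed = "resum\u00e9"    # é as the single character 0xE9
decomposed = "resume\u0301"    # e (0x65) followed by combining acute accent (0x301)

# A naive comparison treats the two names as different strings.
print(precomposed == decomposed)

# NFC composes base character + combining mark into the precomposed
# character wherever one exists.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_after_nfc(precomposed, decomposed))
```

Normalizing text to the precomposed form before it is written into a document avoids the problem at the source, which is in line with the W3C recommendation described above.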
