Canonical Accent Ordering

Unicode lets you attach an arbitrary number of combining marks to a base character, leading to arbitrarily long combining character sequences. Because many languages permit a single base character to carry two or more diacritical marks, these combinations do occur in practice. In some cases, even precomposed characters include multiple diacriticals.

One of the crazy things about having precomposed characters and combining character sequences is that both can be used together to represent the same character. Consider the letter o with a circumflex on top and a dot beneath, which occurs in Vietnamese. This letter has five possible representations in Unicode:

 U+006F LATIN SMALL LETTER O U+0302 ...
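To make the equivalence concrete, here is a minimal sketch using Python's standard `unicodedata` module. The five code point sequences are spelled out here as an illustration, based on the Unicode character database, rather than copied from the list above, and the names `REPRESENTATIONS`, `code_points`, and `normalized_forms` are hypothetical helpers invented for this example. It shows that normalization collapses every spelling to a single sequence, with the dot below (combining class 220) ordered ahead of the circumflex (combining class 230) in the decomposed form.

```python
import unicodedata

# Five code point sequences for "o with circumflex above and dot below".
# (Illustrative list, assumed from the Unicode character database.)
REPRESENTATIONS = [
    "\u1ED9",              # fully precomposed letter
    "\u00F4\u0323",        # o-circumflex + combining dot below
    "\u1ECD\u0302",        # o-dot-below + combining circumflex
    "\u006F\u0302\u0323",  # o + combining circumflex + combining dot below
    "\u006F\u0323\u0302",  # o + combining dot below + combining circumflex
]


def code_points(text):
    """Render a string as space-separated U+XXXX code point labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalized_forms(form):
    """Return the set of distinct strings left after normalizing every
    representation to the given form ("NFC" or "NFD")."""
    return {unicodedata.normalize(form, s) for s in REPRESENTATIONS}


if __name__ == "__main__":
    for s in REPRESENTATIONS:
        nfd = unicodedata.normalize("NFD", s)
        print(f"{code_points(s):<24} -> NFD {code_points(nfd)}")
    # Each normalization form leaves exactly one distinct string.
    print(len(normalized_forms("NFD")), len(normalized_forms("NFC")))
```

Comparing strings only after normalizing them to the same form is the usual way to treat these spellings as equal; a plain `==` on the raw strings would report five different values.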
