Language-Sensitive Comparison on Unicode Text

The considerations discussed so far apply regardless of which encoding standard you use to encode your characters. Unicode adds a few more interesting complications of its own.

Unicode Normalization

Unlike in most other encoding schemes, many characters and sequences of characters have multiple legal representations in Unicode. One of the requirements of supporting Unicode is that (provided you support all of the characters involved) all representations of a character be treated as equal. Thus, whether you represent “ä” with

U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

or

U+0061 LATIN SMALL LETTER A
U+0308 COMBINING DIAERESIS

it should look and behave the same way everywhere. ...
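As a minimal sketch of what treating the two representations as equal can look like in practice, the following Python snippet compares the two encodings of "ä" shown above. It assumes Python's standard `unicodedata` module and normalizes both strings to the same form (NFD here) before comparing. The helper name `canonically_equal` is hypothetical, chosen only for this illustration, and this is one possible approach rather than the method the text prescribes.

```python
import unicodedata

# The two representations of "ä" discussed above.
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD.

    Any single normalization form works, as long as it is applied to both sides.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A plain code-point comparison sees two different strings...
raw_equal = precomposed == decomposed

# ...while a comparison after normalization treats them as the same text.
normalized_equal = canonically_equal(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

The raw comparison fails because the strings contain different code points (one versus two), while the normalized comparison succeeds because both reduce to the same canonical sequence.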
