The Basics of Language-Sensitive String Comparison

The first thing to remember is that you can't simply rely on comparing numeric code point values when you're comparing two strings. Unless the strings being compared conform to a very tightly restricted grammar, this approach will almost always give you the wrong answer. (The one exception occurs when the ordering and equivalences implied by the comparison routine have no user-visible effects, but even then you must worry about some wrinkles; see “Language-Insensitive String Comparison” later in this chapter.)
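As a rough illustration of the problem, the following Python sketch shows what a pure code point comparison does to both ordering and equivalence. The word list and variable names are made up for this example and do not come from the text. The sketch only demonstrates the failure; normalization, used in the last step, addresses one kind of equivalence and does not by itself produce a language-sensitive ordering.

```python
import unicodedata

# Hypothetical sample data, chosen only to expose the problem.
words = ["zebra", "Apple", "apple", "école", "Zebra"]

# Python's built-in string ordering compares code point values,
# so every uppercase ASCII letter sorts before every lowercase one,
# and the accented "é" (U+00E9) sorts after "z" (U+007A).
by_code_point = sorted(words)
print(by_code_point)
# ['Apple', 'Zebra', 'apple', 'zebra', 'école']

# Two different code point sequences that a reader sees as the same "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A binary comparison treats them as different strings.
binary_equal = precomposed == decomposed
print(binary_equal)
# False

# Normalizing both sides to the same form (NFC here) before comparing
# recovers this particular equivalence.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
print(normalized_equal)
# True
```

A reader of a dictionary or an index would expect "école" to appear among the words beginning with "e" and would not expect "Zebra" to come before "apple", which is why the code point order above counts as wrong whenever the result is visible to users.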

This isn't just a Unicode issue. Any binary comparison will give the wrong answers with most encodings. In fact, for every encoding standard, it's probably possible to come up ...
