Cover Page by Gary Cornell, Cay S. Horstmann

Safari, the world’s most comprehensive technology and business learning platform.

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required

O'Reilly logo

5.4.2. Decomposition

Occasionally, a character or sequence of characters can be described in more than one way in Unicode. For example, an “Å” can be Unicode character U+00C5, or it can be expressed as a plain A (U+0065) followed by a ° (“combining ring above”; U+030A). Perhaps more surprisingly, the letter sequence “ffi” can be described with a single character “Latin small ligature ffi” with code U+FB03. (One could argue that this is a presentation issue that should not have resulted in different Unicode characters, but we don’t make the rules.)

The Unicode standard defines four normalization forms (D, KD, C, and KC) for strings. See www.unicode.org/unicode/reports/tr15/tr15-23.html for the details. Two of them are used for collation. In the ...

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required