Detecting Unicode Storage Formats

The Unicode 2.0 standard, in its discussion of the byte order mark, described how the mark could be used to tell not just whether a Unicode file had the expected byte order, but whether it was a Unicode file at all. The idea is that the sequence 0xFE 0xFF (in Latin-1, the lowercase Icelandic letter “thorn” followed by a lowercase y with a diaeresis) would almost never be the first two characters of a normal ASCII/Latin-1 document. Therefore, you could look at something you knew was a text file and tell what it was: if the first two bytes were 0xFE 0xFF, it was Unicode; if the bytes were 0xFF 0xFE, it was byte-swapped Unicode; and if the bytes were anything else, it was whatever the default encoding for the system ...
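
The two-byte test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the standard: the function name sniff_bom is hypothetical, Latin-1 stands in for "the default encoding for the system," and the Python codec names "utf-16-be" and "utf-16-le" are assumed to correspond to "Unicode" and "byte-swapped Unicode" as the bytes appear in the file.

```python
def sniff_bom(first_bytes: bytes, default: str = "latin-1") -> str:
    """Guess a text file's encoding from its first two bytes.

    Hypothetical helper. The codec names returned here are an
    assumption about how "Unicode" and "byte-swapped Unicode"
    map onto Python's codecs.
    """
    head = first_bytes[:2]
    if head == b"\xfe\xff":
        # Byte order mark U+FEFF stored high byte first.
        return "utf-16-be"
    if head == b"\xff\xfe":
        # Same mark with its two bytes swapped.
        return "utf-16-le"
    # Anything else (including an empty or one-byte file):
    # fall back to the assumed system default.
    return default


if __name__ == "__main__":
    data = b"\xfe\xff\x00H\x00i"
    encoding = sniff_bom(data)
    # Skip the two-byte mark before decoding the remaining text.
    print(encoding, data[2:].decode(encoding))
```

The check only makes sense for something already known to be a text file, as the passage says. A Latin-1 document that really does begin with thorn followed by y-diaeresis would be misidentified, which is the rare case the heuristic accepts.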
