UTF-8

UTF-8 is the 8-bit Unicode encoding form. It was designed to allow Unicode to be used in places that support only 8-bit character encodings. A Unicode code point is represented using a sequence of anywhere from one to four 8-bit code units.
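As a rough illustration of the one-to-four code unit range, the short Python sketch below encodes a few code points and counts the resulting bytes. The sample characters and the helper name `utf8_length` are chosen here for illustration only and do not come from the original text.

```python
# Sample code points picked for illustration (an assumption, not from the text):
# U+0041 LATIN CAPITAL LETTER A, U+00E9 LATIN SMALL LETTER E WITH ACUTE,
# U+20AC EURO SIGN, U+1F600 GRINNING FACE.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many 8-bit code units UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the number of bytes, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Each of the four samples falls into a different length class, so the byte counts come out as 1, 2, 3, and 4 respectively.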

One vitally important property of UTF-8 is that it's 100 percent backward compatible with ASCII. That is, valid 7-bit ASCII text is also valid UTF-8 text. As a consequence, UTF-8 can be used in any environment that supports 8-bit ASCII-derived encodings, and that environment will be able to correctly interpret and display the 7-bit ASCII characters. (The characters represented by byte values where the most significant bit is set, of course, aren't backward compatible; they have a different representation ...
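The ASCII compatibility described here can be checked with a small Python sketch. The sample string is hypothetical, and Latin-1 is used only as one assumed example of an 8-bit ASCII-derived encoding; the original text does not name a specific one.

```python
# Hypothetical sample text containing only 7-bit ASCII characters.
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text, the two encodings produce identical byte sequences.
same_bytes = ascii_bytes == utf8_bytes

# Every byte of ASCII text has its most significant bit clear (value < 0x80).
all_seven_bit = all(b < 0x80 for b in utf8_bytes)

# A character outside ASCII (U+00E9) is encoded differently in the two schemes.
latin1_form = "\u00e9".encode("latin-1")  # one byte with the high bit set
utf8_form = "\u00e9".encode("utf-8")      # two bytes, each with the high bit set

print(same_bytes, all_seven_bit)
print(latin1_form.hex(" "), "vs", utf8_form.hex(" "))
```

The first comparison shows why an ASCII-only reader handles such text unchanged, while the second shows that bytes with the high bit set do not carry over between Latin-1 and UTF-8.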
