UTF-8

Since every Unicode character is encoded in exactly two bytes, Unicode is a fairly simple encoding. The first two bytes of a file are the first character. The next two bytes are the second character, and so on. This makes parsing Unicode data relatively simple compared to schemes that use variable-width characters. The downside is that Unicode is far from the most efficient encoding possible. In a file containing mostly English text, the high bytes of almost all the characters will be 0. These bytes can occupy as much as half of the file. If you’re sending data across the network, Unicode data can take twice as long.

A more efficient encoding can be achieved for files that are composed primarily of ASCII text by encoding the more common characters in fewer bytes. UTF-8 is one such format that encodes the non-null ASCII characters in a single byte, characters between 128 and 2047 and ASCII null in two bytes, and the remaining characters in three bytes. While theoretically this encoding might expand a file’s size by 50%, because most text files contain primarily ASCII, in practice it’s almost always a huge savings. Therefore, Java uses UTF-8 in string literals, identifiers, and other text data in compiled byte code. UTF-8 is also a common encoding for XML files and the native encoding of Bell Labs’ experimental Plan 9 operating system.

To better understand UTF-8, consider a typical Unicode character as a sequence of 16 bits:

x15

x14

x13

x12

x11

x10

x9

x8

x7

x6

x5

Get Java I/O now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.