Other Encodings

Although Unicode is the most advanced and comprehensive character set yet designed on this planet, it has not taken the world by storm. Compared to the vast quantities of ASCII data, there are virtually no Unicode files on today’s computers. Unicode support is growing, but there will doubtless be legacy data in other encodings that must be read for centuries to come. Much of it is in the Unicode subsets ASCII and ISO Latin-1, but much of it is also in less popular encodings like EBCDIC and MacRoman. These encodings cover only English and a few Western European languages; multiple encodings are also in use for Arabic, Turkish, Hebrew, Greek, Cyrillic, Chinese, Japanese, Korean, and many other languages and scripts.

The Reader and Writer classes (discussed in the next chapter) allow you to read and write data in these different character sets. The String class also has a number of methods that convert between different encodings (though a String object itself is always represented in Unicode). Furthermore, the JDK includes a command-line tool based on these classes, called native2ascii, that performs such conversions on existing files.
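As a minimal sketch of the two conversion routes just mentioned, the following example re-encodes a small byte array from ISO Latin-1 to UTF-8, first through a Reader/Writer pair and then through the String class. The class name EncodingSketch and the method transcode are hypothetical names chosen for illustration, and the choice of UTF-8 as the target encoding is an assumption; any encoding supported by the runtime could be substituted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the JDK.
class EncodingSketch {

    // Decode bytes from one encoding and re-encode them in another.
    // The data passes through Java's Unicode chars in between.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // In-memory streams should not fail, but the API declares IOException.
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and flushed) by this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The word "cafe" with an accented final e; in Latin-1 that letter is one byte.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Route 1: Reader and Writer.
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
        // prints "4 bytes in, 5 bytes out"

        // Route 2: String constructor and getBytes().
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        byte[] viaString = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(viaString.length == utf8.length); // prints "true"
    }
}
```

The Reader/Writer route works on streams of any length, while the String route is convenient when the whole text already fits in memory.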

The name native2ascii is a misnomer. Rather than converting to ASCII, it converts to ISO Latin-1, with characters outside Latin-1 embedded as Unicode escape sequences like \u020F. It can also work in reverse, converting an ISO Latin-1 file with embedded Unicode escapes to a native character set. For example, to copy the contents of the file macdata.txt ...
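To make the escape format itself concrete, here is a rough sketch of the forward conversion as described above: characters that fit in ISO Latin-1 pass through unchanged, and everything else becomes a backslash-u escape with four hex digits. This is not the tool's actual implementation. The class name EscapeSketch and the method escapeNonLatin1 are hypothetical, and the uppercase hex digits simply follow the example in the text.

```java
// Hypothetical illustration class; not how native2ascii is actually written.
class EscapeSketch {

    // Keep chars that fit in ISO Latin-1 (0x00 to 0xFF) as they are and
    // replace every other char with a six-character Unicode escape.
    static String escapeNonLatin1(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                // Backslash, the letter u, then four zero-padded hex digits.
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The middle char is U+020F, which lies outside Latin-1.
        String escaped = escapeNonLatin1("a\u020Fb");
        System.out.println(escaped.length()); // prints "8"
    }
}
```

The result is eight characters long because the single character U+020F expands to a six-character escape between the unchanged letters a and b.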
