Other Encodings
Although Unicode is the most advanced and comprehensive character set
yet designed on this planet, it has not taken the world by storm.
Compared to the vast quantities of ASCII data, there are virtually no
Unicode files on today’s computers. Although Unicode support is
growing, there will doubtless be legacy data in other encodings that
must be read for centuries to come. A lot of it is in the Unicode
subsets ASCII and ISO Latin-1, but a lot of it is also in less popular
encoding schemes like EBCDIC and MacRoman. Those only cover English
and a few Western European languages. There are multiple encodings in
use for Arabic, Turkish, Hebrew, Greek, Cyrillic, Chinese, Japanese,
Korean, and many other languages and scripts. The Reader and Writer
classes (discussed in the next chapter) allow you to read and write
data in these different character sets. The String class also has a
number of methods that convert between different encodings (though a
String object itself is always represented in Unicode). Furthermore,
the JDK includes a character mode tool based on these classes called
native2ascii that performs such conversions on existing files.
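The conversions described above can be sketched in a few lines of Java. This is a minimal illustration, not code from the book: the class name EncodingSketch and the helper names latin1ToUtf8 and transcode are made up for this example, and it assumes a Java 7 or later runtime so that StandardCharsets is available. The first helper uses the String constructor and getBytes(); the second does the same job with an InputStreamReader and an OutputStreamWriter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Hypothetical helper: decode ISO Latin-1 bytes into a String
    // (held internally as Unicode), then re-encode the text as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: the same idea with a Reader and a Writer,
    // which is the shape you would use for files too large for memory.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[1024];
            int n;
            // Copy decoded chars from the reader to the encoding writer.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer was flushed and closed by try-with-resources.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute as its last letter, encoded in ISO Latin-1.
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] viaString = latin1ToUtf8(latin1);
        byte[] viaStreams = transcode(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("Latin-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:   " + viaString.length);
        System.out.println("Same result:   "
                + java.util.Arrays.equals(viaString, viaStreams));
    }
}
```

Only the charsets in StandardCharsets are guaranteed on every Java runtime. Others, such as MacRoman or the EBCDIC variants, may or may not be installed, so a lookup with Charset.isSupported() is a sensible precaution before relying on them.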
The name native2ascii is a misnomer. Rather than converting to ASCII,
it converts to ISO Latin-1, with characters outside that set written
as Unicode escape sequences like \u020F. It can also work in reverse,
converting an ISO Latin-1 file with embedded Unicode escapes to a
native character set. For example, to copy the contents of the file
macdata.txt ...
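The escape format described above can be illustrated with a short Java sketch. This is not the tool's implementation, only a hand-written approximation of the idea: the class EscapeSketch and the methods escape and unescape are hypothetical names, and the cutoff below which characters are left untouched is passed in as a parameter rather than assumed.

```java
public class EscapeSketch {

    // Hypothetical helper: keep chars up to maxPlain as they are and
    // write every other char as a backslash, 'u', and four hex digits.
    static String escape(String text, char maxPlain) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= maxPlain) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // True if the four chars starting at index start are hex digits.
    static boolean isHex4(String text, int start) {
        if (start + 4 > text.length()) {
            return false;
        }
        for (int i = start; i < start + 4; i++) {
            if (Character.digit(text.charAt(i), 16) == -1) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical inverse: turn each escape back into a single char.
    static String unescape(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\u", i) && isHex4(text, i + 2)) {
                sb.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(text.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One Latin-1 letter (U+00E9) and one letter outside Latin-1 (U+020F).
        String original = "caf\u00E9 \u020F";
        String escaped = escape(original, (char) 0xFF);
        System.out.println(escaped.length());
        System.out.println(unescape(escaped).equals(original));
    }
}
```

With a cutoff of 0xFF, the Latin-1 letter passes through unchanged and only the second letter is escaped. The round trip is not lossless for input that already contains a literal backslash followed by u and four hex digits, which this sketch does not try to handle.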