Chapter 19. Character Sets and Unicode

We live on a planet on which many languages are spoken. I can walk out my front door in Brooklyn and hear people conversing in English, French, Creole, Hebrew, Arabic, Spanish, and languages I don’t even recognize. The Internet is even more diverse than Brooklyn. A local doctor’s office that sets up a storefront on the Web to sell vitamins may soon find itself shipping to customers whose native languages are Chinese, Gujarati, Turkish, German, Portuguese, or something else. There’s no such thing as a local business on the Internet.

However, the first computers and the first programming languages were designed mostly by programmers in countries where English was the native language. These programmers designed character sets that worked well for English text but for little else. The preeminent such set is ASCII. Because ASCII is a 7-bit character set whose code values run from 0 to 127, each ASCII character fits in a single byte, whether that byte is treated as signed or unsigned. Thus, it’s natural for ASCII-based programming languages, such as C, to equate the character data type with the byte data type. In these languages, the same operations that read and write bytes also read and write characters.
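
The one-character-per-byte property of ASCII can be checked directly in Java. The following is only a minimal sketch, not code from this chapter: the class name AsciiBytes and its two helper methods are hypothetical names chosen for illustration, and the only library calls it relies on are String.chars(), String.getBytes(Charset), and StandardCharsets.US_ASCII from the standard library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; the name and methods are not from the book.
class AsciiBytes {

    // True if every UTF-16 code unit in the string is below 128,
    // that is, if every character fits in seven bits.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    // Encodes pure-ASCII text as one byte per character.
    // Rejects anything else rather than letting the encoder
    // silently substitute a replacement byte.
    static byte[] toAsciiBytes(String text) {
        if (!isAscii(text)) {
            throw new IllegalArgumentException("Not pure ASCII: " + text);
        }
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "Hello";
        byte[] bytes = toAsciiBytes(text);
        // Each byte holds the same numeric value as the character it encodes.
        for (int i = 0; i < bytes.length; i++) {
            System.out.println(text.charAt(i) + " = " + bytes[i]);
        }
    }
}
```

Because every ASCII code is below 128, the high bit of each byte is always zero, which is why it makes no difference whether the byte is interpreted as signed or unsigned. The explicit check in toAsciiBytes is a design choice for this sketch; without it, getBytes would replace characters the charset cannot represent instead of failing.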

Unfortunately, ASCII is inadequate for almost all non-English languages. It contains no cedillas, umlauts, betas, thorns, or any of the other thousands of non-English characters used around the world. Fairly shortly after the development of ASCII there was an explosion of extended character ...

