Standard Compression Scheme for Unicode

One of the major reasons for resistance to Unicode when it first came out was the idea that text files would take up twice as much room as before to store the same amount of actual information. For languages such as Chinese and Japanese, which were already using two bytes per character, this wasn't a problem. For the Latin alphabet, however, the idea of using two bytes per character was anathema to a lot of people.
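As a rough illustration of the size difference, the following Python sketch compares byte counts for the same text in a legacy encoding and in a two-byte Unicode form. It is only a sketch under stated assumptions: ISO 8859-1 (Latin-1) stands in for the legacy one-byte encoding, Shift-JIS for a legacy two-byte encoding, and UTF-16 for the two-byte Unicode representation. The helper name `encoded_sizes` and the sample strings are hypothetical, chosen for illustration.

```python
def encoded_sizes(text: str, legacy_codec: str) -> tuple[int, int]:
    """Return (legacy_bytes, utf16_bytes) for the given text.

    "utf-16-le" is used so that no byte-order mark is counted.
    """
    legacy_bytes = len(text.encode(legacy_codec))
    utf16_bytes = len(text.encode("utf-16-le"))
    return legacy_bytes, utf16_bytes


# Latin-alphabet text: one byte per character in Latin-1 (assumed legacy
# encoding), two bytes per character in UTF-16.
latin_legacy, latin_utf16 = encoded_sizes("Unicode Demystified", "latin-1")
print(latin_legacy, latin_utf16)  # 19 38

# Japanese text: these characters already take two bytes each in Shift-JIS
# (assumed legacy encoding), so the UTF-16 size is the same.
jp_legacy, jp_utf16 = encoded_sizes("日本語", "shift_jis")
print(jp_legacy, jp_utf16)  # 6 6
```

For the Latin sample the size doubles, while for the Japanese sample it stays the same, which matches the two cases described above.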

The concern is certainly legitimate: The same document takes up twice as much space on a disk and takes twice as long to send over a communications link. A database column containing text takes up twice as much disk space. In an era of slow file downloads, for example, the idea of waiting twice as ...
