ASCII, Unicode, and the Universal Character Set

The actual characters in documents are stored as numeric codes, and today the most common code set is the American Standard Code for Information Interchange (ASCII). ASCII codes extend from 0 to 127; for example, the ASCII code for A is 65, the ASCII code for B is 66, and so on.
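The mapping between characters and codes can be checked directly. The following is a minimal sketch in Python (chosen here only for illustration); the helper name `ascii_code` is hypothetical, and it relies on the fact that Python's built-in `ord` returns the same values as ASCII for codes 0 to 127.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII code of a single character.

    Raises ValueError if the character falls outside the ASCII range 0-127.
    """
    code = ord(ch)  # numeric code of the character
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


print(ascii_code("A"))  # 65
print(ascii_code("B"))  # 66
print(chr(67))          # C (going the other way, from code to character)
```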

On the other hand, the World Wide Web is just that today—worldwide. And plenty of scripts are not handled by ASCII, including Bengali, Armenian, Hebrew, Thai, Tibetan, Japanese Katakana, Arabic, and Cyrillic.
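To see why ASCII cannot represent these scripts, the sketch below takes one sample letter from a few of them and tries to encode it as ASCII. The sample letters and the helper name `fits_in_ascii` are assumptions made for this illustration; the numeric values printed are the codes Python reports through `ord`, all of which lie above the ASCII limit of 127.

```python
# One sample letter per script, written as escapes so the file itself stays ASCII.
samples = {
    "Hebrew": "\u05d0",    # letter alef
    "Thai": "\u0e01",      # letter ko kai
    "Cyrillic": "\u042f",  # capital letter ya
    "Katakana": "\u30ab",  # letter ka
}


def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for script, ch in samples.items():
    # Print the script name, the character's numeric code, and whether ASCII can hold it.
    print(script, ord(ch), fits_in_ascii(ch))
```

Every sample reports `False`, while an ordinary letter such as `"A"` reports `True`.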

For that reason, the default character set specified for XML by W3C is Unicode, not ASCII. Unicode today defines codes from 0 to 1,114,111. In its original 16-bit design, Unicode codes are made up of 2 bytes, not 1, so they extend from 0 to 65,535 instead of just the 0 to 255 that fit in a single byte (however, to make things ...
