ASCII, Unicode, and the Universal Character System

The actual characters in documents are stored as numeric codes. The most common code set today is the American Standard Code for Information Interchange (ASCII). ASCII codes extend from 0 to 255 (to fit within a single byte); for example, the ASCII code for "A" is 65, the ASCII code for "B" is 66, and so on.

On the other hand, the World Wide Web is just that today: worldwide. Plenty of scripts are not handled by ASCII, such as scripts in Bengali, Armenian, Hebrew, Thai, Tibetan, Japanese Katakana, Arabic, Cyrillic, and other languages.

For that reason, the default character set specified for XML by W3C is Unicode, not ASCII. Unicode codes are made up of 2 bytes, not 1, so they extend from 0 ...

Get Inside XML now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.