7.1. Character Sets and Encodings

Computers don't understand letters or symbols of any kind; numbers are all they know. Every file, whether a spreadsheet, letter, or XML document, is really just a long string of binary digits inside the computer. The data is encoded, meaning that every symbol is represented by a unique number in the file. Software translates the characters you type on the keyboard into these numerical codes, and another program translates them back into human-recognizable text.
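The round trip described above can be sketched in a few lines of Python. This is only an illustration: the sample string and the choice of UTF-8 as the encoding are assumptions made for the example, not something the text prescribes.

```python
# Round trip: text -> numeric codes -> binary digits -> text.
message = "XML"  # sample text, chosen only for illustration

# Each character corresponds to a number (in Python, its Unicode code point).
codes = [ord(ch) for ch in message]

# Written to a file, the text becomes bytes; UTF-8 is assumed for this sketch.
raw = message.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in raw)

# A reading program reverses the translation to recover the text.
restored = raw.decode("utf-8")

print(codes)     # [88, 77, 76]
print(bits)      # 01011000 01001101 01001100
print(restored)  # XML
```

The decoding step only recovers the original text if the reader uses the same mapping the writer used.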

An example of this process is Morse code. To transmit text over wires, a telegraph operator breaks down the text into individual letters, numbers, and symbols. She translates each of these into its unique Morse equivalent, a series of short and long signals, and transmits the message over the wire. On the receiving end, another operator translates the code back into text and scribbles the message onto a notepad. Sending email works in a similar fashion: you type in the message with a keyboard, software translates the keystrokes into numbers, the sequence is sent through the network to its destination, and the numbers are converted back into text and displayed on the recipient's screen.
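The telegraph operator's job can be mimicked with a lookup table. The sketch below covers only a handful of letters, and the names `MORSE`, `encode`, and `decode` are hypothetical helpers invented for this illustration.

```python
# A partial Morse table: just enough letters for the demonstration.
MORSE = {
    "A": ".-", "E": ".", "L": ".-..", "M": "--",
    "O": "---", "S": "...", "T": "-", "X": "-..-",
}
# The receiving operator needs the same table, read in the other direction.
REVERSE = {signal: letter for letter, signal in MORSE.items()}


def encode(text: str) -> str:
    """Translate each letter into its Morse signal, separated by spaces."""
    return " ".join(MORSE[ch] for ch in text.upper())


def decode(signals: str) -> str:
    """Translate space-separated Morse signals back into letters."""
    return "".join(REVERSE[s] for s in signals.split())


wire = encode("xml")
print(wire)          # -..- -- .-..
print(decode(wire))  # XML
```

Note that the decoded message comes back in uppercase: this table has no separate signals for lowercase letters, so that distinction is lost in transit.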

The mapping of characters to numerical values creates a character set. The term character describes any piece of text or signal that can be represented in a single position in the character set. For example, the letter "Q" from the Latin alphabet is a single character, as is its lowercase cousin "q". ...
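As a quick check of the "Q" and "q" example, Python's built-in `ord` and `chr` expose the position of a character in Unicode, the character set Python strings use. Using Unicode here is an assumption of this sketch; the passage above does not name a particular character set.

```python
upper, lower = "Q", "q"

# Each letter occupies its own position in the character set.
upper_code = ord(upper)
lower_code = ord(lower)

print(upper_code, lower_code)  # 81 113
print(upper == lower)          # False: they are distinct characters
print(chr(upper_code))         # Q
```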
