
HTTP: The Definitive Guide

Cover of HTTP: The Definitive Guide by David Gourley... Published by O'Reilly Media, Inc.

Multilingual Character Encoding Primer

The previous section described how the HTTP Accept-Charset header and the Content-Type charset parameter carry character-encoding information between the client and the server. HTTP programmers who work extensively with international applications and content need a deeper understanding of multilingual character systems in order to read the technical specifications and implement software correctly.
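As a reminder of what that looks like in practice, here is a minimal Python sketch that reads the charset parameter out of a Content-Type header value and uses it to decode a response body. The helper name `charset_from_content_type` and the sample header and body are illustrative assumptions, not part of any HTTP library, and the iso-8859-1 fallback is likewise an assumption made for this sketch.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Hypothetical helper for illustration. The fallback value is an
    assumption of this sketch, used when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset() or default


# Sample data (made up for this example): a body encoded as iso-8859-1.
body = "café".encode("iso-8859-1")
charset = charset_from_content_type("text/html; charset=iso-8859-1")
text = body.decode(charset)
print(charset, text)
```

The standard library's `email.message.Message` is used here only because it already knows how to parse MIME-style header parameters; a real HTTP client would normally expose the parsed charset directly.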

It isn't easy to learn multilingual character systems—the terminology is complex and inconsistent, you often have to pay to read the standards documents, and you may be unfamiliar with the other languages with which you're working. This section is an overview of character systems and standards. If you are already comfortable with character encodings, or are not interested in this detail, feel free to jump ahead to Section 16.4.

Character Set Terminology

Here are eight terms about electronic character systems that you should know:


Character
An alphabetic letter, numeral, punctuation mark, ideogram (as in Chinese), symbol, or other textual "atom" of writing. The Universal Character Set (UCS) initiative, known informally as Unicode,[3] has developed a standardized set of textual names for many characters in many languages, which often are used to conveniently and uniquely name characters.[4]


Glyph
A stroke pattern or unique graphical shape that describes a character. A character may have multiple glyphs if it can be written different ways (see Figure 16-3).
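The standardized character names mentioned above can be inspected programmatically. The following is a small sketch using Python's standard `unicodedata` module; the helper name `describe` and the sample characters are assumptions chosen for illustration. Note that the code deals only with characters and their names. Which glyph is drawn for each character is decided by the font, not by the name.

```python
import unicodedata


def describe(ch):
    """Return the code point and standardized name of one character.

    Hypothetical helper for illustration; '<unnamed>' is returned for
    characters that have no name in the Unicode database.
    """
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# A Latin letter, an accented letter, and a Chinese ideogram.
for ch in ["A", "\u00e9", "\u4e2d"]:
    print(describe(ch))

# The name also works in the other direction: it uniquely identifies
# the character, so it can be used to look the character up.
assert unicodedata.lookup("LATIN CAPITAL LETTER A") == "A"
```

Because each name maps to exactly one character, names like these are a convenient way to refer to a character unambiguously in documentation, regardless of how it happens to be rendered.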
