2.2. Technical issues: characters and bytes

Even the philosophers say it: philosophy is not the only thing in life. And in the life of a Unicode user there are also issues of a strictly technical nature, such as the following: How are Unicode characters represented internally in memory? How are they stored on disk? How are they transmitted over the Internet? These are very important questions, for without memory, storage, and transmission there would be no information....

Those who have dealt with networks know that the transmission of information can be described by several layers of protocols, ranging from the lowest layer (the physical layer) to the highest (the application layer: HTTP, FTP, etc.). The same is true of Unicode: officially [347] five levels of representation of characters are distinguished. Here they are:

  1. An abstract character repertoire (or "ACR") is a set of characters—that is, a set of "descriptions of characters" in the sense used in the previous section—with no explicit indication of the position of each character in the Unicode table.

  2. A coded character set (or "CCS") is an abstract character repertoire to which we have added the "positions" or "code points" of the characters in the table. These are whole numbers between 0 and 0x10FFFF (= 1,114,111). We have not yet raised the issue of representing these code points in computers.

  3. A character encoding form (or "CEF") is a possible way to represent the code points of characters on computers. For example, to ...
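The first three levels can be made concrete with a short sketch in Python. It is only an illustration under stated assumptions: the sample character, the variable names, and the choice of UTF-8 and UTF-16 as example encodings are ours, not part of the list above. Note also that the byte strings Python returns fix a byte order, which goes slightly beyond the encoding form itself.

```python
import sys
import unicodedata

# Sample character chosen for illustration (an assumption, not from the text).
ch = "\u00e9"

# Level 1 (ACR): the character as a description, with no position attached.
name = unicodedata.name(ch)

# Level 2 (CCS): the same character with its code point, a whole number
# in the range 0 .. 0x10FFFF.
code_point = ord(ch)
highest_code_point = sys.maxunicode

# Level 3 (CEF): two possible machine representations of that one code point.
# UTF-8 and big-endian UTF-16 are used here only as familiar examples.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")

print(name)
print(f"U+{code_point:04X}")
print(utf8_bytes.hex(), utf16_bytes.hex())
print(hex(highest_code_point))
```

The point of the sketch is that one description and one code point can yield different sequences in memory, depending on the representation chosen.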
