3.1. Basic properties

3.1.1. Name

The name of a character is what we have called its description. The official list of English character names, ordered by position within the encoding, appears in the following file:

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

This file contains a large amount of data in a format that is hard for humans to read but easy for programs to parse: each line describes one character and consists of fifteen text fields separated by semicolons. Here are a few lines from this file:

   0020;SPACE;Zs;0;WS;;;;;N;;;;;
   0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
   0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
   0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
   0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;

The first two fields, numbered 0 and 1 (counting begins at 0), are the character's position, also called its "code point" and written in hexadecimal, and its name, which we called its "description" in the previous chapter. We shall see the other fields later.
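As a rough illustration of how easily a program can read this format, here is a minimal Python sketch that splits the sample lines shown above into their fifteen fields and extracts fields 0 and 1. The function name parse_record and the variable names are hypothetical, chosen only for this example, and the data is the excerpt quoted above rather than the full file.

```python
# Sample records copied from the excerpt above (not the complete file).
SAMPLE = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
"""


def parse_record(line):
    """Split one line into fields; return (code point, name, all fields).

    parse_record is a hypothetical helper written for this sketch.
    """
    fields = line.split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, found {len(fields)}")
    code_point = int(fields[0], 16)  # field 0: position, in hexadecimal
    name = fields[1]                 # field 1: the character's name
    return code_point, name, fields


records = [parse_record(line) for line in SAMPLE.splitlines() if line]

# Map each code point to its name, using only fields 0 and 1.
names = {code_point: name for code_point, name, _ in records}

for code_point, name in names.items():
    print(f"U+{code_point:04X}  {name}")
```

The same loop could in principle be run over the lines of the complete file once it has been downloaded; the remaining fields are simply kept as strings here, since they are discussed later.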

Character names are not there solely for the benefit of humans; programming languages also understand them. In Perl, for example, to obtain the Cherokee letter 'A' (Ꭰ, whose glyph resembles a Latin 'D'), we can write \N{CHEROKEE LETTER A}, which is strictly equivalent to \x{13a0}, a reference to the character's code point.
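Perl is not the only language with this feature. As a small sketch in Python, whose string literals accept the same \N{...} notation and whose standard unicodedata module maps between names and characters, the equivalence between name and code point can be checked directly. The variable names are chosen for this example only.

```python
import unicodedata

# Three ways of writing the same character.
by_name = "\N{CHEROKEE LETTER A}"                    # name escape in a literal
by_code_point = "\u13a0"                             # code point escape
looked_up = unicodedata.lookup("CHEROKEE LETTER A")  # lookup at run time

# All three refer to the same single character.
print(by_name == by_code_point == looked_up)

# The reverse direction: from a character back to its official name.
print(unicodedata.name(by_code_point))
print(f"U+{ord(by_name):04X}")
```

The run-time lookup is the useful form when the name is not known until the program runs, for instance when it comes from user input.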

3.1.2. Block and script

These properties refer to the distribution of the full set of characters according to the script to which they belong or to their functional similarity. Thus we have a block of Armenian characters (Armenian), but also a block of pictograms ...
