16.11. Reading or Writing Unicode Characters

Problem

You want to read Unicode-encoded characters from a file, database, or form; or, you want to write Unicode-encoded characters.

Solution

Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:

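// \xf6 is the ISO-8859-1 byte for ö; the escape keeps the input single-byte
// even if this script file is saved as UTF-8.
print utf8_encode("Kurt G\xf6del is swell.");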

Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:

print utf8_decode("Kurt G\xc3\xb6del is swell.");
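
As a quick check of the two conversions, the following sketch round-trips the string using explicit byte escapes, so the result does not depend on how the script file itself is saved. It uses mb_convert_encoding( ) from the mbstring extension, which is an assumption beyond this recipe: utf8_encode( ) and utf8_decode( ) are deprecated as of PHP 8.2, and mb_convert_encoding( ) can perform the same ISO-8859-1 to UTF-8 conversion. The variable names are illustrative only.

```php
<?php
// "ö" is the single byte 0xF6 in ISO-8859-1 and the two bytes 0xC3 0xB6 in UTF-8.
$latin1 = "Kurt G\xf6del is swell.";

// ISO-8859-1 to UTF-8 (assumed equivalent of utf8_encode($latin1)).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 back to ISO-8859-1 (assumed equivalent of utf8_decode($utf8)
// for characters that exist in ISO-8859-1).
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Show the UTF-8 bytes as hex and confirm the round trip is lossless.
print bin2hex($utf8) . "\n";
print ($back === $latin1 ? "round trip ok" : "round trip failed") . "\n";
```

Printing the hex form makes the extra byte visible: the sequence c3b6 appears where the single byte f6 was in the input.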

Discussion

A single byte can hold 256 different values. ASCII standardizes only the characters between codes 0 and 127: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most Western European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.

This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.

Table 16-2. UTF-8 byte representation

Character code range	Bytes used	Byte 1	Byte 2	Byte 3	Byte 4
`0x00000000 - 0x0000007F`	1	`0xxxxxxx`
`0x00000080 - ...`
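
To see the space cost in practice, here is a minimal sketch that compares byte counts and character counts for a few sample characters. The samples are illustrative choices, written as explicit UTF-8 byte escapes, and the use of mb_strlen( ) assumes the mbstring extension is available.

```php
<?php
// Sample characters as explicit UTF-8 byte sequences (illustrative choices).
$samples = [
    'A (U+0041)' => "A",
    'ö (U+00F6)' => "\xc3\xb6",
    '€ (U+20AC)' => "\xe2\x82\xac",
    'U+1F600'    => "\xf0\x9f\x98\x80",
];

// strlen() counts bytes, not characters.
$lengths = array_map('strlen', $samples);

foreach ($samples as $label => $char) {
    // mb_strlen() with 'UTF-8' counts characters, so it reports 1 for each sample.
    printf("%s: %d byte(s), %d character(s)\n",
        $label, $lengths[$label], mb_strlen($char, 'UTF-8'));
}
```

Each sample is a single character, but its stored length grows from one byte to four as the character code increases.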
