Handling Non-English Characters

ASCII itself defines only 128 characters, and the single-byte character sets that extend it top out at 256, because a byte (8 bits, or 8 ones and zeros) can hold only the values 0 to 255. Languages such as Chinese, Korean, and Japanese use thousands of distinct characters, far more than 256, so a character can need more than one byte of space: a multibyte character. The multibyte character implementation in PHP is capable of working with Unicode-based encodings, such as UTF-8; however, at this time, Unicode support in PHP is very weak. Full Unicode support is currently one of the key goals for future releases of PHP.
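To make the byte counts concrete, here is a minimal sketch that shows how many bytes a few characters occupy in UTF-8. The helper name describe_bytes() is hypothetical, chosen only for this illustration, and the characters are written as byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper for illustration: report how many bytes a string
// occupies and what those bytes are. strlen() counts bytes, not characters.
function describe_bytes($char)
{
    return sprintf("%d byte(s), hex %s", strlen($char), bin2hex($char));
}

// Each sample is spelled out as its raw UTF-8 bytes.
$samples = array(
    'Latin A'       => "A",             // U+0041
    'e with acute'  => "\xC3\xA9",      // U+00E9
    'Japanese "day"' => "\xE6\x97\xA5", // U+65E5
);

foreach ($samples as $label => $char) {
    echo $label, ': ', describe_bytes($char), "\n";
}
// Latin A: 1 byte(s), hex 41
// e with acute: 2 byte(s), hex c3a9
// Japanese "day": 3 byte(s), hex e697a5
```

Only the plain ASCII letter fits in a single byte; the other two are multibyte characters.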

Dealing with these complex characters is slightly different from working with single-byte characters, because functions like substr() and strtoupper() assume precisely one byte per character and can corrupt a multibyte string. Instead, you should use the multibyte equivalents of these functions, such as mb_strtoupper() instead of strtoupper(), mb_ereg() rather than ereg(), and mb_strlen() rather than strlen(). The parameters required for these functions are the same as their originals, except that most accept an optional extra parameter to force a specific encoding.
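The difference between the byte-oriented functions and their mb_ counterparts can be sketched as follows. This assumes the mbstring extension is loaded and that the text is UTF-8; the variable names are illustrative only. The optional extra parameter mentioned above is passed here as 'UTF-8'.

```php
<?php
// Assumption: the mbstring extension is available and the data is UTF-8.
// The French word for "summer", written as raw UTF-8 bytes:
// e-acute (2 bytes) + "t" (1 byte) + e-acute (2 bytes).
$word = "\xC3\xA9t\xC3\xA9";

// Byte-oriented functions see five one-byte "characters".
$byteLength = strlen($word);              // 5
$brokenFirst = substr($word, 0, 1);       // "\xC3": half of a character

// Multibyte equivalents take the encoding as an optional last parameter.
$charLength = mb_strlen($word, 'UTF-8');          // 3
$firstChar  = mb_substr($word, 0, 1, 'UTF-8');    // the whole e-acute
$upper      = mb_strtoupper($word, 'UTF-8');      // capital E-acute, T, capital E-acute

// The byte-sliced result is no longer valid UTF-8; the mb_ result is.
$brokenIsValid = mb_check_encoding($brokenFirst, 'UTF-8'); // false
$firstIsValid  = mb_check_encoding($firstChar, 'UTF-8');   // true

echo "strlen: $byteLength, mb_strlen: $charLength\n";
echo "mb_strtoupper: $upper\n";
```

Here substr() cuts the first character in half, which is the kind of corruption described above, while mb_substr() returns the complete two-byte character.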

If there is an existing script that you'd like to multibyte-enable, there's a special php.ini setting you can change: mbstring.func_overload. By default, this is set to 0, which means functions behave as you ...
