Chapter 4. i18n, L10n, and Unicode

Internationalization, localization, and Unicode are all hot topics in the field of modern web application development. If you build and launch an application without support for multiple languages, you’re going to be missing out on a huge portion of your possible user base. Current research suggests that there are about 510 million English-speaking people in the world. If your application only caters to English speakers, you’ve immediately blocked 92 percent of your potential global audience. These numbers are actually wildly inaccurate and generally used as a scare tactic; you have to consider how many of the world’s six billion or so people are online to begin with. But even once we factor this in, we are still left with 64 percent of online users (around 680 million people) who don’t speak English (these statistics come from the Global Reach web site: http://global-reach.biz/). That’s still a huge number of potential users you’re blocking from using your application.

Addressing this problem has historically been a huge deal. Developers would need advanced knowledge of character sets and text processing, language-dependent data would need to be stored separately, and data from one group of users could not be shared with another. But in a world where the Internet is becoming more globally ubiquitous, these problems needed solving. The solutions that were finally reached cut out a lot of the hard work for developers—it’s now almost trivially easy to create a multilanguage application, with only a few simple bits of knowledge.

This chapter will get you quickly up to speed with the issues involved with internationalization and localization, and suggest simple ways to solve them. We’ll then look at Unicode in detail, explaining what it is, how it works, and how you can implement full Unicode applications quickly and easily. We’ll touch on the key areas of data manipulation in web applications where Unicode has a role to play, and identify the potential pitfalls associated with them.

Internationalization and Localization

Internationalization and localization are buzzwords in the web applications field—partly because they’re nice long words you can dazzle people with, and partly because they’re becoming more important in today’s world. Internationalization and localization are often talked about as a pair, but they mean very distinct things, and it’s important to understand the difference:

  • Internationalization is adding to an application the ability to input, process, and output international text.

  • Localization is the process of making a customized application available to a specific locale.

Internationalization is often shortened to i18n (the “18” representing the 18 removed letters) and localization to L10n (for the same reason, although an uppercase “L” is used for visual clarity), and we’ll refer to them as such from this point on, if only to save ink. As with most hot issues, there are a number of other terms people have associated with i18n and L10n, which are worth pointing out if only to save possible confusion later on: globalization (g11n) refers to both i18n and L10n collectively, while personalization (p13n) and reach (r3h) refer solely to L10n.

Internationalization in Web Applications

Way back in the distant past (sometimes referred to as the 90s), having internationalization support meant that your application could input, store, and output data in a number of different character sets and encodings. Your English-speaking audience would converse with you in Latin-1, your Russian speakers in KOI8-R, your Japanese users in Shift_JIS, and so on. And all was well, unless you wanted to present data from two different user sets on the same page. Each of these character sets and encodings allowed the representation and encoding of a defined set of characters, usually somewhere between 100 and 250. Sometimes some of these characters would overlap, so you could store and display the character Ю (Cyrillic capital letter Yu) in both KOI8-Ukrainian as the byte 0xE0 and Apple Cyrillic as the byte 0x9E. But more often than not, characters from one character set weren’t displayable in another. You can represent the character ね (Hiragana letter Ne) in IBM-971-Korean, and the character Ų (Latin capital letter U with Ogonek) in IBM-914-Baltic, but not vice versa.

And as lovely as these different character sets were, there were additional problems beyond representing each other’s characters. Every piece of stored data needed to be tagged with the character set it was stored as. Any operations on a string had to respect the character set; you can’t perform a byte-based substring operation on a Shift_JIS string. When you came to output a page, the HTML and all the content had to be output using the correct character set.

At some point, somebody said enough was enough and he wouldn’t stand for this sort of thing any more. The solution seemed obvious—a single character set and encoding to represent all the characters you could ever wish to store and display. You wouldn’t have to tag strings with their character sets, you wouldn’t need many different string functions, and you could output pages in just one format. That sounded like a neat idea.

And so in 1991, Unicode was born. With a character set that spanned the characters of every known written language (including all the diacritics and symbols from all the existing character sets) and a set of fancy new encodings, it was sure to revolutionize the world. And after wallowing in relative obscurity for about 10 years, it finally did.

In this chapter, we’re going to deal solely with the Unicode character set and the UTF-8 encoding for internationalization. It’s true that you could go about it a different way, using either another Unicode encoding or going down the multiple character set path, but for applications storing primarily English data, UTF-8 usually makes the most sense. For applications storing a large amount of CJKV data—or any data with many high numbered code points—UTF-16 can be a sensible choice. Aside from the variable length codepoint representations, the rest of this chapter applies equally well to UTF-16 as it does to UTF-8.

Localization in Web Applications

The localization of web applications is quite different from internationalization, though the latter is a prerequisite for the former. When we talk about localizing a web application, we mean presenting the user with a different interface (usually just textually) based on their preferred locale.

There are a few methods of localizing your site, none of which are very easy. This chapter deals primarily with internationalization, so we won’t go into a lot of localization detail. We’ll look briefly at three approaches you can take toward localization before we get back to internationalization basics.

String substitution

At the lowest level, you can use a library like GNU’s gettext (http://www.gnu.org/software/gettext/), which allows you to substitute languages at the string level. For instance, take this simple piece of PHP code, which should output a greeting:

    printf("Hello %s!", $username);

Using a gettext wrapper function, we can substitute any language in there:

    printf(_("Hello %s!"), $username);

The only change is to call the gettext function, which is called _(), passing along the English string. The gettext configuration files then contain a mapping of phrases into different languages, as shown in Examples 4-1 and 4-2.

Example 4-1. my_app.fr.po

    msgid "Hello %s!"
    msgstr "Bonjour %s!"

Example 4-2. my_app.ja.po

    msgid "Hello %s!"
    msgstr "こんにちは %s!"

At runtime, gettext returns the correct string, depending on the user’s desired locale, which your application then outputs.
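The lookup that gettext performs can be sketched in a few lines. This is a minimal Python illustration, not gettext itself: the `CATALOGS` dictionary and the `translate()` and `greet()` helpers are hypothetical stand-ins for the compiled catalog files and the `_()` wrapper.

```python
# Hypothetical in-memory catalog standing in for the .po file contents;
# a real deployment would load compiled gettext catalogs instead.
CATALOGS = {
    "fr": {"Hello %s!": "Bonjour %s!"},
}


def translate(msgid, locale):
    """Return the translation of msgid, falling back to msgid itself."""
    return CATALOGS.get(locale, {}).get(msgid, msgid)


def greet(username, locale):
    # The English string doubles as the lookup key, as it does with _().
    return translate("Hello %s!", locale) % username


print(greet("Alice", "fr"))  # Bonjour Alice!
print(greet("Alice", "de"))  # Hello Alice! (no catalog, so the key is used)
```

Falling back to the English key when no translation exists is a design choice in this sketch; it keeps a half-translated locale usable at the cost of mixing languages on a page.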

The problem with string substitution is that any time you change any of your application’s visual structure (changing a flow, adding an explanation, etc.), you need to immediately update every translation. This is all very well if you have a team of fulltime translators on staff, but needing full translations before deploying any changes doesn’t sit well with rapid web application development.

Multiple template sets

In an application where markup is entirely separated from any page logic, the templates act as processing-free documents (aside from conditionals and iterations). If you create multiple sets of templates, one in each locale you wish to support, then development in one set doesn’t have to be synchronous with development in another. You can make multiple ongoing changes to your master locale, and then periodically batch those changes over to your other locales.

This approach does have its problems, though. Although the changes in markup can happen independently of any changes in page logic, any functional changes in page logic need to be reflected in the markup and copy. For instance, if a page in your application showed the latest five news items but was being changed to show a random selection of stories instead, then the copy saying so would have to be updated for all languages at once. The alternative is to not change the functionality for the different locales by supporting multiple functionalities simultaneously in the page logic and allowing it to be selected by the templates. This starts to get very complicated, putting multiple competing flows into the page logic layer.

Multiple frontends

Instead of smushing multiple logical flows into the page logic layer, you can instead create multiple page logic layers (including the markup and presentation layers above them). This effectively creates multiple sites on top of a single common storage and business logic layer.

By building your application’s architecture around the layered model and exposing the business logic functions via an API (skip ahead to Chapter 11 for more information), you can initially support a single locale and then build other locale frontends later at your own pace. An internationalized business logic and storage layer then allows the sharing of data between locales—the data added via the Japanese locale application frontend can be seen in the Spanish locale application frontend.

For more general information about i18n and L10n for the Web, you should visit the W3C’s i18n portal at http://www.w3.org/International/.

Unicode in a Nutshell

When talking about Unicode, many people have preconceived ideas of what it is and what it means for software development. We’re going to try to dispel these myths, so to be safe we’ll start from the basic principles with a clean slate. It’s hard to get much more basic than Figure 4-1.

Figure 4-1. The letter “a”

So what is this? It’s a lowercase Latin character “a.” Well, really it’s a pattern of ink on paper (or pixels on the screen, depending on your medium) representing an agreed on letter shape. We’ll refer to this shape as a glyph. This is only one of many glyphs representing the lowercase Latin character “a.” Figure 4-2 is also a glyph.

Figure 4-2. A different glyph

It’s a different shape with different curves but still represents the same character. Okay, simple so far. Each character has multiple glyph representations. At a computer level, we call these glyph sets fonts. When we need to store a sequence of these characters digitally, we usually store only the characters, not the glyphs themselves. We can also store information telling us which font to use to render the characters into glyphs, but the core information we’re storing is still a sequence of characters.

So how do we make the leap from a lowercase Latin character “a” to the binary sequence 01100001? We need two sets of mappings (although they’re often grouped together into one set). The first, a character set, tells us how to take abstract characters and turn them into numbers. The second, an encoding, tells us how to take these numbers (or code points) and represent those using bits and bytes. So let’s revisit; what is Figure 4-3?

Figure 4-3. The question of the letter “a” again

It’s a glyph representing the lowercase Latin character “a.” The ASCII character set tells us that the lowercase Latin character “a” has a code point of 0x61 (97 in decimal). The ASCII encoding then tells us that we can represent code point 0x61 by using the single byte 0x61.

Unicode was designed to be very compatible with ASCII, and so all ASCII code points are the same in Unicode. That is to say, the Latin lowercase “a” in Unicode also has the code point 0x61. In the UTF-8 encoding, code point 0x61 is represented by the single byte 0x61, just as with ASCII. In the UTF-16 encoding, code point 0x61 is represented as the pair of bytes 0x00 and 0x61. We’ll take a look at some of the different Unicode encodings shortly.

So why do we want to use Unicode, since it looks so similar to ASCII? This is most easily answered with an example of something ASCII can’t do, which is represent the character shown in Figure 4-4.

Figure 4-4. A character well outside of ASCII

This is the Bengali Vocalic RR character, Unicode code point U+09E0. In the UTF-8 encoding scheme, this code point maps to the bytes 0xE0 0xA7 0xA0. In the UTF-16 encoding, the same code point would be encoded using the pair of bytes 0x09 0xE0.
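The two mappings can be checked directly. The sketch below, in Python, prints the code point and the encoded bytes for both characters discussed so far; `hex_bytes()` is a hypothetical helper for display, and the UTF-16 output assumes big endian byte order.

```python
def hex_bytes(data):
    """Format a byte string as space-separated hexadecimal pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


for char in ("a", "\u09e0"):
    code_point = ord(char)               # character set: character -> number
    utf8 = char.encode("utf-8")          # encoding: number -> bytes
    utf16 = char.encode("utf-16-be")     # big endian, no byte order mark
    print(f"U+{code_point:04X}  UTF-8: {hex_bytes(utf8)}  UTF-16: {hex_bytes(utf16)}")

# U+0061  UTF-8: 61  UTF-16: 00 61
# U+09E0  UTF-8: E0 A7 A0  UTF-16: 09 E0
```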

Unicode Encodings

There are a number of encodings defined for storing Unicode data, both fixed and variable width. A fixed-width encoding is one in which every code point is represented by a fixed number of bytes, while a variable-length encoding is one in which different characters can be represented by different numbers of bytes. UTF-32 and UCS2 are fixed width, UTF-7 and UTF-8 are variable width, and UTF-16 is a variable width encoding that usually looks like a fixed width encoding.

UTF-32 (and UCS4, which is almost the same thing) encodes each code point using 4 bytes, so it can encode any code point from U+0000 to U+FFFFFFFF. This is usually overkill, given that there aren’t nearly that many code points defined. UCS2 encodes each code point using 2 bytes, so it can encode any code point from U+0000 to U+FFFF. UTF-16 also uses 2 bytes for most characters, but the code points from U+D800 to U+DFFF are used in what’s called surrogate pairs, which allows UTF-16 to encode the code points U+0000 to U+10FFFF.
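The surrogate pair arithmetic is short enough to write out. The following Python sketch splits a code point above U+FFFF into its two 16-bit units and cross-checks the result against Python's own UTF-16 encoder; `to_surrogates()` is a hypothetical helper name, and U+1D11E (a musical symbol) is just a convenient test value.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E

# Cross-check against the built-in encoder (big endian, no BOM).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(chr(0x1D11E).encode("utf-16-be") == expected)  # True
```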

UTF-8 uses between 1 and 4 bytes for each code point (or between 1 and 6 for the original ISO 10646 version, which we’ll discuss below) and can encode code points U+0000 to U+10FFFF (or U+0000 to U+7FFFFFFF for the ISO 10646 version). We’ll discuss UTF-8 in more detail in a moment. UTF-7 is a 7-bit safe encoding, which allows Unicode text to appear in emails without the need for base64 or quoted-printable encoding. UTF-7 never really caught on and isn’t widely used, since it lacks UTF-8’s ASCII transparency, and quoted-printable is more than adequate for sending UTF-8 by email.

So what’s this ISO 10646 thing we’ve been talking about? The concept of Unicode was obviously such a good idea that two groups started working on it at the same time—the Unicode Consortium and the International Organization for Standardization (ISO). Before release, the standards were combined but still retained separate names. They are kept mostly in sync as time goes on, but have different documentation and diverge a little when it comes to encodings. For the sake of clarity, we’ll treat them as the same standard.

What’s important to notice here is that while we have multiple encodings (which map code points to bytes), we only have a single character set (which maps characters to code points). This is central to the idea of Unicode—a single set of code points that all applications can use, with a set of multiple encodings to allow applications to store data in whatever way they see fit. All Unicode encodings are lossless, so we can always convert from one to another without losing any information (ignoring the fact that UTF-16 can’t represent many private-use code points that UTF-32 can). With Unicode, code point U+09E0 always means the Bengali Vocalic RR character, regardless of the encoding used to store it.

Code Points and Characters, Glyphs and Graphemes

So far we’ve painted a fairly simple picture: characters are symbols that have an agreed meaning and are represented by a code point. A code point can be represented by one or more bytes using an encoding. If only it were so simple.

A character doesn’t necessarily represent what a human thinks of as a character. For instance, the Latin letter “a” with a tilde can be represented by either the code point U+00E3 (Latin small letter “a” with tilde) or by composing it from two code points, U+0061 (Latin small letter “a”) and U+0303 (combining tilde). This composed form is referred to as a grapheme. A grapheme can be composed of one or more characters: a base character and zero or more combining characters.

The situation is further complicated by ligatures, in which a single glyph can be constructed from two or more characters. These characters are then represented by a single code point, or by two regular code points. For instance, the ligature fi (“f” followed by “i”) can be represented by U+0066 (Latin small letter “f”) and U+0069 (Latin small letter “i”), or by U+FB01 (Latin small ligature “fi”).
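Both situations can be observed with the normalization forms that Unicode defines. This is a minimal Python sketch using the standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e3"   # Latin small letter a with tilde
combined = "a\u0303"     # Latin small letter a + combining tilde

# The two spellings render alike but are different code point sequences.
print(precomposed == combined)                                # False

# NFC composes where it can; NFD decomposes. Either gives a common form.
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combined)  # True

# Ligatures only fold under the "compatibility" forms (NFKC and NFKD).
ligature = "\ufb01"      # Latin small ligature fi
print(unicodedata.normalize("NFC", ligature) == ligature)     # True
print(unicodedata.normalize("NFKC", ligature))                # fi
```

Comparing strings only after passing both through the same normalization form is the usual way to treat these spellings as equal.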

So what does this mean at a practical level? It means that given a stream of code points, you can’t arbitrarily cut them (such as with a substring function) and get the expected sequence of graphemes. It also means that there is more than one way to represent a single grapheme, using different sequences of ligatures and combining characters to create identical graphemes (although the Unicode normalization rules allow for functional decomposed grapheme comparison). To find the number of characters (the length) in a string encoded using UTF-8, we can’t count the bytes. We can’t even count the code points, since some code points may be combining characters that don’t add an extra grapheme. You need to understand both where the code points lie in a stream of bytes and what the character class of each code point is. The character classes defined in Unicode, known as general categories, are shown in Table 4-1.

Table 4-1. Unicode general categories

Code  Description
Lu    Letter, uppercase
Ll    Letter, lowercase
Lt    Letter, titlecase
Lm    Letter, modifier
Lo    Letter, other
Mn    Mark, nonspacing
Mc    Mark, spacing combining
Me    Mark, enclosing
Nd    Number, decimal digit
Nl    Number, letter
No    Number, other
Zs    Separator, space
Zl    Separator, line
Zp    Separator, paragraph
Cc    Other, control
Cf    Other, format
Cs    Other, surrogate
Co    Other, private use
Cn    Other, not assigned (including noncharacters)
Pc    Punctuation, connector
Pd    Punctuation, dash
Ps    Punctuation, open
Pe    Punctuation, close
Pi    Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf    Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po    Punctuation, other
Sm    Symbol, math
Sc    Symbol, currency
Sk    Symbol, modifier
So    Symbol, other

In fact, Unicode defines more than just a general category for each character. The standard also defines a name, general characteristics (alphabetic, ideographic, etc.), shaping information (bidi, mirroring, etc.), casing (upper, lower, etc.), numeric values, normalization properties, boundaries, and a whole slew of other useful information. This will mostly not concern us, and we won’t notice when we’re using this information since it happens magically in the background, but it’s worth noting that the properties of the code points, in addition to the code points themselves, are a core part of the Unicode standard.

These properties and characteristics, together with the normalization rules, are all available from the Unicode web site (http://www.unicode.org/) in both human- and computer-readable formats.
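To see why bytes, code points, and graphemes give three different lengths, the general categories can be queried from code. The Python sketch below uses the standard `unicodedata` module; `count_graphemes()` is a hypothetical helper that only skips combining marks, which is a deliberate simplification of the full Unicode segmentation rules.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")


def count_graphemes(text):
    """Approximate grapheme count: code points that are not combining marks.

    A simplification; complete grapheme segmentation follows more rules.
    """
    return sum(1 for ch in text
               if unicodedata.category(ch) not in MARK_CATEGORIES)


text = "a\u0303b"                    # "a" + combining tilde, then "b"
print(len(text.encode("utf-8")))     # 4 bytes
print(len(text))                     # 3 code points
print(count_graphemes(text))         # 2 graphemes

# Each code point carries a general category and a name.
print(unicodedata.category("a"), unicodedata.category("\u0303"))  # Ll Mn
print(unicodedata.name("\u09e0"))
```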

Byte Order Mark

A byte order mark (BOM) is a sequence of bytes at the beginning of a Unicode stream used to designate the encoding type. Because systems can be big endian or little endian, multibyte Unicode encodings such as UTF-16 can store the bytes that constitute a single code point in either order (highest or lowest byte first). BOMs work by putting the code point U+FEFF (reserved for this purpose) at the start of the file. The actual output in bytes depends on the encoding used, so after reading the first four bytes of a Unicode stream, you can figure out the encoding used (Table 4-2).

Table 4-2. BOMs for common Unicode encodings

Encoding              Byte order mark
UTF-16 big endian     FE FF
UTF-16 little endian  FF FE
UTF-32 big endian     00 00 FE FF
UTF-32 little endian  FF FE 00 00
UTF-8                 EF BB BF

Most other Unicode encodings have their own BOMs (including SCSU, UTF-7, and UTF-EBCDIC) that all represent the code point U+FEFF. BOMs should be avoided at the start of served HTML and XML documents because they’ll mess up some browsers. You also want to avoid putting a BOM at the start of your PHP templates or source code files, even though they might be UTF-8 encoded, because PHP doesn’t strip it: the BOM bytes are treated as ordinary page output and are sent ahead of everything else in the file.
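Sniffing the encoding from the first four bytes can be sketched as follows. This Python example uses the BOM constants from the standard `codecs` module; `sniff_bom()` and the `BOMS` list are hypothetical names. The order of the checks matters, because the UTF-32 little endian mark begins with the same two bytes as the UTF-16 little endian mark.

```python
import codecs

# Longest marks first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xef\xbb\xbfhello"))   # ('UTF-8', 3)
print(sniff_bom(b"\xff\xfeh\x00"))       # ('UTF-16 little endian', 2)
print(sniff_bom(b"hello"))               # (None, 0)
```

The check is a heuristic: a UTF-16 little endian stream whose first character after the BOM is U+0000 would be reported as UTF-32 little endian, so a BOM is best treated as a hint rather than proof.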

For more specific information about the Unicode standard, you should visit the Unicode web site by the Unicode Consortium at http://www.unicode.org/ or buy the Unicode book The Unicode Standard 4.0 (Addison-Wesley) (which is a lot of fun, as it contains all 98,000 of the current Unicode code points), which you can order from http://www.unicode.org/book/bookform.html.

The UTF-8 Encoding

UTF-8, which stands for Unicode Transformation Format 8-bit, is the encoding favored by most web application developers. UTF-8 is a variable-length encoding, optimized for compact storage of Latin-based characters. For those characters, it saves space over larger fixed-width encodings (such as UTF-16), and also provides support for encoding a huge range of code points. UTF-8 is completely compatible with ASCII (also known as ISO standard 646). Since ASCII only defines encodings for the code points 0 through 127 (using the first 7 bits of the byte), UTF-8 keeps all those encodings as is, and uses the high bit for higher code points.

UTF-8 works by encoding the length of the code point’s representation in bytes into the first byte, and then using subsequent bytes to add to the number of representable bits. Each byte in a UTF-8 character encoding sequence contributes between 0 and 7 bits to the final code point, and works like a long binary number from left to right. The bits that make up the binary representation of each code point are based on the bit masks shown in Table 4-3. The Unicode standard limits UTF-8 to the 1- to 4-byte forms; the longer rows show how the same pattern extends beyond that range.

Table 4-3. UTF-8 byte layout

Bytes  Bits  Representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
5      26    111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
6      31    1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
7      36    11111110 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
8      42    11111111 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

This means that for the code point U+09E0 (my favorite—the Bengali vocalic RR) we need to use 3 bytes, since we need to represent 12 bits of data (09E0 in hexadecimal is 100111100000 in binary). We combine the bits of the code point with the bit mask and get 11100000 10100111 10100000 or 0xE0 0xA7 0xA0 (which you might recognize from the previous example).
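The bit shuffling described above can be written out by hand. The following Python sketch covers only the 1- to 4-byte layouts; `utf8_encode()` is a hypothetical helper written for illustration, and in practice the built-in encoder should be used instead.

```python
def utf8_encode(code_point):
    """Encode one code point using the 1- to 4-byte layouts.

    Illustration only: unlike a real encoder, this does not reject the
    surrogate range U+D800 to U+DFFF.
    """
    if code_point < 0:
        raise ValueError("code point out of range")
    if code_point < 0x80:                 # 7 bits fit in a single byte
        return bytes([code_point])
    if code_point < 0x800:                # up to 11 bits
        lead, extra = 0xC0, 1
    elif code_point < 0x10000:            # up to 16 bits
        lead, extra = 0xE0, 2
    elif code_point < 0x110000:           # up to 21 bits, capped at U+10FFFF
        lead, extra = 0xF0, 3
    else:
        raise ValueError("code point out of range")
    # The first byte holds the length prefix plus the highest bits.
    out = [lead | (code_point >> (6 * extra))]
    # Each continuation byte is 10bbbbbb, carrying the next 6 bits.
    for shift in range(6 * (extra - 1), -1, -6):
        out.append(0x80 | ((code_point >> shift) & 0x3F))
    return bytes(out)


encoded = utf8_encode(0x09E0)
print(" ".join(f"{b:02X}" for b in encoded))        # E0 A7 A0
print(encoded == "\u09e0".encode("utf-8"))          # True
```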

One of the nice aspects of the UTF-8 design is that since it encodes as a stream of bytes rather than a set of code points as WORDs or DWORDs, it ignores the endian-ness of the underlying machine. This means that you can swap a UTF-8 stream between a little endian and a big endian machine without having to do any byte reordering or adding a BOM. You can completely ignore the underlying architecture.

Another handy feature of the UTF-8 encoding is that, because it stores the bits of the actual code point from left to right, performing a binary sort of the raw bytes lists strings in code point order. While this isn’t as good as using locale-based sorting rules, it’s a great way of doing very cheap ordering: the underlying system doesn’t need to understand UTF-8, just how to sort raw bytes.
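The ordering property is easy to check. This Python sketch sorts a handful of arbitrary sample strings three ways; Python compares `str` values by code point, so that serves as the reference order. UTF-16 is included for contrast, since surrogate pairs break the property there.

```python
# Sample values chosen to span 1-, 2-, 3-, and 4-byte UTF-8 sequences.
words = ["\U00010000", "z", "\uff21", "\u00e3", "a"]

by_code_point = sorted(words)   # str comparison is by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda w: w.encode("utf-16-be"))

print(by_utf8_bytes == by_code_point)    # True
print(by_utf16_bytes == by_code_point)   # False: D8 00 sorts before FF 21
```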

UTF-8 Web Applications

When we talk about making an application use UTF-8, what do we mean? It means a few things, all of which are fairly simple but need to be borne in mind throughout your development.

Handling Output

We want all of our outputted pages to be served using UTF-8. To do this, we need to create our markup templates using an editor that is Unicode aware. When we go to save our files, we ask for them to be saved in UTF-8. For the most part, if you were previously using Latin-1 (more officially called ISO-8859-1), then nothing much will change. In fact, nothing at all will change unless you were using some of the higher accented characters. With your templates encoded into UTF-8, all that’s left is to tell the browser how the pages that you’re serving are encoded. You can do this using the content-type header’s charset property:

Content-Type: text/html; charset=utf-8

If you haven’t yet noticed, charset is a bizarre name to choose for this property—it represents both character set and encoding, although mostly encoding. So how do we output this header with our pages? There are a few ways, and a combination of some or all will work well for most applications.

Sending out a regular HTTP header can be done via your application’s code or through your web server configuration. If you’re using Apache, then you can add the AddCharset directive to either your main httpd.conf file or a specific .htaccess file to set the charset header for all documents with the given extension:

AddCharset UTF-8 .php

In PHP, you can output HTTP headers using the simple header() function. To output the specific UTF-8 header, use the following code:

header("Content-Type: text/html; charset=utf-8");

The small downside to this approach is that you also need to explicitly output the main content-type (text/html in our example), rather than letting the web server automatically determine the type to send based on the browser’s user agent—this can matter when choosing whether to send a content-type of text/html or application/xhtml+xml (since the latter is technically correct but causes Netscape 4 and some versions of Internet Explorer 6 to prompt you to download the page).

In addition to sending out the header as part of the regular HTTP request, you can include a copy of the header in the HTML body by using the meta tag. This can be easily added to your pages by placing the following HTML into the head tag in your templates:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The advantage of using a meta tag over the normal header is that should anybody save the page, which would save only the response body and not headers, then the encoding would still be present. It’s still important to send a header and not just use the meta tag for a couple of important reasons. First, your web server might already be sending an incorrect encoding, which would override the http-equiv version; you’d need to either suppress this or replace it with the correct header. Second, most browsers will have to start re-parsing the document after reaching the meta tag, since they may have already parsed text assuming the wrong encoding. This can create a delay in page rendering or, depending on the user’s browser, be ignored altogether. It hopefully goes without saying that the encoding in your HTTP header should match that in your meta tag; otherwise, the final rendering of the page will be a little unpredictable.
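The same pairing of header, meta tag, and encoded body applies outside PHP. As a rough illustration, here is a minimal Python sketch written as a WSGI application; the `app` and `PAGE` names are hypothetical, and the point is only that all three places agree on UTF-8.

```python
# Hypothetical page: the meta tag repeats what the HTTP header declares.
PAGE = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    "</head><body>\u09e0</body></html>"
)


def app(environ, start_response):
    """Minimal WSGI application that declares and sends UTF-8."""
    body = PAGE.encode("utf-8")   # the bytes on the wire match the header
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Note that Content-Length is computed from the encoded bytes, not from the number of characters in the string.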

To serve documents other than HTML as UTF-8, the same rules apply. For XML documents and feeds, you can again use an HTTP header, with a different main content-type:

header("Content-Type: text/xml; charset=utf-8");

Unlike HTML, XML has no way to include arbitrary HTTP headers in documents. Luckily, XML has direct support for encodings (appropriately named this time) as part of the XML preamble. To specify your XML document as UTF-8, you simply need to indicate it as such in the preamble:

<?xml version="1.0" encoding="utf-8"?>

Handling Input

Input sent back to your application via form fields will automatically be sent using the same character set and encoding as the referring page was sent out in. That is to say, if all of your pages are UTF-8 encoded, then all of your input will also be UTF-8 encoded. Great!

Of course, there are some caveats to this wonderful utopia of uniform input. If somebody creates a form on another site that submits data to a URL belonging to your application, then the input will be encoded using the character set of the form from which the data originated. Very old browsers may always send data in a particular encoding, regardless of the one you asked for. Users might build applications that post data to your application accidentally using the wrong encoding. Some users might create applications that purposefully post data in an unexpected encoding.

All of these input vectors result in the same outcome—all incoming data has to be filtered before you can safely use it. We’ll talk about that in a lot more detail in the next chapter.
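As a small preview of that filtering, the most basic check is whether the incoming bytes decode as UTF-8 at all. This Python sketch relies on the built-in strict decoder; `decode_utf8_input()` is a hypothetical helper name, and rejecting bad input outright (rather than repairing it) is just one possible policy.

```python
def decode_utf8_input(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_utf8_input("\u00e3".encode("utf-8")))     # ã
print(decode_utf8_input("\u00e3".encode("latin-1")))   # None: a lone 0xE3 byte
print(decode_utf8_input(b"\xc0\xaf"))                  # None: overlong form
```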

Using UTF-8 with PHP

One of the side effects of UTF-8 being a byte-oriented encoding is that so long as you don’t want to mess with the contents of a string, you can pass it around blindly using any system that is binary safe (by binary safe, we mean that we can store any byte values in a “string” and will always get exactly the same bytes back).

This means that PHP 4 and 5 can easily support a Unicode application without any character set or encoding support built into the language. If all we do is receive data encoded using UTF-8, store it, and then blindly output it, we never have to do anything more than copy a block of bytes around.

But there are some operations that you might need to perform that are impossible without some kind of Unicode support. For instance, you can’t perform a regular substr() (substring) operation. substr() is a byte-wise operation, and you can’t safely cut a UTF-8-encoded string at arbitrary byte boundaries. If you, for instance, cut off the first 3 bytes of a UTF-8-encoded string, that cut might come down in the middle of a character sequence, and you’ll be left with an incomplete character.

If you’re tempted at this point to move to a fixed-width encoding such as UCS2, it’s worth noting that you still can’t blindly cut Unicode strings, even at character boundaries (which can be easily found in a fixed width encoding). Because Unicode allows combining characters for diacritical and other marks, a chop between two code points could result in a character at the end of the string missing its accents, or stray accents at the beginning of the string (or strange side effects from double width combining marks, which are too confusing to contemplate here).
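Both failure modes, cutting inside a byte sequence and cutting between a base character and its combining mark, can be reproduced in a few lines. The sketch below is in Python rather than PHP, and the sample string is arbitrary: the Bengali vocalic RR followed by an “a” with a combining tilde.

```python
text = "\u09e0a\u0303"          # vocalic RR, then "a" + combining tilde
raw = text.encode("utf-8")      # 3 + 1 + 2 = 6 bytes

# A byte-wise cut can land in the middle of a multibyte sequence.
try:
    raw[:2].decode("utf-8")
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False
print(byte_cut_ok)              # False

# A cut at a code point boundary always decodes, but can strand a mark.
head, tail = text[:2], text[2:]
print(head == "\u09e0a")        # True: the "a" has lost its tilde
print(tail == "\u0303")         # True: a stray combining tilde
```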

Any function that in turn relies on substring operations cannot be safely used either. For PHP, this includes things such as wordwrap() and chunk_split().

In PHP, Unicode substring support comes from the mbstring (multibyte string) extension, which does not come bundled with the default PHP binaries. Once this extension is installed, it presents you with alternative string manipulation functions: mb_substr() replaces substr() and so on. In fact, the mbstring extension contains support for overloading the existing string manipulation functions, so simply calling the regular function will actually call the mb_...() function automatically. It’s worth noting though that overloading can also cause issues. If you’re using any of the string manipulation functions anywhere to handle binary data (and here we mean real binary data, not textual data treated as binary), then if you overload the string manipulation functions, you will break your binary handling code. Because of this, it’s often safest to explicitly use multibyte functions where you mean to.

In addition to worrying about the manipulation of UTF-8-encoded strings, the other function you’ll need at the language level is the ability to verify the validity of the data. Not every stream of bytes is valid UTF-8. We’ll explore this in depth in Chapter 5.
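As a quick taste of what validation involves, the sketch below (in JavaScript, with a hypothetical helper name) leans on the standard TextDecoder class, which throws when created with the fatal option and handed a malformed byte sequence. It is not the approach developed in Chapter 5, only an illustration that some byte streams are rejected.

```javascript
// Returns true when the bytes form well-formed UTF-8.
function isValidUtf8(bytes) {
    try {
        // With fatal: true, decode() throws a TypeError on malformed input
        // instead of substituting U+FFFD.
        new TextDecoder('utf-8', { fatal: true }).decode(bytes);
        return true;
    } catch (e) {
        return false;
    }
}

console.log(isValidUtf8(new Uint8Array([0x68, 0xC3, 0xA9]))); // true: 'h' plus a 2-byte character
console.log(isValidUtf8(new Uint8Array([0x68, 0xC3])));       // false: truncated sequence
console.log(isValidUtf8(new Uint8Array([0xA9])));             // false: stray continuation byte
```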

Using UTF-8 with Other Languages

The techniques we’ve talked about with PHP apply equally well to other languages that lack core Unicode support, including Perl versions previous to 5.6 and older legacy languages. As long as the language can transparently work with streams of bytes, we can pass around strings as opaque chunks of binary data. For any string manipulation or verification, we’ll need to shell out to a dedicated library such as iconv or ICU to do the dirty work.

Many languages now come with full or partial Unicode support built in. Perl versions 5.8.0 and later can work transparently with Unicode strings, while version 5.6.0 has limited support using the use utf8 pragma. Perl 6 plans to have very extensive Unicode support, allowing you to manipulate strings at the byte, code point, and grapheme levels. PHP 6 plans to have Unicode support built right into the language, which should make porting existing code a fairly painless experience. Ruby 1.8 has no explicit Unicode support—like PHP, it treats strings as sequences of 8-bit bytes. Unicode support of some kind is planned for Ruby 1.9/2.0.

Java and .NET both have full Unicode support, which means you can skip the annoying workarounds in this chapter and work directly with strings inside the languages. However, even with native Unicode strings, you’ll always need to ensure that the data you receive from the outside world is valid in your chosen encoding. The default behavior for your language may be to throw an error when you attempt to manipulate a badly encoded string, so either filtering strings at the input boundary or being ready to catch possible exceptions deep inside your application is important. It’s well worth picking up a book specific to using Unicode strings with your language of choice.

Using UTF-8 with MySQL

As with PHP, as long as your medium supports raw byte streams, then it supports UTF-8. MySQL does indeed support byte streams, so storing and retrieving UTF-8-encoded strings works in just the same way as storing plain ASCII text.

If we can read and write data, what else is left? As with PHP, there are a couple of important issues. Sorting, something you often want to do in the database layer rather than the code layers, will also need to work with our Unicode data. Luckily for us, as we already discussed, UTF-8 can be binary sorted and comes out in code point order. This means that the regular MySQL sort works fine with your UTF-8 data, as long as you define your columns with the BINARY attribute (for CHAR and VARCHAR columns) and use BLOB instead of TEXT types.

As with PHP, the thing we need to worry about is string manipulation. You can usually avoid most string manipulation by moving logic from your SQL into your code layers. Avoid using SQL statements of this type:

SELECT SUBSTRING(name, 1, 1) FROM UserNames;

Instead, move the same logic into your business logic layer:

<?php
    $rows = db_fetch_all("SELECT name FROM UserNames;");
    foreach($rows as $k => $v){
        // take the first character (not the first byte) of each UTF-8 name
        $rows[$k]['name'] = mb_substr($v['name'], 0, 1, 'UTF-8');
    }
?>

In some cases this is going to be a problem—if you were using a substring operation in your SQL to select or join against, then you’ll no longer be able to perform that operation. The alternative is either to have character set support inside your database (which we’ll talk about in a moment) or to lay out your data in such a way that you simplify the query. For instance, if you were performing a substring operation to group records by the first character in a certain field, you could store the first character (as a set of normalized code points) in a separate field and use that field directly, avoiding any string operations inside the database.
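Computing the value for such a separate field might look something like the following JavaScript sketch. The function name is hypothetical, NFC is just one reasonable choice of normalization form, and "first character" is approximated as the first code point plus any combining marks attached to it.

```javascript
// Returns the normalized first character of a name, suitable for storing
// in its own column and grouping on without any string functions in SQL.
function firstCharacterKey(name) {
    // NFC composes base letters and combining marks where possible, so
    // "E + combining acute" and precomposed U+00C9 yield the same key.
    const normalized = name.normalize('NFC');
    const match = normalized.match(/^\P{M}\p{M}*/u);
    return match ? match[0] : '';
}

console.log(firstCharacterKey('E\u0301mile') === '\u00C9'); // true
console.log(firstCharacterKey('\u00C9mile') === '\u00C9');  // true
```

The key is computed once, when the row is written, so the grouping query itself only ever compares whole column values.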

MySQL also has another set of string manipulation functions that it uses in the background, which you can easily miss. To create FULLTEXT indexes, MySQL needs to chop up the input string into different words to individually index. Without support for UTF-8, Unicode strings will be incorrectly sliced up for indexing, which can return some really bizarre and unexpected results.

Unlike the explicit string manipulation functions, there’s no way you can move the text-indexing logic into your code layers without rewriting the text indexer from scratch. Since a text indexer is a fairly sophisticated piece of code, and somebody has already written it for us in the shape of MySQL’s FULLTEXT indexes, it would be a big waste of your time to implement it yourself.

Luckily, MySQL version 4.1 saved us from doing any work; it comes with support for multiple character sets and collations, including UTF-8. When you create a table, you can specify per column character sets, or you can set a default for a server, database, or table to avoid having to be specific every time you create a new column. Data in this column is then stored in that format, regular string manipulation functions can be used, and FULLTEXT indexes work correctly.

It also has the nice benefit of changing column-length specifications from bytes to characters. Previous to version 4.1, a MySQL column type of CHAR(10) meant 10 bytes, so you could store between 2 and 10 UTF-8 characters. In version 4.1, CHAR(10) means 10 characters and so might take up 10 or more bytes. If you’re concerned about space, you should avoid using the CHAR type (and instead use VARCHAR) as a CHAR(10) column actually needs 30 bytes to account for each of the 10 characters potentially having 3 bytes.

MySQL currently has a limitation of 3 bytes per character for UTF-8, which means it can’t store code points above U+FFFF. This probably isn’t an issue for most people: this region contains musical symbols, Old Persian characters, Aegean numbers, and other such oddities. But it’s worth bearing in mind that some code points can’t be stored, and you might want to account for this in your data-filtering code.
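One way such a filter could work is sketched below in JavaScript. The helper names are hypothetical, and substituting U+FFFD (the Unicode replacement character) is an assumption; rejecting the input outright is an equally valid policy.

```javascript
// Matches any code point above U+FFFF, which needs 4 bytes in UTF-8.
const ASTRAL = /[\u{10000}-\u{10FFFF}]/gu;

// True when the text contains a code point a 3-byte column cannot hold.
function hasAstral(text) {
    return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

// Swap each unstorable code point for a replacement before saving.
function replaceAstral(text, replacement = '\uFFFD') {
    return text.replace(ASTRAL, replacement);
}

// U+1D11E is MUSICAL SYMBOL G CLEF, one of the musical symbols above U+FFFF.
console.log(hasAstral('a\u{1D11E}b'));                    // true
console.log(replaceAstral('a\u{1D11E}b') === 'a\uFFFDb'); // true
```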

Using UTF-8 with Email

If your application sends out email, then it will need to support the character set and encoding used by the application itself—otherwise, you’d be in a situation where a user can register using a Cyrillic name, but you can’t include a greeting to that user in any email you send.

Specifying the character set and encoding to be used with an outgoing email is very similar to specifying for a web page. Every email has one or more blocks of headers, in a format similar to HTTP headers, describing various things about the mail—the recipient, the time, the subject, and so on. Character sets and encodings are specified through the content-type header, as with HTTP responses:

Content-Type: text/plain; charset=utf-8

The problem with the content-type header is that it describes the contents of the email body. As with HTTP, email headers must be pure ASCII—many mail transport agents are not 8-bit safe and so will strip characters outside of the ASCII range. If we want to put any string data into the headers, such as subjects or sender names, then we have to do it using ASCII.

Clearly, this is madness: you have lovely UTF-8 data and you want to use it in your email subject lines. Luckily, there’s a fairly simple solution. Headers can include something defined in RFC 1342 (“Representation of Non-ASCII Text in Internet Message Headers,” since superseded by RFC 2047) as an encoded word. An encoded word looks like this:

=?utf-8?Q?hello_=E2=98=BA?= 
=?charset?encoding?encoded-text?=

The charset element contains the character set name, and the encoding element is either “B” or “Q.” The encoded text is the string in the specified character set, encoded using the specified method.

The “B” encoding is straightforward base64, as defined in RFC 3548. The “Q” encoding is a variation on quoted-printable, with the following rules:

  • Any byte can be represented as a literal equal sign (=) followed by its two-digit hexadecimal value. For example, the byte 0x8A can be represented by the sequence =8A.

  • Spaces (byte 0x20) must be replaced with the literal underscore ( _, byte 0x5F).

  • ASCII alphanumeric characters can be left as is.

The quoted printable “Q” method is usually preferred because simple ASCII strings are still recognizable. This can aid debugging greatly and allow you to easily read the raw headers of a mail on an ASCII terminal and mostly understand them.

This encoding can be accomplished with a small PHP function:

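function email_escape($text){
    // Escape every byte that is not an ASCII letter or a space as =XX
    $text = preg_replace_callback('/[^a-z ]/i', function($m){
        return sprintf('=%02X', ord($m[0]));
    }, $text);
    // Spaces are written as underscores
    $text = str_replace(' ', '_', $text);
    return "=?utf-8?Q?$text?=";
}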

We can make a small improvement to this, though—we only need to escape strings that contain more than the basic characters. We save a couple of bytes for each email sent out and make the source more generally readable:

function email_escape($text){
    // Only encode strings that contain something other than letters and spaces
    if (preg_match('/[^a-z ]/i', $text)){
        $text = preg_replace_callback('/[^a-z ]/i', function($m){
            return sprintf('=%02X', ord($m[0]));
        }, $text);
        $text = str_replace(' ', '_', $text);
        return "=?utf-8?Q?$text?=";
    }
    return $text;
}

RFC 1342 states that the length of any individual encoded part should not be longer than 75 characters; to make our function fully compliant, we need to add some further modifications. Since we know each encoded part will need 12 characters of extra fluff (go on, count them), we can split up our encoded text into blocks of 63 characters or less, wrapping each with the prefix and postfix, with a new line between each. Of course, we’ll need to be careful not to split an encoded character down the middle. Implementing the full function is left as an exercise for the reader.
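For readers who would rather check their answer, here is one possible shape for the splitting logic, sketched in JavaScript rather than PHP. The names are invented for illustration, and the sketch makes two assumptions beyond the text above: digits are left unescaped along with letters, and consecutive encoded words are joined with a CRLF followed by a space, which is how a folded header line continues.

```javascript
const PREFIX = '=?utf-8?Q?';
const SUFFIX = '?=';
// 75 characters per encoded word, minus the 12 characters of wrapping.
const MAX_TEXT = 75 - PREFIX.length - SUFFIX.length; // 63

// Q-encode a single character (one code point) as a unit.
function qEncodeChar(ch) {
    if (/^[A-Za-z0-9]$/.test(ch)) {
        return ch;
    }
    if (ch === ' ') {
        return '_';
    }
    let out = '';
    for (const byte of new TextEncoder().encode(ch)) {
        out += '=' + byte.toString(16).toUpperCase().padStart(2, '0');
    }
    return out;
}

function emailEscape(text) {
    // Plain alphanumeric text needs no encoding at all.
    if (/^[A-Za-z0-9 ]*$/.test(text)) {
        return text;
    }
    const words = [];
    let current = '';
    // for...of walks the string by code point, so a multibyte character
    // is always encoded whole and never split across two encoded words.
    for (const ch of text) {
        const encoded = qEncodeChar(ch);
        if (current.length + encoded.length > MAX_TEXT) {
            words.push(PREFIX + current + SUFFIX);
            current = '';
        }
        current += encoded;
    }
    words.push(PREFIX + current + SUFFIX);
    return words.join('\r\n ');
}

console.log(emailEscape('hello \u263A')); // =?utf-8?Q?hello_=E2=98=BA?=
```

Because a chunk is closed before a character that would overflow it is appended, no encoded word ever exceeds 75 characters.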

We’ve talked about both body and header encoding, so all that’s left is to bundle up what we’ve learned into a single function for safely sending UTF-8 email:

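function email_send($to_name, $to_email, $subject, $message, $from_name, $from_email){
    $from_name = email_escape($from_name);
    $to_name   = email_escape($to_name);
    $subject   = email_escape($subject);

    // Encoded words may not appear inside quoted strings, so the names are left unquoted
    $headers  = "To: $to_name <$to_email>\r\n";
    $headers .= "From: $from_name <$from_email>\r\n";
    $headers .= "Reply-To: $from_email\r\n";
    // The Content-Type header is only meaningful in a MIME message
    $headers .= "MIME-Version: 1.0\r\n";
    $headers .= "Content-Transfer-Encoding: 8bit\r\n";
    $headers .= "Content-Type: text/plain; charset=utf-8";

    mail($to_email, $subject, $message, $headers);
}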

Using UTF-8 with JavaScript

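Modern browsers have Unicode support built right into JavaScript: the basic String class stores 16-bit UTF-16 code units rather than bytes, so every character in the Basic Multilingual Plane is a single unit (characters above U+FFFF occupy two, as a surrogate pair), and the string manipulation functions work correctly for the vast majority of text. When you copy data in and out of forms using JavaScript, the data that finally gets submitted is UTF-8 (assuming you specified that for the page’s encoding type).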

The only thing to watch out for is that the built-in function escape(), which is used to format strings for inclusion in a URL, does not produce UTF-8 for Unicode characters: characters above U+00FF are emitted in a nonstandard %uXXXX form that servers generally won’t decode. This means that if you want to let users input text that you’ll then build a URL from (such as building a GET query string), then you can’t use escape().

Luckily, since JavaScript lets you query the numeric value of each character in a string using the String charCodeAt() method, you can fairly easily write your own UTF-8-safe escaping function:

function escape_utf8(data) {
    if (data == '' || data == null){
        return '';
    }
    data = data.toString();
    var buffer = '';
    for(var i=0; i<data.length; i++){
        var c = data.charCodeAt(i);
        // charCodeAt() returns UTF-16 code units, so combine a
        // surrogate pair into the single code point it represents
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < data.length){
            var low = data.charCodeAt(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF){
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                i++;
            }
        }
        var bs = new Array();
        if (c >= 0x10000){
            // 4 bytes
            bs[0] = 0xF0 | ((c & 0x1C0000) >>> 18);
            bs[1] = 0x80 | ((c & 0x3F000) >>> 12);
            bs[2] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[3] = 0x80 | (c & 0x3F);
        }else if (c >= 0x800){
            // 3 bytes
            bs[0] = 0xE0 | ((c & 0xF000) >>> 12);
            bs[1] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[2] = 0x80 | (c & 0x3F);
        }else if (c >= 0x80){
            // 2 bytes
            bs[0] = 0xC0 | ((c & 0x7C0) >>> 6);
            bs[1] = 0x80 | (c & 0x3F);
        }else{
            // 1 byte
            bs[0] = c;
        }
        // write each byte out in %XX form
        for(var j=0; j<bs.length; j++){
            var b = bs[j];
            var hex = nibble_to_hex((b & 0xF0) >>> 4)
                + nibble_to_hex(b & 0x0F);
            buffer += '%'+hex;
        }
    }
    return buffer;
}
function nibble_to_hex(nibble){
    var chars = '0123456789ABCDEF';
    return chars.charAt(nibble);
}

The escape_utf8() function works by iterating over each code point in the string, creating a UTF-8 byte stream. It then loops over the bytes in this stream, formatting each one using the %XX format for escaping bytes in a URL. A further improvement to this function would be to leave alphanumeric characters as-is in the escaped version of the string, so that the returned values are easily readable in the common case.
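JavaScript engines also provide a built-in encodeURIComponent() function, which percent-encodes a string as UTF-8 and already leaves unreserved ASCII characters such as letters and digits untouched. Where it is available, a query string can be built without a hand-rolled escaper; the buildQuery helper below is a hypothetical name used only for this sketch.

```javascript
// Build a GET query string from a plain object of parameters.
function buildQuery(params) {
    return Object.keys(params)
        .map(function (key) {
            // Both names and values are percent-encoded as UTF-8 bytes.
            return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
        })
        .join('&');
}

// U+00E9 becomes %C3%A9, the space %20, and U+263A %E2%98%BA.
console.log(buildQuery({ q: 'caf\u00E9 \u263A', page: 2 }));
// q=caf%C3%A9%20%E2%98%BA&page=2
```

One difference worth knowing about: encodeURIComponent() throws a URIError if the string contains an unpaired surrogate, so input from untrusted sources may still need checking first.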

Using UTF-8 with APIs

An API has two vectors over which you’re going to need to enforce a character set and encoding: input and output. (Throughout this book, the term API refers to external web services APIs, unless otherwise noted. We’re not talking about language tools or classes.)

As far as the output goes, you probably already have it covered. If API responses are XML based, then you can use the same HTTP and XML headers as we previously discussed. If your output is HTML based, the HTTP header and <meta> tag combination will work fine.

For other custom outputs, using a BOM can be a good idea if you have some way to determine the start of a stream. If you can’t or don’t want to use a BOM, nothing beats just documenting what you’re sending. Making your output character set and encoding explicit early on will guard against people developing applications that work at first but crash when they finally encounter some foreign text.
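For a UTF-8 stream, the BOM is the three bytes EF BB BF. A minimal JavaScript sketch of writing and detecting it might look like this; the helper names are hypothetical, and the consumer is assumed to have the start of the stream in hand as a byte array.

```javascript
// U+FEFF encoded as UTF-8.
const UTF8_BOM = [0xEF, 0xBB, 0xBF];

// Producer side: prefix the encoded text with the BOM.
function withUtf8Bom(text) {
    const body = new TextEncoder().encode(text);
    const out = new Uint8Array(UTF8_BOM.length + body.length);
    out.set(UTF8_BOM, 0);
    out.set(body, UTF8_BOM.length);
    return out;
}

// Consumer side: check the first three bytes of the stream.
function hasUtf8Bom(bytes) {
    return bytes.length >= UTF8_BOM.length &&
        UTF8_BOM.every(function (b, i) { return bytes[i] === b; });
}

console.log(hasUtf8Bom(withUtf8Bom('hi')));              // true
console.log(hasUtf8Bom(new Uint8Array([0x68, 0x69])));   // false
```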

Input to APIs can be a bigger problem. As the saying goes, the only things less intelligent than computers are their users. If you expose a public API to your application, you can’t guarantee that the text sent will be in the correct character set. As with all input vectors, it’s extremely important to verify that all input is both valid and good—something we’re going to look at in detail in the next chapter.
