Caveats

As of this writing (that is, with respect to version 5.6.0 of Perl), there are still some caveats on the use of Unicode. (Check your online docs for updates.)

  • The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination of whether a particular pattern will match Unicode characters is made when the pattern is compiled (based on whether the pattern contains Unicode characters) and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

  • There is currently no easy way to mark data read from a file or other external source as being utf8. This will be a major area of focus in the near future and is probably already fixed as you read this.

  • There is no method for automatically coercing input and output to some encoding other than UTF-8. This is planned in the near future, however, so check your online docs.

  • Use of locales with utf8 may lead to odd results. Currently, there is some attempt to apply 8-bit locale information to characters in the range 0..255, but this is demonstrably incorrect for locales that use characters above that range (when mapped into Unicode). Code that combines locales with utf8 will also tend to run slower. Avoidance of locales is strongly encouraged.
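
The second and third caveats, on marking external data as utf8 and on converting to other encodings, were addressed in releases after 5.6.0. What follows is only a minimal sketch, assuming Perl 5.8 or later, where the Encode module ships with the core distribution; it does not work on 5.6.0 itself. The variable names and the sample string are illustrative, not taken from the text above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);   # core module in Perl 5.8 and later (assumption: not 5.6.0)

# Raw bytes as they might arrive from a file or socket:
# the word "cafe" with an accented e, encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# Decode the bytes explicitly so Perl treats them as characters.
my $chars = decode('UTF-8', $bytes);

# Re-encode for an output channel that expects ISO-8859-1 (Latin-1).
my $latin1 = encode('ISO-8859-1', $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length($bytes), length($chars), length($latin1);
# prints "5 bytes in, 4 characters, 4 bytes out"
```

In the same later releases, an I/O layer such as `open(my $fh, '<:encoding(UTF-8)', $path)` performs the decoding as the data is read, so the explicit call to `decode` is needed only for data that arrives by some other route.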

Unicode is fun--you just have to define fun correctly.
