Unicode

Unicode provides a unique number for every character, regardless of the computing platform, program, or programming language. This matters because, without a standard such as Unicode, computers would continue to use many different character encodings, and those encodings conflict when they are used together.

Unicode support was introduced to Perl with Perl 5.6. Although Perl still does not conform completely to the Unicode specification, its Unicode support has matured significantly in Perl 5.8. You can now use Unicode reliably with file I/O and with regular expressions. A regular expression pattern adapts to its data and automatically switches to Unicode character semantics when the data calls for it.

Perl’s Unicode implementation falls into the following categories:

I/O

Perl 5.6 had no way to mark data that’s read from or written to a file as being of type Unicode (utf8). Perl 5.8 adds I/O layers for this purpose: you can name a layer such as :utf8 in the mode argument of open( ), or apply it to an existing filehandle with binmode( ).
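As a minimal sketch of marking a filehandle as UTF-8 with a Perl 5.8 I/O layer, the following writes a string through a :utf8 layer and reads it back. It assumes Perl 5.8 or later; the variable names are illustrative, and an in-memory filehandle stands in for a real file so the example is self-contained.

```perl
use strict;
use warnings;

# Six characters: c, a, f, e-acute (U+00E9), space, smiley (U+263A).
my $text = "caf\x{e9} \x{263A}";

# Write through the :utf8 layer; $buffer receives the encoded bytes.
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "Cannot open buffer for writing: $!";
print {$out} $text;
close($out);

# Read back through the same layer; the bytes are decoded to characters.
open(my $in, '<:utf8', \$buffer) or die "Cannot open buffer for reading: $!";
my $round_trip = <$in>;
close($in);

printf "%d bytes stored, %d characters read back\n",
    length($buffer), length($round_trip);
```

For a real file, the second argument stays the same and the third becomes a filename. Later Perl releases also accept :encoding(UTF-8), which validates the data more strictly than :utf8.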

Regular expressions

In Perl 5.6, the decision whether to match Unicode characters was made when the pattern was compiled, based on whether the pattern itself contained Unicode characters, and not when matching happened at runtime. Perl 5.8 moves this decision to runtime, so the pattern adapts to the data it is matched against.
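The adaptive behavior that the start of this section attributes to Perl 5.8 can be sketched as follows. The example assumes Perl 5.8 or later and uses illustrative variable names. One pattern that contains no Unicode characters is compiled once and then matched against both an ASCII string and a string of Greek letters.

```perl
use strict;
use warnings;

# The pattern itself is plain ASCII and is compiled only once.
my $word = qr/^\w+$/;

my $ascii = "perl";
my $greek = "\x{3b1}\x{3b2}\x{3b3}";   # alpha, beta, gamma

# The same compiled pattern is applied to both kinds of data.
my $ascii_ok = ($ascii =~ $word) ? 1 : 0;
my $greek_ok = ($greek =~ $word) ? 1 : 0;

print "ascii matched: $ascii_ok, greek matched: $greek_ok\n";
```

Both matches succeed because \w is interpreted with character semantics when the target string holds wide characters.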

use utf8

The utf8 pragma is still needed to enable a few Unicode features. The utf8 module, which implements the pragma, also supplies tables used for Unicode support. You must load the pragma explicitly with use utf8 to enable recognition of UTF-8 encoded literals and identifiers in the source text.
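The effect of the pragma on string literals can be sketched as follows. The example assumes that the source file itself is saved as UTF-8 and uses illustrative variable names. The same literal is written once with the pragma in force and once inside a block that turns it off.

```perl
use strict;
use warnings;
use utf8;                        # literals below are decoded as UTF-8

my $with_pragma = "café";        # four characters

my $without_pragma;
{
    no utf8;                     # inside this block, literals are raw bytes
    $without_pragma = "café";    # five bytes: the é occupies two
}

printf "with utf8: %d, without utf8: %d\n",
    length($with_pragma), length($without_pragma);
```

The pragma is lexically scoped, which is why the inner block can switch it off without affecting the rest of the file.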

Byte and character semantics

As of 5.6.0, Perl uses logically wide characters to represent strings internally, and this internal representation uses the UTF-8 encoding. The goal is for Perl to work with characters rather than bytes, and the design was a deliberate choice that lets programs move from byte semantics to character semantics gradually. Perl switches to character semantics when it finds that the input data contains characters on which it can safely operate as UTF-8. You can disable character semantics by using the bytes pragma, as explained in Chapter 8. Character semantics have the following effects:

  • Strings and patterns may contain characters that have an ordinal value larger than 255.

  • Identifiers within a Perl program may contain Unicode alphanumeric characters.

  • Regular expressions match characters and not bytes.

  • Character classes in regular expressions match characters and not bytes.

  • Named Unicode properties and block ranges may be used as character classes with the \p and \P constructs.

  • \X matches any extended Unicode sequence.

  • tr// matches characters instead of bytes.

  • Case translation operators use the Unicode case translation tables when provided character input.

  • Most operators that deal with positions or lengths in a string switch to using character positions.

  • pack( ) and unpack( ) do not change.

  • Bit operators work on characters.

  • scalar reverse( ) reverses characters and not bytes.
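Several of the effects listed above can be observed with a short script. This is a sketch that assumes Perl 5.8 or later; the variable names are illustrative, and the Greek letters are chosen only because their ordinals are above 255.

```perl
use strict;
use warnings;

# Three Greek letters (alpha, beta, gamma); every ordinal is above 255.
my $greek = "\x{3b1}\x{3b2}\x{3b3}";

# length( ) counts characters; under the bytes pragma it counts UTF-8 bytes.
my $char_count = length($greek);
my $byte_count = do { use bytes; length($greek) };

# Position operators such as index( ) report character offsets.
my $gamma_at = index($greek, "\x{3b3}");

# Case translation uses the Unicode tables: alpha becomes capital Alpha.
my $upper = uc($greek);

# \p{...} tests a named Unicode property, here the Greek script.
my $all_greek = ($greek =~ /^\p{Greek}+$/) ? 1 : 0;

# tr// translates whole characters: alpha (U+03B1) to omega (U+03C9).
(my $swapped = $greek) =~ tr/\x{3b1}/\x{3c9}/;

# scalar reverse( ) reverses by character, so no UTF-8 sequence is split.
my $reversed = scalar reverse($greek);

printf "%d characters, %d bytes, gamma at offset %d\n",
    $char_count, $byte_count, $gamma_at;
printf "first uppercase ordinal: U+%04X\n", ord($upper);
```

The do block confines the bytes pragma to a single expression, so the rest of the script keeps character semantics.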

