Unicode

Plain strings are converted into Unicode strings either explicitly, with the unicode built-in, or implicitly, when you pass a plain string to a function that expects Unicode. In either case, the conversion is done by an auxiliary object known as a codec (for coder-decoder). A codec can also convert Unicode strings to plain strings either explicitly, with the encode method of Unicode strings, or implicitly.

You identify a codec by passing the codec name to unicode or encode. When you pass no codec name and for implicit conversion, Python uses a default encoding, normally 'ascii‘. (You can change the default encoding in the startup phase of a Python program, as covered in Chapter 13; see also setdefaultencoding in Chapter 8.) Every conversion has an explicit or implicit argument errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character causing an error with '?' in a plain-string result or with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters that cause errors.

The codecs Module

The mapping of codec names to codec objects is handled by the codecs module. This module lets you develop your own codec objects and register them so that they can be looked up by name, just like built-in codecs. Module codecs also lets you look up any codec explicitly, obtaining the functions the codec uses ...

Get Python in a Nutshell now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.