Unicode
Plain strings are converted into Unicode strings either explicitly, with the unicode built-in, or implicitly, when you pass a plain string to a function that expects Unicode. In either case, the conversion is done by an auxiliary object known as a codec (for coder-decoder). A codec can also convert Unicode strings to plain strings either explicitly, with the encode method of Unicode strings, or implicitly.
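As a rough illustration of explicit conversion in both directions, the sketch below decodes a plain string to Unicode and encodes it back with a different codec. It is written with the decode and encode methods so that it should run unchanged on Python 2.6+ and Python 3.3+. The sample data and the variable names (plain, text, latin1) are assumptions made for the example, not taken from the text.

```python
# Sketch only. b'...' is a plain (byte) string; u'...' is a Unicode string.
plain = b'caf\xc3\xa9'          # five bytes: "cafe" with an accented e, in UTF-8

# Explicit plain -> Unicode conversion through the 'utf-8' codec.
# On Python 2, unicode(plain, 'utf-8') is an equivalent spelling.
text = plain.decode('utf-8')

# Explicit Unicode -> plain conversion, this time naming the 'latin-1' codec.
latin1 = text.encode('latin-1')

print(text == u'caf\xe9')       # the four-character Unicode string
print(latin1 == b'caf\xe9')     # one byte per character in Latin-1
```

The same Unicode string produces different plain strings depending on the codec named, which is why the codec name matters whenever the data is not pure ASCII.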
You identify a codec by passing the codec name to unicode or encode. When you pass no codec name, and for implicit conversion, Python uses a default encoding, normally 'ascii'. (You can change the default encoding in the startup phase of a Python program, as covered in Chapter 13; see also setdefaultencoding in Chapter 8.) Every conversion has an explicit or implicit argument errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character causing an error with '?' in a plain-string result or with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters that cause errors.
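The three errors settings can be compared side by side. The following is a minimal sketch, assuming the 'ascii' codec is named explicitly so that the behavior does not depend on the interpreter's default encoding. The sample strings and variable names are illustrative only.

```python
# Sketch only: should behave the same on Python 2.6+ and Python 3.3+.
text = u'caf\xe9'    # Unicode string with one non-ASCII character
raw = b'caf\xe9'     # plain string with one non-ASCII byte

# errors='strict' (the default): any error raises an exception.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeError:
    strict_failed = True

# errors='replace': '?' in a plain-string result, u'\ufffd' in a Unicode result.
replaced_plain = text.encode('ascii', 'replace')
replaced_text = raw.decode('ascii', 'replace')

# errors='ignore': the offending character is silently dropped.
ignored = text.encode('ascii', 'ignore')

print(strict_failed)
print(replaced_plain == b'caf?')
print(replaced_text == u'caf\ufffd')
print(ignored == b'caf')
```

Note that 'replace' and 'ignore' both lose information, so 'strict' is usually the safer choice when the data must round-trip.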
The codecs Module
The mapping of codec names to codec objects is handled by the codecs module. This module lets you develop your own codec objects and register them so that they can be looked up by name, just like built-in codecs. The codecs module also lets you look up any codec explicitly, obtaining the functions the codec uses ...
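An explicit lookup might look like the sketch below. It assumes Python 2.5 or later (including Python 3), where codecs.lookup returns an object that exposes the codec's functions as the attributes encode and decode. The sample strings are illustrative only.

```python
import codecs

# Look up a codec by name, as the encode and decode methods do internally.
info = codecs.lookup('utf-8')

# The codec's encode and decode functions each return a pair:
# (converted result, number of input items consumed).
encoded, chars_consumed = info.encode(u'caf\xe9')
decoded, bytes_consumed = info.decode(b'caf\xc3\xa9')

print(encoded == b'caf\xc3\xa9', chars_consumed)
print(decoded == u'caf\xe9', bytes_consumed)
```

Unlike the string methods, these lower-level functions also report how much input was consumed, which can help when decoding data that arrives in chunks.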