Chapter 1. Text

Introduction

Credit: Fred L. Drake, Jr., PythonLabs

Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.

What Is Text?

Sounds like an easy question, doesn’t it? After all, we know it when we see it, don’t we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.

Unfortunately, all data enters our applications as a sequence of bytes. There’s no library function we can call that will tell us whether a particular sequence of bytes represents text, although we can create some useful heuristics that tell us whether data can safely (not necessarily correctly) be handled as text. Recipe 1.11 shows just such a heuristic.

Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codecs (coder-decoders) objects that embody knowledge about the many standard ways in which sequences of characters can be represented by sequences of bytes (also known as encodings and character sets). Note that Unicode strings do not serve double duty as sequences of bytes. Recipe 1.20, Recipe 1.21, and Recipe 1.22 illustrate the fundamentals of Unicode in Python.
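
As a minimal sketch of the encode/decode round trip described above (the sample text and variable names are only illustrative, and the snippet assumes an interpreter that accepts the u'' prefix, that is, Python 2 or Python 3.3 and later):

```python
text = u'caf\xe9'                    # Unicode string of 4 characters
utf8_bytes = text.encode('utf-8')    # the UTF-8 codec needs 2 bytes for e-acute
latin1_bytes = text.encode('iso-8859-1')  # ISO-8859-1 needs only 1 byte per character

# Decoding with the same codec that produced the bytes restores the text.
roundtrip = utf8_bytes.decode('utf-8')
```

The same characters yield byte sequences of different lengths under different encodings, which is why the encoding must be known before bytes can be interpreted as text.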

Okay, let’s assume that our application knows from the context that it’s looking at text. That’s usually the best approach because that’s where external input comes into play. We’re looking at a file either because it has a well-known name and defined format (common in the “Unix” world) or because it has a well-known filename extension that indicates the format of the contents (common on Windows). But now we have a problem: we had to use the word format to make the previous paragraph meaningful. Wasn’t text supposed to be simple?

Let’s face it: there’s no such thing as “pure” text, and if there were, we probably wouldn’t care about it (with the possible exception of applications in the field of computational linguistics, where pure text may indeed sometimes be studied for its own sake). What we want to deal with in our applications is information contained in text. The text we care about may contain configuration data, commands to control or define processes, documents for human consumption, or even tabular data. Text that contains configuration data or a series of commands usually can be expected to conform to a fairly strict syntax that can be checked before relying on the information in the text. Informing the user of an error in the input text is typically sufficient to deal with things that aren’t what we were expecting.

Documents intended for humans tend to be simple, but they vary widely in detail. Since they are usually written in a natural language, their syntax and grammar can be difficult to check, at best. Different texts may use different character sets or encodings, and it can be difficult or even impossible to tell which character set or encoding was used to create a text if that information is not available in addition to the text itself. It is, however, necessary to support proper representation of natural-language documents. Natural-language text has structure as well, but the structures are often less explicit in the text and require at least some understanding of the language in which the text was written. Characters make up words, which make up sentences, which make up paragraphs, and still larger structures may be present as well. Paragraphs alone can be particularly difficult to locate unless you know what typographical conventions were used for a document: is each line a paragraph, or can multiple lines make up a paragraph? If the latter, how do we tell which lines are grouped together to make a paragraph? Paragraphs may be separated by blank lines, indentation, or some other special mark. See Recipe 19.10 for an example of reading a text file as a sequence of paragraphs separated by blank lines.

Tabular data has many issues that are similar to the problems associated with natural-language text, but it adds a second dimension to the input format: the text is no longer linear—it is no longer a sequence of characters, but rather a matrix of characters from which individual blocks of text must be identified and organized.

Basic Textual Operations

As with any other data format, we need to do different things with text at different times. However, there are still three basic operations:

  • Parsing the data into a structure internal to our application

  • Transforming the input into something similar in some way, but with changes of some kind

  • Generating completely new data

Parsing can be performed in a variety of ways, and many formats can be suitably handled by ad hoc parsers that deal effectively with a very constrained format. Examples of this approach include parsers for RFC 2822-style email headers (see the rfc822 module in Python’s standard library) and the configuration files handled by the ConfigParser module. The netrc module offers another example of a parser for an application-specific file format, this one based on the shlex module. shlex offers a fairly typical tokenizer for basic languages, useful in creating readable configuration files or allowing users to enter commands to an interactive prompt. These sorts of ad hoc parsers are abundant in Python’s standard library, and recipes using them can be found in Chapter 2 and Chapter 13. More formal parsing tools are also available for Python; they depend on larger add-on packages and are surveyed in the introduction to Chapter 16.
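
To give a feel for the kind of tokenizing shlex performs, here is a small sketch using shlex.split; the command line being parsed is made up for illustration:

```python
import shlex

# A hypothetical command typed at an interactive prompt.
command_line = 'copy "my file.txt" /tmp/backup'

# shlex.split honors shell-style quoting, so the quoted name stays one token.
tokens = shlex.split(command_line)
```

Here tokens ends up as ['copy', 'my file.txt', '/tmp/backup'], whereas a plain command_line.split( ) would break the quoted filename in two.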

Transforming text from one format to another is more interesting when viewed as text processing, which is what we usually think of first when we talk about text. In this chapter, we’ll take a look at some ways to approach transformations that can be applied for different purposes. Sometimes we’ll work with text stored in external files, and other times we’ll simply work with it as strings in memory.

The generation of textual data from application-specific data structures is most easily performed using Python’s print statement or the write method of a file or file-like object. This is often done using a method of the application object or a function, which takes the output file as a parameter. The function can then use statements such as these:

print >>thefile, sometext
thefile.write(sometext)

which generate output to the appropriate file. However, this isn’t generally thought of as text processing, as here there is no input text to be processed. Examples of using both print and write can of course be found throughout this book.

Sources of Text

Working with text stored as a string in memory can be easy when the text is not too large. Operations that search the text can operate over multiple lines very easily and quickly, and there’s no need to worry about searching for something that might cross a buffer boundary. Being able to keep the text in memory as a simple string makes it very easy to take advantage of the built-in string operations available as methods of the string object.

File-based transformations deserve special treatment, because there can be substantial overhead related to I/O performance and the amount of data that must actually be stored in memory. When working with data stored on disk, we often want to avoid loading entire files into memory, due to the size of the data: loading an 80 MB file into memory should not be done too casually! When our application needs only part of the data at a time, working on smaller segments of the data can yield substantial performance improvements, simply because we’ve allowed enough space for our program to run. If we are careful about buffer management, we can still maintain the performance advantage of using a small number of relatively large disk read and write operations by working on large chunks of data at a time. File-related recipes are found in Chapter 12.
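
A rough sketch of chunk-at-a-time processing follows. The function name count_char_in_chunks and the chunk size are assumptions made for this example, and io.StringIO merely stands in for a real file object:

```python
import io

def count_char_in_chunks(fileobj, char, chunk_size=64 * 1024):
    """Count occurrences of one character, reading chunk_size characters at a time."""
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:            # an empty read signals end of file
            break
        total += chunk.count(char)
    return total

# An in-memory stand-in for a file on disk, read with a deliberately tiny chunk.
sample = io.StringIO(u'one\ntwo\nthree\n')
newlines = count_char_in_chunks(sample, u'\n', chunk_size=4)
```

Counting a single character is safe with any chunk size. Searching for a longer pattern would also have to handle matches that straddle two chunks, which is the buffer-boundary concern that in-memory strings avoid.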

Another interesting source for textual data comes to light when we consider the network. Text is often retrieved from the network using a socket. While we can always view a socket as a file (using the makefile method of the socket object), the data that is retrieved over a socket may come in chunks, or we may have to wait for more data to arrive. The textual data may not consist of all data until the end of the data stream, so a file object created with makefile may not be entirely appropriate to pass to text-processing code. When working with text from a network connection, we often need to read the data from the connection before passing it along for further processing. If the data is large, it can be handled by saving it to a file as it arrives and then using that file when performing text-processing operations. More elaborate solutions can be built when the text processing needs to be started before all the data is available. Examples of parsers that are useful in such situations may be found in the htmllib and HTMLParser modules in the standard library.

String Basics

The main tool Python gives us to process text is strings—immutable sequences of characters. There are actually two kinds of strings: plain strings, which contain 8-bit (ASCII) characters; and Unicode strings, which contain Unicode characters. We won’t deal much with Unicode strings here: their functionality is similar to that of plain strings, except each character takes up 2 (or 4) bytes, so that the number of different characters is in the tens of thousands (or even billions), as opposed to the 256 different characters that make up plain strings. Unicode strings are important if you must deal with text in many different alphabets, particularly Asian ideographs. Plain strings are sufficient to deal with English or any of a limited set of non-Asian languages. For example, all western European alphabets can be encoded in plain strings, typically using the international standard encoding known as ISO-8859-1 (or ISO-8859-15, if you need the Euro currency symbol as well).

In Python, you express a literal string (curiously more often known as a string literal) as:

'this is a literal string'
"this is another string"

String values can be enclosed in either single or double quotes. The two different kinds of quotes work the same way, but having both allows you to include one kind of quotes inside of a string specified with the other kind of quotes, without needing to escape them with the backslash character:

'isn\'t that grand'
"isn't that grand"

To have a string literal span multiple lines, you can use a backslash as the last character on the line, which indicates that the next line is a continuation:

big = "This is a long string\
that spans two lines."

You must embed newlines in the string if you want the string to output on two lines:

big = "This is a long string\n\
that prints on two lines."

Another approach is to enclose the string in a pair of matching triple quotes (either single or double):

bigger = """
This is an even 
bigger string that 
spans three lines.
"""

Using triple quotes, you don’t need to use the continuation character, and line breaks in the string literal are preserved as newline characters in the resulting Python string object. You can also make a string literal a "raw" string by preceding it with an r or R:

big = r"This is a long string\
with a backslash and a newline in it"

With a raw string, backslash escape sequences are left alone, rather than being interpreted. Finally, you can precede a string literal with a u or U to make it a Unicode string:

hello = u'Hello\u0020World'

Strings are immutable, which means that no matter what operation you do on a string, you will always produce a new string object, rather than mutating the existing string. A string is a sequence of characters, which means that you can access a single character by indexing:

mystr = "my string"     
mystr[0]        # 'm'
mystr[-2]       # 'n'

You can also access a portion of the string with a slice:

mystr[1:4]      # 'y s'
mystr[3:]       # 'string'
mystr[-3:]      # 'ing'

Slices can be extended, that is, include a third parameter that is known as the stride or step of the slice:

mystr[:3:-1]    # 'gnirt'
mystr[1::2]     # 'ysrn'

You can loop on a string’s characters:

for c in mystr:

This binds c to each of the characters in mystr in turn. You can form another sequence:

list(mystr)     # returns ['m','y',' ','s','t','r','i','n','g']

You can concatenate strings by addition:

mystr+'oid'     # 'my stringoid'

You can also repeat strings by multiplication:

'xo'*3          # 'xoxoxo'

In general, you can do anything to a string that you can do to any other sequence, as long as it doesn’t require changing the sequence, since strings are immutable.
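
A short sketch of what immutability means in practice (the variable names are illustrative only): item assignment fails, so a changed string must be built as a new object.

```python
mystr = "my string"

try:
    mystr[0] = 'M'               # strings do not support item assignment
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Build a new string from pieces of the old one instead.
capitalized = 'M' + mystr[1:]
```

After this runs, mystr is still "my string" and capitalized is a separate object holding 'My string'.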

String objects have many useful methods. For example, you can test a string’s contents with s.isdigit( ), which returns True if s is not empty and all of the characters in s are digits (otherwise, it returns False). You can produce a new modified string with a method call such as s.upper( ), which returns a new string that is like s, but with every letter changed into its uppercase equivalent. You can search for a string inside another with haystack.count('needle'), which returns the number of times the substring 'needle' appears in the string haystack. When you have a large string that spans multiple lines, you can split it into a list of single-line strings with splitlines:

list_of_lines = one_large_string.splitlines( )

You can produce the single large string again with join:

one_large_string = '\n'.join(list_of_lines)
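
The methods just mentioned can be tried together in a small sketch; the sample strings are invented for the purpose:

```python
one_large_string = 'alpha\nbeta 42\ngamma'

list_of_lines = one_large_string.splitlines()   # ['alpha', 'beta 42', 'gamma']
rebuilt = '\n'.join(list_of_lines)              # same text as the original here

all_digits = '2004'.isdigit()            # True: non-empty and digits only
not_all_digits = 'beta 42'.isdigit()     # False: letters and a space are present
shouted = 'beta 42'.upper()              # 'BETA 42'
occurrences = 'abracadabra'.count('abra')   # 2 non-overlapping occurrences
```

Note that the round trip through splitlines and join restores the original only because this sample has no trailing newline; join puts the separator between items, not after the last one.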

The recipes in this chapter show off many methods of the string object. You can find complete documentation in Python’s Library Reference and Python in a Nutshell.

Strings in Python can also be manipulated with regular expressions, via the re module. Regular expressions are a powerful (but complicated) set of tools that you may already be familiar with from another language (such as Perl), or from the use of tools such as the vi editor and text-mode commands such as grep. You’ll find a number of uses of regular expressions in recipes in the second half of this chapter. For complete documentation, see the Library Reference and Python in a Nutshell. J.E.F. Friedl, Mastering Regular Expressions (O’Reilly) is also recommended if you need to master this subject—Python’s regular expressions are basically the same as Perl’s, which Friedl covers thoroughly.
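
For readers who have not met the re module yet, a tiny sketch of two common calls may help; the patterns and sample text are illustrative assumptions, not taken from any recipe:

```python
import re

text = 'Recipe 1.11 and Recipe 1.20 deal with text'

# findall returns every non-overlapping match of the pattern as a list.
recipe_numbers = re.findall(r'\d+\.\d+', text)

# sub replaces each match; here any run of whitespace becomes one space.
normalized = re.sub(r'\s+', ' ', 'too   many\t spaces')
```

The patterns are written as raw strings so that the backslashes reach the re module untouched.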

Python’s standard module string offers much of the same functionality that is available from string methods, packaged up as functions instead of methods. The string module also offers a few additional functions, such as the useful string.maketrans function that is demonstrated in a few recipes in this chapter; several helpful string constants (string.digits, for example, is '0123456789') and, in Python 2.4, the new class Template, for simple yet flexible formatting of strings with embedded variables, which as you’ll see features in one of this chapter’s recipes. The string-formatting operator, %, provides a handy way to put strings together and to obtain precisely formatted strings from such objects as floating-point numbers. Again, you’ll find recipes in this chapter that show how to use % for your purposes. Python also has lots of standard and extension modules that perform special processing on strings of many kinds. This chapter doesn’t cover such specialized resources, but Chapter 12 is, for example, entirely devoted to the important specialized subject of processing XML.
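
The following sketch touches three of the facilities named above: a string constant, the Template class (Python 2.4 and later), and the % operator. The message text and values are made up for the example:

```python
import string

# A constant from the string module.
is_digit_char = '7' in string.digits

# Template substitutes $-prefixed placeholders by name.
greeting = string.Template('Hello, $name! You have $count new messages.')
message = greeting.substitute(name='Fred', count=3)

# The % operator gives precise control over width and precision.
row = '%-8s|%6.2f' % ('pi', 3.14159)
```

In the last line, %-8s left-justifies the name in 8 columns and %6.2f right-justifies the number in 6 columns with 2 decimal places, giving 'pi      |  3.14'.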

1.1. Processing a String One Character at a Time

Credit: Luther Blissett

Problem

You want to process a string one character at a time.

Solution

You can build a list whose items are the string’s characters (meaning that the items are strings, each of length of one—Python doesn’t have a special type for “characters” as distinct from strings). Just call the built-in list, with the string as its argument:

thelist = list(thestring)

You may not even need to build the list, since you can loop directly on the string with a for statement:

for c in thestring:
    do_something_with(c)

or in the for clause of a list comprehension:

results = [do_something_with(c) for c in thestring]

or, with exactly the same effects as this list comprehension, you can call a function on each character with the map built-in function:

results = map(do_something, thestring)

Discussion

In Python, characters are just strings of length one. You can loop over a string to access each of its characters, one by one. You can use map for much the same purpose, as long as what you need to do with each character is call a function on it. Finally, you can call the built-in type list to obtain a list of the length-one substrings of the string (i.e., the string’s characters). If what you want is a set whose elements are the string’s characters, you can call sets.Set with the string as the argument (in Python 2.4, you can also call the built-in set in just the same way):

import sets
magic_chars = sets.Set('abracadabra')
poppins_chars = sets.Set('supercalifragilisticexpialidocious')
print ''.join(magic_chars & poppins_chars)   # set intersection
acrd
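
With the built-in set of Python 2.4 and later, the same intersection might be sketched as follows. Sorting the result is an addition made here so that the order of the characters does not depend on set internals:

```python
magic_chars = set('abracadabra')
poppins_chars = set('supercalifragilisticexpialidocious')

# & computes the intersection; sorted gives a predictable order.
common = ''.join(sorted(magic_chars & poppins_chars))
```

Here common is 'acdr': the four characters that appear in both words.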

See Also

The Library Reference section on sequences; Perl Cookbook Recipe 1.5.

1.2. Converting Between Characters and Numeric Codes

Credit: Luther Blissett

Problem

You need to turn a character into its numeric ASCII (ISO) or Unicode code, and vice versa.

Solution

That’s what the built-in functions ord and chr are for:

>>> print ord('a')
97
>>> print chr(97)
a

The built-in function ord also accepts as its argument a Unicode string of length one, in which case it returns a Unicode code value, up to 65535. To make a Unicode string of length one from a numeric Unicode code value, use the built-in function unichr:

>>> print ord(u'\u2020')
8224
>>> print repr(unichr(8224))
u'\u2020'

Discussion

It’s a mundane task, to be sure, but it is sometimes useful to turn a character (which in Python just means a string of length one) into its ASCII or Unicode code, and vice versa. The built-in functions ord, chr, and unichr cover all the related needs. Note, in particular, the huge difference between chr(n) and str(n), which beginners sometimes confuse...:

>>> print repr(chr(97))
'a'
>>> print repr(str(97))
'97'

chr takes as its argument a small integer and returns the corresponding single-character string according to ASCII, while str, called with any integer, returns the string that is the decimal representation of that integer.

To turn a string into a list of character value codes, use the built-in functions map and ord together, as follows:

>>> print map(ord, 'ciao')
[99, 105, 97, 111]

To build a string from a list of character codes, use ''.join, map and chr; for example:

>>> print ''.join(map(chr, range(97, 100)))
abc

See Also

Documentation for the built-in functions chr, ord, and unichr in the Library Reference and Python in a Nutshell.

1.3. Testing Whether an Object Is String-like

Credit: Luther Blissett

Problem

You need to test if an object, typically an argument to a function or method you’re writing, is a string (or more precisely, whether the object is string-like).

Solution

A simple and fast way to check whether something is a string or Unicode object is to use the built-ins isinstance and basestring, as follows:

def isAString(anobj):
    return isinstance(anobj, basestring)

Discussion

The first approach to solving this recipe’s problem that comes to many programmers’ minds is type-testing:

def isExactlyAString(anobj):
    return type(anobj) is type('')

However, this approach is pretty bad, as it willfully destroys one of Python’s greatest strengths—smooth, signature-based polymorphism. This kind of test would reject Unicode objects, instances of user-coded subclasses of str, and instances of any user-coded type that is meant to be “string-like”.

Using the isinstance built-in function, as recommended in this recipe’s Solution, is much better. The built-in type basestring exists exactly to enable this approach. basestring is a common base class for the str and unicode types, and any string-like type that user code might define should also subclass basestring, just to make sure that such isinstance testing works as intended. basestring is essentially an “empty” type, just like object, so no cost is involved in subclassing it.

Unfortunately, the canonical isinstance checking fails to accept such clearly string-like objects as instances of the UserString class from Python Standard Library module UserString, since that class, alas, does not inherit from basestring. If you need to support such types, you can check directly whether an object behaves like a string—for example:

def isStringLike(anobj):
    try: anobj + ''
    except: return False
    else: return True

This isStringLike function is slower and more complicated than the isAString function presented in the “Solution”, but it does accept instances of UserString (and other string-like types) as well as instances of str and unicode.
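
To see the behavioral test at work, here is a sketch that applies it to a few objects. The import fallback is an assumption added so the snippet also runs where UserString lives in the collections module, and except Exception is used here in place of the bare except shown above:

```python
try:
    from UserString import UserString      # Python 2 location
except ImportError:
    from collections import UserString     # Python 3 location

def isStringLike(anobj):
    try:
        anobj + ''               # only string-like objects accept this
    except Exception:
        return False
    else:
        return True

results = [isStringLike(obj)
           for obj in ('abc', UserString('abc'), 42, ['a', 'b'])]
```

The plain string and the UserString instance pass, while the integer and the list are rejected, so results is [True, True, False, False].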

The general Python approach to type-checking is known as duck typing: if it walks like a duck and quacks like a duck, it’s duck-like enough for our purposes. The isStringLike function in this recipe goes only as far as the quacks-like part, but that may be enough. If and when you need to check for more string-like features of the object anobj, it’s easy to test a few more properties by using a richer expression in the try clause—for example, changing the clause to:

    try: anobj.lower( ) + anobj + ''

In my experience, however, the simple test shown in the isStringLike function usually does what I need.

The most Pythonic approach to type validation (or any validation task, really) is just to try to perform whatever task you need to do, detecting and handling any errors or exceptions that might result if the situation is somehow invalid—an approach known as “it’s easier to ask forgiveness than permission” (EAFP). try/except is the key tool in enabling the EAFP style. Sometimes, as in this recipe, you may choose some simple task, such as concatenation to the empty string, as a stand-in for a much richer set of properties (such as, all the wealth of operations and methods that string objects make available).

See Also

Documentation for the built-ins isinstance and basestring in the Library Reference and Python in a Nutshell.

1.4. Aligning Strings

Credit: Luther Blissett

Problem

You want to align strings: left, right, or center.

Solution

That’s what the ljust, rjust, and center methods of string objects are for. Each takes a single argument, the width of the string you want as a result, and returns a copy of the starting string with spaces added on either or both sides:

>>> print '|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|'
| hej                  |                  hej |         hej          |

Discussion

Centering, left-justifying, or right-justifying text comes up surprisingly often—for example, when you want to print a simple report with centered page numbers in a monospaced font. Because of this, Python string objects supply this functionality through three of their many methods. In Python 2.3, the padding character is always a space. In Python 2.4, however, while space-padding is still the default, you may optionally call any of these methods with a second argument, a single character to be used for the padding:

>>> print 'hej'.center(20, '+')
++++++++hej+++++++++
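
As a sketch of the simple-report use case mentioned above, the following lays out a small two-column table. The data and column widths are invented, and the fill-character arguments require Python 2.4 or later:

```python
rows = [('apples', 3), ('kiwis', 12), ('figs', 120)]

lines = []
for name, quantity in rows:
    # Names are padded on the right with dots, numbers on the left with spaces.
    lines.append(name.ljust(10, '.') + str(quantity).rjust(5))

title = 'Stock'.center(15, '=')
report = '\n'.join([title] + lines)
```

Every line of report is exactly 15 characters wide, so the columns line up in a monospaced font.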

See Also

The Library Reference section on string methods; Java Cookbook recipe 3.5.

1.5. Trimming Space from the Ends of a String

Credit: Luther Blissett

Problem

You need to work on a string without regard for any extra leading or trailing spaces a user may have typed.

Solution

That’s what the lstrip, rstrip, and strip methods of string objects are for. Each takes no argument and returns a copy of the starting string, shorn of whitespace on either or both sides:

>>> x = '    hej   '
>>> print '|', x.lstrip( ), '|', x.rstrip( ), '|', x.strip( ), '|'
| hej    |     hej | hej |

Discussion

Just as you may need to add space to either end of a string to align that string left, right, or center in a field of fixed width (as covered previously in Recipe 1.4), so may you need to remove all whitespace (blanks, tabs, newlines, etc.) from either or both ends. Because this need is frequent, Python string objects supply this functionality through three of their many methods. Optionally, you may call each of these methods with an argument, a string composed of all the characters you want to trim from either or both ends instead of trimming whitespace characters:

>>> x = 'xyxxyy hejyx  yyx'
>>> print '|'+x.strip('xy')+'|'
| hejyx  |

Note that in these cases the leading and trailing spaces have been left in the resulting string, as have the 'yx' that are followed by spaces: only all the occurrences of 'x' and 'y' at either end of the string have been removed from the resulting string.
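
The one-sided variants accept the same optional argument. A brief sketch using the string from the example above, plus a whitespace case added here for contrast:

```python
x = 'xyxxyy hejyx  yyx'

both_ends = x.strip('xy')     # ' hejyx  '
left_only = x.lstrip('xy')    # ' hejyx  yyx'
right_only = x.rstrip('xy')   # 'xyxxyy hejyx  '

# With no argument, all kinds of whitespace are trimmed.
cleaned = ' \t hej \n'.strip()   # 'hej'
```

The argument is treated as a set of characters rather than as a prefix or suffix, so 'xy' and 'yx' behave identically.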

See Also

The Library Reference section on string methods; Recipe 1.4; Java Cookbook recipe 3.12.

1.6. Combining Strings

Credit: Luther Blissett

Problem

You have several small strings that you need to combine into one larger string.

Solution

To join a sequence of small strings into one large string, use the string method join. Say that pieces is a list whose items are strings, and you want one big string with all the items concatenated in order; then, you should code:

largeString = ''.join(pieces)

To put together pieces stored in a few variables, the string-formatting operator % can often be even handier:

largeString = '%s%s something %s yet more' % (small1, small2, small3)

Discussion

In Python, the + operator concatenates strings and therefore offers seemingly obvious solutions for putting small strings together into a larger one. For example, when you have pieces stored in a few variables, it seems quite natural to code something like:

largeString = small1 + small2 + ' something ' + small3 + ' yet more'

And similarly, when you have a sequence of small strings named pieces, it seems quite natural to code something like:

largeString = ''
for piece in pieces:
    largeString += piece

Or, equivalently, but more fancifully and compactly:

import operator
largeString = reduce(operator.add, pieces, '')

However, it’s very important to realize that none of these seemingly obvious solutions is good: the approaches shown in the “Solution” are vastly superior.

In Python, string objects are immutable. Therefore, any operation on a string, including string concatenation, produces a new string object, rather than modifying an existing one. Concatenating N strings thus involves building and then immediately throwing away each of N-1 intermediate results. Performance is therefore vastly better for operations that build no intermediate results, but rather produce the desired end result at once.

Python’s string-formatting operator % is one such operation, particularly suitable when you have a few pieces (e.g., each bound to a different variable) that you want to put together, perhaps with some constant text in addition. Performance is not a major issue for this specific kind of task. However, the % operator also has other potential advantages, when compared to an expression that uses multiple + operations on strings. % is more readable, once you get used to it. Also, you don’t have to call str on pieces that aren’t already strings (e.g., numbers), because the format specifier %s does so implicitly. Another advantage is that you can use format specifiers other than %s, so that, for example, you can control how many significant digits the string form of a floating-point number should display.
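
A small sketch of these advantages, with made-up values: %s and %d accept non-string items directly, and %.2f fixes the number of digits shown after the decimal point.

```python
name, count, price = 'widget', 3, 2.5

# No explicit str() calls are needed for the numbers.
line = '%s x %d costs %.2f' % (name, count, count * price)

# A precision specifier rounds the displayed value.
pi_text = '%.3f' % 3.14159265
```

Here line is 'widget x 3 costs 7.50' and pi_text is '3.142'. The equivalent expression using + would need str(count) and would give no control over the number of decimals.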

When you have many small string pieces in a sequence, performance can become a truly important issue. The time needed to execute a loop using + or += (or a fancier but equivalent approach using the built-in function reduce) grows with the square of the number of characters you are accumulating, since the time to allocate and fill a large string is roughly proportional to the length of that string. Fortunately, Python offers an excellent alternative. The join method of a string object s takes as its only argument a sequence of strings and produces a string result obtained by concatenating all items in the sequence, with a copy of s joining each item to its neighbors. For example, ''.join(pieces) concatenates all the items of pieces in a single gulp, without interposing anything between them, and ', '.join(pieces) concatenates the items putting a comma and a space between each pair of them. It’s the fastest, neatest, and most elegant and readable way to put a large string together.

When the pieces are not all available at the same time, but rather come in sequentially from input or computation, use a list as an intermediate data structure to hold the pieces (to add items at the end of a list, you can call the append or extend methods of the list). At the end, when the list of pieces is complete, call ''.join(thelist) to obtain the big string that’s the concatenation of all pieces. Of all the many handy tips and tricks I could give you about Python strings, I consider this one by far the most significant: the most frequent reason some Python programs are too slow is that they build up big strings with + or +=. So, train yourself never to do that. Use, instead, the ''.join approach recommended in this recipe.
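
The accumulate-then-join idiom can be sketched like this; the function squares_report and its output format are hypothetical, chosen only to have pieces that arrive one at a time:

```python
def squares_report(n):
    """Build a multi-line report without repeated string concatenation."""
    pieces = []
    for i in range(n):
        # Each piece is computed separately and appended to the list.
        pieces.append('%d squared is %d\n' % (i, i * i))
    # A single join produces the final string in one pass.
    return ''.join(pieces)

report = squares_report(3)
```

The list holds references to the small strings until the very end, so no intermediate big strings are ever built and thrown away.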

Python 2.4 makes a heroic attempt to ameliorate the issue, reducing a little the performance penalty due to such erroneous use of +=. While ''.join is still way faster and in all ways preferable, at least some newbie or careless programmer gets to waste somewhat fewer machine cycles. Similarly, psyco (a specializing just-in-time [JIT] Python compiler found at http://psyco.sourceforge.net/), can reduce the += penalty even further. Nevertheless, ''.join remains the best approach in all cases.

See Also

The Library Reference and Python in a Nutshell sections on string methods, string-formatting operations, and the operator module.

1.7. Reversing a String by Words or Characters

Credit: Alex Martelli

Problem

You want to reverse the characters or words in a string.

Solution

Strings are immutable, so, to reverse one, we need to make a copy. The simplest approach for reversing is to take an extended slice with a “step” of -1, so that the slicing proceeds backwards:

revchars = astring[::-1]

To flip words, we need to make a list of words, reverse it, and join it back into a string with a space as the joiner:

revwords = astring.split( )     # string -> list of words
revwords.reverse( )             # reverse the list in place
revwords = ' '.join(revwords)  # list of strings -> string

or, if you prefer terse and compact “one-liners”:

revwords = ' '.join(astring.split( )[::-1])

If you need to reverse by words while preserving untouched the intermediate whitespace, you can split by a regular expression:

import re
revwords = re.split(r'(\s+)', astring)         # separators too, since '(...)'
revwords.reverse( )        # reverse the list in place
revwords = ''.join(revwords)        # list of strings -> string

Note that the joiner must be the empty string in this case, because the whitespace separators are kept in the revwords list (by using re.split with a regular expression that includes a parenthesized group). Again, you could make a one-liner, if you wished:

revwords = ''.join(re.split(r'(\s+)', astring)[::-1])

but this is getting too dense and unreadable to be good Python code!
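To see why the empty string is the right joiner in the regular-expression version, it may help to look at the intermediate list. The following sketch uses a made-up sample string:

```python
import re

astring = 'to  be or\tnot'
# The parenthesized group makes re.split keep the whitespace runs as items.
parts = re.split(r'(\s+)', astring)
# parts is ['to', '  ', 'be', ' ', 'or', '\t', 'not']
revwords = ''.join(parts[::-1])
# revwords is 'not\tor be  to': each separator stays between the same two words
```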

Discussion

In Python 2.4, you may make the by-word one-liners more readable by using the new built-in function reversed instead of the less readable extended-slicing indicator [::-1]:

revwords = ' '.join(reversed(astring.split( )))
revwords = ''.join(reversed(re.split(r'(\s+)', astring)))

For the by-character case, though, astring[::-1] remains best, even in 2.4, because to use reversed, you’d have to introduce a call to ''.join as well:

revchars = ''.join(reversed(astring))

The new reversed built-in returns an iterator, suitable for looping on or for passing to some “accumulator” callable such as ''.join—it does not return a ready-made string!
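A short illustration of that last point, using an arbitrary sample string: the object returned by reversed has to be consumed by something such as a join before you have a string.

```python
astring = 'flip these words'
rev = reversed(astring.split())
# rev is an iterator, not a list and not a string
is_string = isinstance(rev, str)          # False
revwords = ' '.join(rev)                  # 'words these flip'
revchars = ''.join(reversed(astring))     # same result as astring[::-1]
```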

See Also

Library Reference and Python in a Nutshell docs on sequence types and slicing, and (2.4 only) the reversed built-in; Perl Cookbook recipe 1.6.

1.8. Checking Whether a String Contains a Set of Characters

Credit: Jürgen Hermann, Horst Hansen

Problem

You need to check for the occurrence of any of a set of characters in a string.

Solution

The simplest approach is clear, fast, and general (it works for any sequence, not just strings, and for any container on which you can test for membership, not just sets):

def containsAny(seq, aset):
    """ Check whether sequence seq contains ANY of the items in aset. """
    for c in seq:
        if c in aset: return True
    return False

You can gain a little speed by moving to a higher-level, more sophisticated approach, based on the itertools standard library module, essentially expressing the same approach in a different way:

import itertools
def containsAny(seq, aset):
    for item in itertools.ifilter(aset.__contains__, seq):
        return True
    return False

Discussion

Most problems related to sets are best handled by using the set built-in type introduced in Python 2.4 (if you’re using Python 2.3, you can use the equivalent sets.Set type from the Python Standard Library). However, there are exceptions. Here, for example, a pure set-based approach would be something like:

def containsAny(seq, aset):
    return bool(set(aset).intersection(seq))

However, with this approach, every item in seq inevitably has to be examined. The functions in this recipe’s Solution, on the other hand, “short-circuit”: they return as soon as they know the answer. They must still check every item in seq when the answer is False—we could never affirm that no item in seq is a member of aset without examining all the items, of course. But when the answer is True, we often learn about that very soon, namely as soon as we examine one item that is a member of aset. Whether this matters at all is very data-dependent, of course. It will make no practical difference when seq is short, or when the answer is typically False, but it may be extremely important for a very long seq (when the answer can typically be soon determined to be True).

The first version of containsAny presented in the recipe has the advantage of simplicity and clarity: it expresses the fundamental idea with total transparency. The second version may appear to be “clever”, and that is not a complimentary adjective in the Python world, where simplicity and clarity are core values. However, the second version is well worth considering, because it shows a higher-level approach, based on the itertools module of the standard library. Higher-level approaches are most often preferable to lower-level ones (although the issue is moot in this particular case). itertools.ifilter takes a predicate and an iterable, and yields the items in that iterable that satisfy the “predicate”. Here, as the “predicate”, we use aset.__contains__, the bound method that is internally called when we code in aset for membership testing. So, if ifilter yields anything at all, it yields an item of seq that is also a member of aset, so we can return True as soon as this happens. If we get to the statement following the for, it must mean the return True never executed, because no items of seq are members of aset, so we can return False.
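The short-circuiting can be observed directly. The sketch below assumes Python 3, where itertools.ifilter no longer exists and the built-in filter is lazy instead; it also uses the built-in any, which postdates the Python versions this recipe targets. The names containsAny_filter and tracked are illustrative helpers, not part of the recipe.

```python
def containsAny(seq, aset):
    """ Check whether sequence seq contains ANY of the items in aset. """
    # any() stops consuming the generator at the first True value
    return any(c in aset for c in seq)

def containsAny_filter(seq, aset):
    # Python 3's built-in filter is lazy, like itertools.ifilter in 2.x
    for item in filter(aset.__contains__, seq):
        return True
    return False

examined = []
def tracked(seq):
    # Illustrative wrapper that records every item actually examined
    for c in seq:
        examined.append(c)
        yield c

found = containsAny(tracked('abcdef'), set('bx'))
# found is True, and examined is ['a', 'b']: the scan stopped early
```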

If your application needs some function such as containsAny to check whether a string (or other sequence) contains any members of a set, you may also need such variants as:

def containsOnly(seq, aset):
    """ Check whether sequence seq contains ONLY items in aset. """
    for c in seq:
        if c not in aset: return False
    return True

containsOnly is the same function as containsAny, but with the logic turned upside-down. Other apparently similar tasks don’t lend themselves to short-circuiting (they intrinsically need to examine all items) and so are best tackled by using the built-in type set (in Python 2.4; in 2.3, you can use sets.Set in the same way):

def containsAll(seq, aset):
    """ Check whether sequence seq contains ALL the items in aset. """
    return not set(aset).difference(seq)

If you’re not accustomed to using the set (or sets.Set) method difference, be aware of its semantics: for any set a, a.difference(b) (just like a-set(b)) returns the set of all elements of a that are not in b. For example:

>>> L1 = [1, 2, 3, 3]
>>> L2 = [1, 2, 3, 4]
>>> set(L1).difference(L2)
set([])
>>> set(L2).difference(L1)
set([4])

which hopefully helps explain why:

>>> containsAll(L1, L2)
False
>>> containsAll(L2, L1)
True

(In other words, don’t confuse difference with another method of set, symmetric_difference, which returns the set of all items that are in either argument and not in the other.)
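The contrast between the two methods can be sketched with two small sets (the values are arbitrary):

```python
a = set([1, 2, 3])
b = set([3, 4])

only_in_a = a.difference(b)            # same as a - b
only_in_b = b.difference(a)            # same as b - a
in_one_only = a.symmetric_difference(b)   # same as a ^ b
# difference depends on the order of the operands;
# symmetric_difference does not
```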

When you’re dealing specifically with (plain, not Unicode) strings for both seq and aset, you may not need the full generality of the functions presented in this recipe, and may want to try the more specialized approach explained in Recipe 1.10 based on strings’ method translate and the string.maketrans function from the Python Standard Library. For example:

import string
notrans = string.maketrans('', '')           # identity "translation"
def containsAny(astr, strset):
    return len(strset) != len(strset.translate(notrans, astr))
def containsAll(astr, strset):
    return not strset.translate(notrans, astr)

This somewhat tricky approach relies on strset.translate(notrans, astr) being the subsequence of strset that is made of characters not in astr. When that subsequence has the same length as strset, no characters have been removed by strset.translate, therefore no characters of strset are in astr. Conversely, when the subsequence is empty, all characters have been removed, so all characters of strset are in astr. The translate method keeps coming up naturally when one wants to treat strings as sets of characters, because it’s speedy as well as handy and flexible; see Recipe 1.10 for more details.

These two sets of approaches to the recipe’s tasks have very different levels of generality. The earlier approaches are very general: not at all limited to string processing, they make rather minimal demands on the objects you apply them to. The approach based on the translate method, on the other hand, works only when both astr and strset are strings, or very closely mimic plain strings’ functionality. Not even Unicode strings suffice, because the translate method of Unicode strings has a signature that is different from that of plain strings—a single argument (a dict mapping code numbers to Unicode strings or None) instead of two (both strings).
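For readers on Python 3, where str is the Unicode type, a comparable trick is available through the three-argument form of str.maketrans, whose third argument lists the characters to delete. The following is a sketch under that assumption, not part of the original recipe, and it does not run on Python 2:

```python
def containsAny(astr, strset):
    # One-argument translate: the table maps each character of astr to None
    delete_astr = str.maketrans('', '', astr)
    return len(strset) != len(strset.translate(delete_astr))

def containsAll(astr, strset):
    delete_astr = str.maketrans('', '', astr)
    # Nothing left of strset means every one of its characters is in astr
    return not strset.translate(delete_astr)
```

The logic is unchanged from the plain-string version: what remains of strset after deleting the characters of astr tells you how the two strings overlap.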

See Also

Recipe 1.10; documentation for the translate method of strings and Unicode objects, and maketrans function in the string module, in the Library Reference and Python in a Nutshell; ditto for documentation of built-in set (Python 2.4 only), modules sets and itertools, and the special method __contains__.

1.9. Simplifying Usage of Strings’ translate Method

Credit: Chris Perkins, Raymond Hettinger

Problem

You often want to use the fast code in strings’ translate method, but find it hard to remember in detail how that method and the function string.maketrans work, so you want a handy facade to simplify their use in typical cases.

Solution

The translate method of strings is quite powerful and flexible, as detailed in Recipe 1.10. However, exactly because of that power and flexibility, it may be a nice idea to front it with a “facade” that simplifies its typical use. A little factory function, returning a closure, can do wonders for this kind of task:

import string
def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    trans = string.maketrans(frm, to)
    if keep is not None:
        allchars = string.maketrans('', '')
        delete = allchars.translate(allchars, keep.translate(allchars, delete))
    def translate(s):
        return s.translate(trans, delete)
    return translate

Discussion

I often find myself wanting to use strings’ translate method for any one of a few purposes, but each time I have to stop and think about the details (see Recipe 1.10 for more information about those details). So, I wrote myself a class (later remade into the factory closure presented in this recipe’s Solution) to encapsulate various possibilities behind a simpler-to-use facade. Now, when I want a function that keeps only characters from a given set, I can easily build and use that function:

>>> digits_only = translator(keep=string.digits)
>>> digits_only('Chris Perkins : 224-7992')
'2247992'

It’s similarly simple when I want to remove a set of characters:

>>> no_digits = translator(delete=string.digits)
>>> no_digits('Chris Perkins : 224-7992')
'Chris Perkins : -'

and when I want to replace a set of characters with a single character:

>>> digits_to_hash = translator(frm=string.digits, to='#')
>>> digits_to_hash('Chris Perkins : 224-7992')
'Chris Perkins : ###-####'

While the latter may appear to be a bit of a special case, it is a task that keeps coming up for me every once in a while.

I had to make one arbitrary design decision in this recipe—namely, I decided that the delete parameter “trumps” the keep parameter if they overlap:

>>> trans = translator(delete='abcd', keep='cdef')
>>> trans('abcdefg')
'ef'

For your applications it might be preferable to ignore delete if keep is specified, or, perhaps better, to raise an exception if they are both specified, since it may not make much sense to let them both be given in the same call to translator, anyway. Also: as noted in Recipe 1.8 and Recipe 1.10, the code in this recipe works only for normal strings, not for Unicode strings. See Recipe 1.10 to learn how to code this kind of functionality for Unicode strings, whose translate method is different from that of plain (i.e., byte) strings.
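The stricter alternative mentioned above, raising an exception when both delete and keep are given, might look like the following sketch. It is written for Python 3 bytes objects, on the assumption that bytes.maketrans and the two-argument bytes.translate stand in for string.maketrans and plain strings' translate; the name strict_translator is hypothetical.

```python
def strict_translator(frm=b'', to=b'', delete=b'', keep=None):
    # Hypothetical variant: refuse ambiguous calls rather than let delete win
    if keep is not None and delete:
        raise ValueError("specify either 'delete' or 'keep', not both")
    if len(to) == 1:
        to = to * len(frm)
    trans = bytes.maketrans(frm, to)
    if keep is not None:
        allchars = bytes(range(256))
        # everything that is not in keep must be deleted
        delete = allchars.translate(None, keep)
    def translate(s):
        return s.translate(trans, delete)
    return translate

digits_only = strict_translator(keep=b'0123456789')
# digits_only(b'Chris Perkins : 224-7992') gives b'2247992'
```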

See Also

Recipe 1.10 for a direct equivalent of this recipe’s translator(keep=...), more information on the translate method, and an equivalent approach for Unicode strings; documentation for strings’ translate method, and for the maketrans function in the string module, in the Library Reference and Python in a Nutshell.

1.10. Filtering a String for a Set of Characters

Credit: Jürgen Hermann, Nick Perkins, Peter Cogolo

Problem

Given a set of characters to keep, you need to build a filtering function that, applied to any string s, returns a copy of s that contains only characters in the set.

Solution

The translate method of string objects is fast and handy for all tasks of this ilk. However, to call translate effectively to solve this recipe’s task, we must do some advance preparation. The first argument to translate is a translation table: in this recipe, we do not want to do any translation, so we must prepare a first argument that specifies “no translation”. The second argument to translate specifies which characters we want to delete: since the task here says that we’re given, instead, a set of characters to keep (i.e., to not delete), we must prepare a second argument that gives the set complement—deleting all characters we must not keep. A closure is the best way to do this advance preparation just once, obtaining a fast filtering function tailored to our exact needs:

import string
# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')
def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that 'keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
# emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
# emits: ieieuii

Discussion

The key to understanding this recipe lies in the definitions of the maketrans function in the string module of the Python Standard Library and in the translate method of string objects. translate returns a copy of the string you call it on, replacing each character in it with the corresponding character in the translation table passed in as the first argument and deleting the characters specified in the second argument. maketrans is a utility function to create translation tables. (A translation table is a string t of exactly 256 characters: when you pass t as the first argument of a translate method, each character c of the string on which you call the method is translated in the resulting string into the character t[ord(c)].)
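The t[ord(c)] rule can be checked by hand. This sketch assumes Python 3, using bytes.maketrans and bytes objects as the counterparts of string.maketrans and plain strings; the sample data is arbitrary.

```python
table = bytes.maketrans(b'abc', b'xyz')   # counterpart of string.maketrans
# A translation table always has exactly 256 entries
table_len = len(table)

s = b'a cab'
translated = s.translate(table)
# Same result spelled out by hand; iterating over bytes yields the codes,
# so table[c] here plays the role of t[ord(c)]
by_hand = bytes(table[c] for c in s)
# The optional second argument deletes characters before translating
no_spaces = s.translate(table, b' ')
```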

In this recipe, efficiency is maximized by splitting the filtering task into preparation and execution phases. The string of all characters is clearly reusable, so we build it once and for all as a global variable when this module is imported. That way, we ensure that each filtering function uses the same string-of-all-characters object, not wasting any memory. The string of characters to delete, which we need to pass as the second argument to the translate method, depends on the set of characters to keep, because it must be built as the “set complement” of the latter: we must tell translate to delete every character that we do not want to keep. So, we build the delete-these-characters string in the makefilter factory function. This building is done quite rapidly by using the translate method to delete the “characters to keep” from the string of all characters. The translate method is very fast, as are the construction and execution of these useful little resulting functions. The test code that executes when this recipe runs as a main script shows how to build a filtering function by calling makefilter, bind a name to the filtering function (by simply assigning the result of calling makefilter to a name), then call the filtering function on some strings and print the results.

Incidentally, calling a filtering function with allchars as the argument puts the set of characters being kept into a canonic string form, alphabetically sorted and without duplicates. You can use this idea to code a very simple function to return the canonic form of any set of characters presented as an arbitrary string:

def canonicform(s):
    """ Given a string s, return s's characters as a canonic-form string:
        alphabetized and without duplicates. """
    return makefilter(s)(allchars)
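For comparison only, and not as part of the recipe, a set-based way to reach the same canonic form is sketched here. It relies on the built-in sorted (available from Python 2.4), and the name canonicform_set is hypothetical:

```python
def canonicform_set(s):
    """ Hypothetical set-based equivalent of canonicform. """
    # set() drops duplicates, sorted() orders the characters by code
    return ''.join(sorted(set(s)))

canon = canonicform_set('hello world')
# canon is ' dehlorw'
```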

The Solution uses a def statement to make the nested function (closure) it returns, because def is the most normal, general, and clear way to make functions. If you prefer, you could use lambda instead, changing the def and return statements in function makefilter into just one return lambda statement:

    return lambda s: s.translate(allchars, delchars)

Most Pythonistas, but not all, consider using def clearer and more readable than using lambda.

Since this recipe deals with strings seen as sets of characters, you could alternatively use the sets.Set type (or, in Python 2.4, the new built-in set type) to perform the same tasks. Thanks to the translate method’s power and speed, it’s often faster to work directly on strings, rather than go through sets, for tasks of this ilk. However, just as noted in Recipe 1.8, the functions in this recipe only work for normal strings, not for Unicode strings.

To solve this recipe’s task for Unicode strings, we must do some very different preparation. A Unicode string’s translate method takes only one argument: a mapping or sequence, which is indexed with the code number of each character in the string. Characters whose codes are not keys in the mapping (or indices in the sequence) are just copied over to the output string. Otherwise, the value corresponding to each character’s code must be either a Unicode string (which is substituted for the character) or None (in which case the character is deleted). A very nice and powerful arrangement, but unfortunately not one that’s identical to the way plain strings work, so we must recode.
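The three possible outcomes for each character (copied, substituted, deleted) can be seen in a tiny example. The sketch assumes Python 3, where every str is a Unicode string and translate takes this single mapping argument:

```python
table = {
    ord('o'): '0',      # a string value is substituted for the character
    ord('l'): None,     # None deletes the character
}                       # codes that are not keys are copied unchanged
result = 'hello world'.translate(table)
# result is 'he0 w0rd'
```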

Normally, we use either a dict or a list as the argument to a Unicode string’s translate method to translate some characters and/or delete some. But for the specific task of this recipe (i.e., keep just some characters, delete all others), we might need an inordinately large dict or list, just mapping all other characters to None. It’s better to code, instead, a little class that appropriately implements a __getitem__ method (the special method that gets called in indexing operations). Once we’re going to the (slight) trouble of coding a little class, we might as well make its instances callable and have makefilter be just a synonym for the class itself:

import sets
class Keeper(object):
    def __init__(self, keep):
        self.keep = sets.Set(map(ord, keep))
    def __getitem__(self, n):
        if n not in self.keep:
            return None
        return unichr(n)
    def __call__(self, s):
        return unicode(s).translate(self)
makefilter = Keeper
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels(u'four score and seven years ago')
# emits: ouoeaeeyeaao
    print just_vowels(u'tiger, tiger burning bright')
# emits: ieieuii

We might name the class itself makefilter, but, by convention, one normally names classes with an uppercase initial; there is essentially no cost in following that convention here, too, so we did.

See Also

Recipe 1.8; documentation for the translate method of strings and Unicode objects, and maketrans function in the string module, in the Library Reference and Python in a Nutshell.

1.11. Checking Whether a String Is Text or Binary

Credit: Andrew Dalke

Problem

Python can use a plain string to hold either text or arbitrary bytes, and you need to determine (heuristically, of course: there can be no precise algorithm for this) which of the two cases holds for a certain string.

Solution

We can use the same heuristic criteria as Perl does, deeming a string binary if it contains any nulls or if more than 30% of its characters have the high bit set (i.e., codes greater than 126) or are strange control codes. We have to code this ourselves, but this also means we easily get to tweak the heuristics for special application needs:

from __future__ import division           # ensure / does NOT truncate
import string
text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
_null_trans = string.maketrans("", "")
def istext(s, text_characters=text_characters, threshold=0.30):
    # if s contains any null, it's not text:
    if "\0" in s:
        return False
    # an "empty" string is "text" (arbitrary but reasonable choice):
    if not s:
        return True
    # Get the substring of s made up of non-text characters
    t = s.translate(_null_trans, text_characters)
    # s is 'text' if at most 30% (the threshold) of its characters are non-text ones:
    return len(t)/len(s) <= threshold

Discussion

You can easily do minor customizations to the heuristics used by function istext by passing in specific values for the threshold, which defaults to 0.30 (30%), or for the string of those characters that are to be deemed “text” (which defaults to normal ASCII characters plus the four “normal” control characters, meaning ones that are often found in text). For example, if you expected Italian text encoded as ISO-8859-1, you could add the accented letters used in Italian, "àèéìòù", to the text_characters argument.
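The Italian customization can be sketched as follows. This version is a Python 3 port for bytes objects, which is an assumption of the sketch rather than the recipe's own code; the sample phrase and the name italian_chars are illustrative.

```python
text_characters = bytes(range(32, 127)) + b"\n\r\t\b"

def istext(s, text_characters=text_characters, threshold=0.30):
    # Python 3 sketch of the Solution's function, for bytes objects
    if b"\0" in s:
        return False
    if not s:
        return True
    # passing None as the table deletes characters without translating
    t = s.translate(None, text_characters)
    return len(t) / len(s) <= threshold

# 4 of these 12 characters are accented, which is above the 30% threshold
sample = "più là è già".encode("iso-8859-1")
italian_chars = text_characters + "àèéìòù".encode("iso-8859-1")

default_verdict = istext(sample)                    # False
italian_verdict = istext(sample, italian_chars)     # True
```

A looser threshold, such as istext(sample, threshold=0.40), is the other knob you could turn for the same data.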

Often, what you need to check as being either binary or text is not a string, but a file. Again, we can use the same heuristics as Perl, checking just the first block of the file with the istext function shown in this recipe’s Solution:

def istextfile(filename, blocksize=512, **kwds):
    return istext(open(filename).read(blocksize), **kwds)

Note that, by default, the expression len(t)/len(s) used in the body of function istext would truncate the result to 0, since it is a division between integer numbers. In some future version (probably Python 3.0, a few years away), Python will change the meaning of the / operator so that it performs division without truncation—if you really do want truncation, you should use the truncating-division operator, //.

However, Python has not yet changed the semantics of division, keeping the old one by default in order to ensure backwards compatibility. It’s important that the millions of lines of code of Python programs and modules that already exist keep running smoothly under all new 2.x versions of Python—only upon a change of major language version number, no more often than every decade or so, is Python allowed to change in ways that aren’t backwards-compatible.

Since, in the small module containing this recipe’s Solution, it’s handy for us to get the division behavior that is scheduled for introduction in some future release, we start our module with the statement:

from __future__ import division

This statement doesn’t affect the rest of the program, only the specific module that starts with this statement; throughout this module, / performs “true division” (without truncation). As of Python 2.3 and 2.4, division is the only thing you may want to import from __future__. Other features that used to be scheduled for the future, nested_scopes and generators, are now part of the language and cannot be turned off; it’s innocuous to import them, but it makes sense to do so only if your program also needs to run under some older version of Python.
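The difference between the two operators is easy to see with small integers. The values below are what you get under true division, that is, in a 2.x module that starts with the statement shown above, or in Python 3, where true division is the default:

```python
true_quotient = 7 / 2       # 3.5: true division
floor_quotient = 7 // 2     # 3: truncating division
negative_floor = -7 // 2    # -4: // rounds toward negative infinity
ratio = len("ab") / len("abcdefg")   # about 0.2857, rather than 0
```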

See Also

Recipe 1.10 for more details about function maketrans and string method translate; Language Reference for details about true versus truncating division.

1.12. Controlling Case

Credit: Luther Blissett

Problem

You need to convert a string from uppercase to lowercase, or vice versa.

Solution

That’s what the upper and lower methods of string objects are for. Each takes no arguments and returns a copy of the string in which each letter has been changed to upper- or lowercase, respectively.

big = little.upper( )
little = big.lower( )

Characters that are not letters are copied unchanged.

s.capitalize is similar to s[:1].upper( )+s[1:].lower( ): the first character is changed to uppercase, and all others are changed to lowercase. s.title is again similar, but it capitalizes the first letter of each word (where a “word” is a sequence of letters) and uses lowercase for all other letters:

>>> print 'one tWo thrEe'.capitalize( )
One two three
>>> print 'one tWo thrEe'.title( )
One Two Three

Discussion

Case manipulation of strings is a very frequent need. Because of this, several string methods let you produce case-altered copies of strings. Moreover, you can also check whether a string object is already in a given case form, with the methods isupper, islower, and istitle, which all return True if the string is not empty, contains at least one letter, and already meets the uppercase, lowercase, or titlecase constraints. There is no analogous iscapitalized method, and coding it is not trivial, if we want behavior that’s strictly similar to strings’ is... methods. Those methods all return False for an “empty” string, and the three case-checking ones also return False for strings that, while not empty, contain no letters at all.

The simplest and clearest way to code iscapitalized is:

def iscapitalized(s):
    return s == s.capitalize( )

However, this version deviates from the boundary-case semantics of the analogous is... methods, since it also returns True for strings that are empty or contain no letters. Here’s a stricter one:

import string
notrans = string.maketrans('', '')  # identity "translation"
def containsAny(str, strset):
    return len(strset) != len(strset.translate(notrans, str))
def iscapitalized(s):
    return s == s.capitalize( ) and containsAny(s, string.letters)

Here, we use the function shown in Recipe 1.8 to ensure we return False if s is empty or contains no letters. As noted in Recipe 1.8, this means that this specific version works only for plain strings, not for Unicode ones.
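If you need the same boundary-case behavior for Unicode strings, one possible variant replaces the translate-based helper with a generator expression over the isalpha method. This is a sketch that assumes Python 3 (where str is Unicode) and the built-in any, neither of which the recipe itself relies on:

```python
def iscapitalized(s):
    # any(...) is False for an empty string and for a string with no letters
    return s == s.capitalize() and any(c.isalpha() for c in s)
```

With this version, iscapitalized('Hello world') is True, while '', '123', 'hello', and 'Hello World' all give False.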

See Also

Library Reference and Python in a Nutshell docs on string methods; Perl Cookbook recipe 1.9; Recipe 1.8.

1.13. Accessing Substrings

Credit: Alex Martelli

Problem

You want to access portions of a string. For example, you’ve read a fixed-width record and want to extract the record’s fields.

Solution

Slicing is great, but it only does one field at a time:

afield = theline[3:8]

If you need to think in terms of field lengths, struct.unpack may be appropriate. For example:

import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)

If you want to skip rather than get "all the rest", then just unpack the initial part of theline with the right length:

l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])

If you need to split at five-byte boundaries, you can easily code a list comprehension (LC) of slices:

fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]

Chopping a string into individual characters is of course easier:

chars = list(theline)

If you prefer to think of your data as being cut up at specific columns, slicing with LCs is generally handier:

cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]

The call to zip in this LC returns a list of pairs of the form (cuts[k], cuts[k+1]), except that the first pair is (0, cuts[0]), and the last one is (cuts[len(cuts)-1], None). In other words, each pair gives the right (i, j) for slicing between each cut and the next, except that the first one is for the slice before the first cut, and the last one is for the slice from the last cut to the end of the string. The rest of the LC just uses these pairs to cut up the appropriate slices of theline.
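Printing the pairs makes the mechanics concrete. The sample line below is arbitrary, and list is wrapped around zip so that the snippet behaves the same on versions of Python where zip returns an iterator:

```python
theline = 'abcdefghijklmnopqrstuvwxyz0123456789'
cuts = [8, 14, 20, 26, 30]

pairs = list(zip([0] + cuts, cuts + [None]))
# pairs is [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]
pieces = [theline[i:j] for i, j in pairs]
# pieces is ['abcdefgh', 'ijklmn', 'opqrst', 'uvwxyz', '0123', '456789']
```

The final None works because theline[30:None] means the same as theline[30:].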

Discussion

This recipe was inspired by recipe 1.1 in the Perl Cookbook. Python’s slicing takes the place of Perl’s substr. Perl’s built-in unpack and Python’s struct.unpack are similar. Perl’s is slightly richer, since it accepts a field length of * for the last field to mean all the rest. In Python, we have to compute and insert the exact length for either extraction or skipping. This isn’t a major issue because such extraction tasks will usually be encapsulated into small functions. Memoizing, also known as automatic caching, may help with performance if the function is called repeatedly, since it allows you to avoid redoing the preparation of the format for the struct unpacking. See Recipe 18.5 for details about memoizing.

In a purely Python context, the point of this recipe is to remind you that struct.unpack is often viable, and sometimes preferable, as an alternative to string slicing (not quite as often as unpack versus substr in Perl, given the lack of a *-valued field length, but often enough to be worth keeping in mind).

Each of these snippets is, of course, best encapsulated in a function. Among other advantages, encapsulation ensures we don’t have to work out the computation of the last field’s length on each and every use. This function is the equivalent of the first snippet using struct.unpack in the “Solution”:

def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

A design decision worth noticing (and, perhaps, worth criticizing) is that of having a lastfield=False optional parameter. This reflects the observation that, while we often want to skip the last, unknown-length subfield, sometimes we want to retain it instead. The use of lastfield in the expression lastfield and "s" or "x" (equivalent to C’s ternary operator lastfield?"s":"x") saves an if/else, but it’s unclear whether the saving is worth the obscurity. See Recipe 18.9 for more about simulating ternary operators in Python.
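For readers on later versions of Python, the and/or idiom can be replaced by a conditional expression (added to the language in Python 2.5, so not available in the versions this recipe targets). The sketch below also assumes Python 3, where struct works on bytes objects; the sample record is made up.

```python
import struct

def fields(baseformat, theline, lastfield=False):
    numremain = len(theline) - struct.calcsize(baseformat)
    # conditional expression instead of: lastfield and "s" or "x"
    lastcode = "s" if lastfield else "x"
    format = "%s %d%s" % (baseformat, numremain, lastcode)
    return struct.unpack(format, theline)

record = b"ABCDE---12345678abcdefghTAIL"
skipped = fields("5s 3x 8s 8s", record)
# skipped is (b'ABCDE', b'12345678', b'abcdefgh')
kept = fields("5s 3x 8s 8s", record, lastfield=True)
# kept has one more item, b'TAIL'
```

The and/or form is safe in this particular function only because "s" is a true value; a conditional expression does not depend on that.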

If function fields is called in a loop, memoizing (caching) with a key that is the tuple (baseformat, len(theline), lastfield) may offer faster performance. Here’s a version of fields with memoizing:

def fields(baseformat, theline, lastfield=False, _cache={  }):
    # build the key and try getting the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # no format string was cached, build and cache it
        numremain = len(theline)-struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

The idea behind this memoizing is to perform the somewhat costly preparation of format only once for each set of arguments requiring that preparation, thereafter storing it in the _cache dictionary. Of course, like all optimizations, memoizing needs to be validated by measuring performance to check that each given optimization does actually speed things up. In this case, I measure an increase in speed of approximately 30% to 40% for the memoized version, meaning that the optimization is probably not worth the bother unless the function is part of a performance bottleneck for your program.

The function equivalent of the next LC snippet in the solution is:

def split_by(theline, n, lastfield=False):
    # cut up all the needed pieces
    pieces = [theline[k:k+n] for k in xrange(0, len(theline), n)]
    # drop the last piece if too short and not required
    if not lastfield and pieces and len(pieces[-1]) < n:
        pieces.pop( )
    return pieces

And for the last snippet:

def split_at(theline, cuts, lastfield=False):
    # cut up all the needed pieces
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
    # drop the last piece if not required
    if not lastfield:
        pieces.pop( )
    return pieces

In both of these cases, a list comprehension doing slicing turns out to be slightly preferable to the use of struct.unpack.

A completely different approach is to use generators, such as:

def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]
def split_by(the_line, n, lastfield=False):
    return split_at(the_line, xrange(n, len(the_line), n), lastfield)

Generator-based approaches are particularly appropriate when all you need to do on the sequence of resulting fields is loop over it, either explicitly, or implicitly by calling on it some “accumulator” callable such as ''.join. If you do need to materialize a list of the fields, and what you have available is a generator instead, you only need to call the built-in list on the generator, as in:

list_of_fields = list(split_by(the_line, 5))
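Both styles of consumption can be sketched in a few lines. The version below is adapted to Python 3 (an assumption on our part: range stands in for xrange), and the sample record is made up for illustration:

```python
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    # range replaces the original xrange (Python 3 assumption)
    return split_at(the_line, range(n, len(the_line), n), lastfield)

line = 'abcdefghijkl'          # hypothetical 12-character record

# implicit loop: hand the generator straight to an accumulator
dashed = '-'.join(split_by(line, 5, lastfield=True))   # 'abcde-fghij-kl'

# materialize the fields only when a real list is needed
fields = list(split_by(line, 5))                       # ['abcde', 'fghij']
```

Nothing is sliced until the consumer asks for the next field, so a loop that stops early never pays for the fields it does not use.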

See Also

Recipe 18.9 and Recipe 18.5; Perl Cookbook recipe 1.1.

1.14. Changing the Indentation of a Multiline String

Credit: Tom Good

Problem

You have a string made up of multiple lines, and you need to build another string from it, adding or removing leading spaces on each line so that the indentation of each line is some absolute number of spaces.

Solution

The methods of string objects are quite handy, and let us write a simple function to perform this task:

def reindent(s, numSpaces):
    leading_space = numSpaces * ' '
    lines = [ leading_space + line.strip( )
              for line in s.splitlines( ) ]
    return '\n'.join(lines)

Discussion

When working with text, it may be necessary to change the indentation level of a block. This recipe’s code adds leading spaces to or removes them from each line of a multiline string so that the indentation level of each line matches some absolute number of spaces. For example:

>>> x = """  line one
...     line two
...  and line three
... """
>>> print x
  line one
    line two
 and line three
>>> print reindent(x, 4)
    line one
    line two
    and line three

Even if the lines in s are initially indented differently, this recipe makes their indentation homogeneous, which is sometimes what we want, and sometimes not. A frequent need is to adjust the amount of leading spaces in each line, so that the relative indentation of each line in the block is preserved. This is not difficult for either positive or negative values of the adjustment. However, negative values need a check to ensure that no nonspace characters are snipped from the start of the lines. Thus, we may as well split the functionality into two functions to perform the transformations, plus one to measure the number of leading spaces of each line and return the result as a list:

def addSpaces(s, numAdd):
    white = " "*numAdd
    return white + white.join(s.splitlines(True))
def numSpaces(s):
    return [len(line)-len(line.lstrip( )) for line in s.splitlines( )]
def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError, "removing more spaces than there are!"
    return '\n'.join([ line[numDel:] for line in s.splitlines( ) ])

All of these functions rely on the string method splitlines, which is similar to a split on '\n'. splitlines has the extra ability to leave the trailing newline on each line (when you call it with True as its argument). Sometimes this turns out to be handy: addSpaces could not be quite as short and sweet without this ability of the splitlines string method.

Here’s how we can combine these functions to build another function to delete just enough leading spaces from each line to ensure that the least-indented line of the block becomes flush left, while preserving the relative indentation of the lines:

def unIndentBlock(s):
    return delSpaces(s, min(numSpaces(s)))
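To see how the pieces fit together, here is a small sketch of the same functions written for Python 3 (an assumption: the only change is the raise statement, which uses call syntax). The sample block is invented for the demonstration:

```python
def addSpaces(s, numAdd):
    white = " " * numAdd
    return white + white.join(s.splitlines(True))

def numSpaces(s):
    return [len(line) - len(line.lstrip()) for line in s.splitlines()]

def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError("removing more spaces than there are!")
    return '\n'.join([line[numDel:] for line in s.splitlines()])

def unIndentBlock(s):
    return delSpaces(s, min(numSpaces(s)))

block = "  if x:\n      y = 1\n  z = 2"   # made-up sample block

indents = numSpaces(block)      # [2, 6, 2]
flush = unIndentBlock(block)    # 'if x:\n    y = 1\nz = 2'
deeper = addSpaces(block, 2)    # every line gains two leading spaces
```

The relative indentation (four extra spaces on the middle line) survives both transformations, which is exactly what reindent does not guarantee.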

See Also

Library Reference and Python in a Nutshell docs on sequence types.

1.15. Expanding and Compressing Tabs

Credit: Alex Martelli, David Ascher

Problem

You want to convert tabs in a string to the appropriate number of spaces, or vice versa.

Solution

Changing tabs to the appropriate number of spaces is a reasonably frequent task, easily accomplished with Python strings’ expandtabs method. Because strings are immutable, the method returns a new string object, a modified copy of the original one. However, it’s easy to rebind a string variable name from the original to the modified-copy value:

mystring = mystring.expandtabs( )

This doesn’t change the string object to which mystring originally referred, but it does rebind the name mystring to a newly created string object, a modified copy of mystring in which tabs are expanded into runs of spaces. expandtabs, by default, uses a tab length of 8; you can pass expandtabs an integer argument to use as the tab length.
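A tiny sketch may make the effect of the tab-length argument concrete. The sample string is arbitrary, and the snippet runs unchanged on Python 3:

```python
s = 'a\tbc\td'

default = s.expandtabs()    # tab stops every 8 columns
narrow = s.expandtabs(4)    # tab stops every 4 columns

# each tab is replaced by just enough spaces to reach the next tab stop
assert default == 'a' + ' ' * 7 + 'bc' + ' ' * 6 + 'd'
assert narrow == 'a' + ' ' * 3 + 'bc' + ' ' * 2 + 'd'

# the original string object is untouched
assert s == 'a\tbc\td'
```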

Changing spaces into tabs is a rare and peculiar need. Compression, if that’s what you’re after, is far better performed in other ways, so Python doesn’t offer a built-in way to “unexpand” spaces into tabs. We can, of course, write our own function for the purpose. String processing tends to be fastest in a split/process/rejoin approach, rather than with repeated overall string transformations:

def unexpand(astring, tablen=8):
    import re
    # split into alternating space and non-space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    # keep track of the total length of the string so far
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        if piece.isspace( ):
            # change each space sequence into tabs+spaces; a run that
            # crosses no tab stop is kept unchanged
            numblanks = min(lensofar % tablen, thislen)
            numtabs = (thislen-numblanks+tablen-1)//tablen
            pieces[i] = '\t'*numtabs + ' '*numblanks
    return ''.join(pieces)

Function unexpand, as written in this example, works only for a single-line string; to deal with a multi-line string, use ''.join([ unexpand(s) for s in astring.splitlines(True) ]).

Discussion

While regular expressions are never indispensable for the purpose of manipulating strings in Python, they are occasionally quite handy. Function unexpand, as presented in the recipe, for example, takes advantage of one extra feature of re.split with respect to string’s split method: when the regular expression contains a (parenthesized) group, re.split returns a list where the split pieces are interleaved with the “splitter” pieces. So, here, we get alternate runs of nonblanks and blanks as items of list pieces; the for loop keeps track of the length of string it has seen so far, and changes pieces that are made of blanks to as many tabs as possible, plus as many blanks as are needed to maintain the overall length.
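The difference that the parenthesized group makes is easy to see on a short, made-up line (this sketch runs on Python 3 as well as on older versions):

```python
import re

line = 'ab   cd e'

# without a group, the separators are discarded
without_group = re.split(r' +', line)     # ['ab', 'cd', 'e']

# with a group, each separator is kept as its own list item
with_group = re.split(r'( +)', line)      # ['ab', '   ', 'cd', ' ', 'e']

# so the pieces can be rejoined into exactly the original string
assert ''.join(with_group) == line
```

Because nothing is lost in the split, a function such as unexpand can rewrite only the blank pieces and then rejoin everything.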

Some programming tasks that could still be described as expanding tabs are unfortunately not quite as easy as just calling the expandtabs method. A category that does happen with some regularity is to fix Python source files, which use a mix of tabs and spaces for indentation (a very bad idea), so that they instead use spaces only (which is the best approach). This could entail extra complications, for example, when you need to guess the tab length (and want to end up with the standard four spaces per indentation level, which is strongly advisable). It can also happen when you need to preserve tabs that are inside strings, rather than tabs being used for indentation (because somebody erroneously used actual tabs, rather than '\t', to indicate tabs in strings), or even because you’re asked to treat docstrings differently from other strings. Some cases are not too bad—for example, when you want to expand tabs that occur only within runs of whitespace at the start of each line, leaving any other tab alone. A little function using a regular expression suffices:

def expand_at_linestart(P, tablen=8):
    import re
    def exp(mo):
        return mo.group( ).expandtabs(tablen)
    return ''.join([ re.sub(r'^\s+', exp, s) for s in P.splitlines(True) ])

This function expand_at_linestart exploits the re.sub function, which looks for a regular expression in a string and, each time it gets a match, calls a function, passing the match object as the argument, to obtain the string to substitute in place of the match. For convenience, expand_at_linestart is coded to deal with a multiline string argument P, performing the list comprehension over the results of the splitlines call, and the ''.join of the whole. Of course, this convenience does not stop the function from being able to deal with a single-line P.
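Here is a minimal sketch of the callback mechanism at work, written for Python 3. It repeats the function so that the snippet is self-contained; the sample source text is invented, and it contains a tab inside a string literal precisely to show that such a tab is left alone:

```python
import re

def expand_at_linestart(P, tablen=8):
    def exp(mo):
        # mo.group() is the run of leading whitespace that matched
        return mo.group().expandtabs(tablen)
    return ''.join([re.sub(r'^\s+', exp, s) for s in P.splitlines(True)])

source = '\tx = "a\tb"\n\t\treturn x\n'   # hypothetical tab-indented code
fixed = expand_at_linestart(source, 4)

# leading tabs became spaces; the tab inside the string literal survived
assert fixed == '    x = "a\tb"\n        return x\n'
```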

If your specifications regarding which tabs are to be expanded are even more complex, such as needing to deal differently with tabs depending on whether they’re inside or outside of strings, and on whether or not strings are docstrings, at the very least, you need to perform a tokenization. In addition, you may also have to perform a full parse of the source code you’re dealing with, rather than using simple string or regular-expression operations. If this is the case, you can expect a substantial amount of work. Some beginning pointers to help you get started may be found in Chapter 16.

If you ever find yourself sweating out this kind of task, you will no doubt get excellent motivation in the future for following the normal and recommended Python style in the source code you write or edit: only spaces, four per indentation level, no tabs, and always '\t', never an actual tab character, to include a tab in a string literal. Your favorite editor can no doubt be told to enforce all of these conventions whenever a Python source file is saved; the editor that comes with IDLE (the free integrated development environment that comes with Python), for example, supports these conventions. It is much easier to arrange your editor so that the problem never arises, rather than striving to fix it after the fact!

See Also

Documentation for the expandtabs method of strings in the “Sequence Types” section of the Library Reference; Perl Cookbook recipe 1.7; Library Reference and Python in a Nutshell documentation of module re.

1.16. Interpolating Variables in a String

Credit: Scott David Daniels

Problem

You need a simple way to get a copy of a string where specially marked substrings are replaced with the results of looking up the substrings in a dictionary.

Solution

Here is a solution that works in Python 2.3 as well as in 2.4:

def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w): return d.get(w, w.join(marker*2))
    else:
        def lookup(w): return d[w]
    parts = format.split(marker)
    parts[1::2] = map(lookup, parts[1::2])
    return ''.join(parts)
if __name__ == '__main__':
    print expand('just "a" test', {'a': 'one'})
# emits: just one test

When the parameter safe is False, the default, every marked substring must be found in dictionary d, otherwise expand terminates with a KeyError exception. When parameter safe is explicitly passed as True, marked substrings that are not found in the dictionary are just left intact in the output string.

Discussion

The code in the body of the expand function has some points of interest. It defines one of two different nested functions (with the name of lookup either way), depending on whether the expansion is required to be safe. Safe means no KeyError exception gets raised for marked strings not found in the dictionary. If not required to be safe (the default), lookup just indexes into dictionary d and raises an error if the substring is not found. But, if lookup is required to be “safe”, it uses d’s method get and supplies as the default the substring being looked up, with a marker on either side. In this way, by passing safe as True, you may choose to have unknown formatting markers come right through to the output rather than raising exceptions. marker+w+marker would be an obvious alternative to the chosen w.join(marker*2), but I’ve chosen the latter exactly to display a non-obvious but interesting way to construct such a quoted string.

With either version of lookup, expand operates according to the split/modify/join idiom that is so important for Python string processing. The modify part, in expand’s case, makes use of the possibility of accessing and modifying a list’s slice with a “step” or “stride”. Specifically, expand accesses and rebinds all of those items of parts that lie at an odd index, because those items are exactly the ones that were enclosed between a pair of markers in the original format string. Therefore, they are the marked substrings that may be looked up in the dictionary.
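The stride-based slice assignment is perhaps easiest to appreciate in isolation. In this sketch the sample string is made up, and upper-casing stands in for the dictionary lookup:

```python
parts = 'x "a" y "b" z'.split('"')
# parts is now ['x ', 'a', ' y ', 'b', ' z']: the marked substrings
# sit at the odd indices 1 and 3

marked = parts[1::2]                        # ['a', 'b']

# rebind exactly those items, leaving the even-indexed ones alone
parts[1::2] = [w.upper() for w in marked]

result = ''.join(parts)                     # 'x A y B z'
```

Note that an assignment to a slice with a stride must supply exactly as many items as the slice selects, which is always the case here.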

The syntax of format strings accepted by this recipe’s function expand is more flexible than the $-based syntax of string.Template. You can specify a different marker when you want your format string to contain double quotes, for example. There is no constraint for each specially marked substring to be an identifier, so you can easily interpolate Python expressions (with a d whose __getitem__ performs an eval) or any other kind of placeholder. Moreover, you can easily get slightly different, useful effects. For example:

print expand('just "a" ""little"" test', {'a' : 'one', '' : '"'})

emits just one "little" test. Advanced users can customize Python 2.4’s string.Template class, by inheritance, to match all of these capabilities, and more, but this recipe’s little expand function is still simpler to use in some flexible ways.
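As for interpolating Python expressions, one way to do it might look like the following sketch, written for Python 3. The class EvalDict is hypothetical (it is not part of the recipe), and expand is repeated here with a list comprehension in place of map so that the snippet stands alone. Since eval runs arbitrary code, this is suitable only for format strings you fully trust:

```python
def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w): return d.get(w, w.join(marker*2))
    else:
        def lookup(w): return d[w]
    parts = format.split(marker)
    parts[1::2] = [lookup(w) for w in parts[1::2]]
    return ''.join(parts)

class EvalDict(object):
    """Hypothetical mapping that evaluates each key as an expression."""
    def __init__(self, namespace):
        self.namespace = namespace
    def __getitem__(self, expr):
        # WARNING: eval executes arbitrary code; trusted input only
        return str(eval(expr, {}, self.namespace))

text = expand('area is |w*h| units', EvalDict({'w': 3, 'h': 4}), marker='|')
# text == 'area is 12 units'
```

Only __getitem__ is needed as long as safe is left at its default of False; the safe variant would also require a get method.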

See Also

Library Reference docs for string.Template (Python 2.4, only), the section on sequence types (for string methods split and join, and for slicing operations), and the section on dictionaries (for indexing and the get method). For more information on Python 2.4’s string.Template class, see Recipe 1.17.

1.17. Interpolating Variables in a String in Python 2.4

Credit: John Nielsen, Lawrence Oluyede, Nick Coghlan

Problem

Using Python 2.4, you need a simple way to get a copy of a string where specially marked identifiers are replaced with the results of looking up the identifiers in a dictionary.

Solution

Python 2.4 offers the new string.Template class for this purpose. Here is a snippet of code showing how to use that class:

import string
# make a template from a string where some identifiers are marked with $
new_style = string.Template('this is $thing')
# use the substitute method of the template with a dictionary argument:
print new_style.substitute({'thing':5})      # emits: this is 5
print new_style.substitute({'thing':'test'}) # emits: this is test
# alternatively, you can pass keyword-arguments to 'substitute':
print new_style.substitute(thing=5)          # emits: this is 5
print new_style.substitute(thing='test')     # emits: this is test

Discussion

In Python 2.3, a format string for identifier-substitution has to be expressed in a less simple format:

old_style = 'this is %(thing)s'

with the identifier in parentheses after a %, and an s right after the closed parenthesis. Then, you use the % operator, with the format string on the left of the operator, and a dictionary on the right:

print old_style % {'thing':5}      # emits: this is 5
print old_style % {'thing':'test'} # emits: this is test

Of course, this code keeps working in Python 2.4, too. However, the new string.Template class offers a simpler alternative.

When you build a string.Template instance, you may include a dollar sign ($) by doubling it, and you may have the interpolated identifier immediately followed by letters or digits by enclosing it in curly braces ({ }). Here is an example that requires both of these refinements:

form_letter = '''Dear $customer,
I hope you are having a great time.
If you do not find Room $room to your satisfaction,
let us know. Please accept this $$5 coupon.
            Sincerely,
            $manager
            ${name}Inn'''
letter_template = string.Template(form_letter)
print letter_template.substitute({'name':'Sleepy', 'customer':'Fred Smith',
                                  'manager':'Barney Mills', 'room':307,
                                 })

This snippet emits the following output:

Dear Fred Smith,
I hope you are having a great time.
If you do not find Room 307 to your satisfaction,
let us know. Please accept this $5 coupon.
            Sincerely,
            Barney Mills
            SleepyInn

Sometimes, the handiest way to prepare a dictionary to be used as the argument to the substitute method is to set local variables, and then pass as the argument locals( ) (the artificial dictionary whose keys are the local variables, each with its value associated):

msg = string.Template('the square of $number is $square')
for number in range(10):
    square = number * number
    print msg.substitute(locals( ))

Another handy alternative is to pass the values to substitute using keyword argument syntax rather than a dictionary:

msg = string.Template('the square of $number is $square')
for i in range(10):
    print msg.substitute(number=i, square=i*i)

You can even pass both a dictionary and keyword arguments:

msg = string.Template('the square of $number is $square')
for number in range(10):
    print msg.substitute(locals( ), square=number*number)

In case of any conflict between entries in the dictionary and the values explicitly passed as keyword arguments, the keyword arguments take precedence. For example:

msg = string.Template('an $adj $msg')
adj = 'interesting'
print msg.substitute(locals( ), msg='message')
# emits an interesting message
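When a placeholder has no corresponding value, substitute raises a KeyError. The string.Template class also offers a safe_substitute method, which leaves unknown placeholders in place, much like the safe parameter of Recipe 1.16. A brief sketch, written for Python 3, with a made-up template:

```python
import string

letter = string.Template('Dear $customer, your room is $room')

# substitute insists on a value for every placeholder
try:
    letter.substitute(customer='Fred Smith')
    missing = None
except KeyError as err:
    missing = err.args[0]        # name of the placeholder that was not found

# safe_substitute fills in what it can and leaves the rest untouched
partial = letter.safe_substitute(customer='Fred Smith')
# partial == 'Dear Fred Smith, your room is $room'
```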

See Also

Library Reference docs for string.Template (2.4 only) and the locals built-in function.

1.18. Replacing Multiple Patterns in a Single Pass

Credit: Xavier Defrang, Alex Martelli

Problem

You need to perform several string substitutions on a string.

Solution

Sometimes regular expressions afford the fastest solution even in cases where their applicability is not obvious. The powerful sub method of re objects (from the re module in the standard library) makes regular expressions particularly good at performing string substitutions. Here is a function returning a modified copy of an input string, where each occurrence of any string that’s a key in a given dictionary is replaced by the corresponding value in the dictionary:

import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

Discussion

This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let’s say you have a dictionary-based mapping between strings. The keys are the set of strings you want to replace, and the corresponding values are the strings with which to replace them. You could perform the substitution by calling the string method replace for each key/value pair in the dictionary, thus processing and creating a new copy of the entire text several times, but it is clearly better and faster to do all the changes in a single pass, processing and creating a copy of the text only once. re.sub’s callback facility makes this better approach quite easy.

First, we have to build a regular expression from the set of keys we want to match. Such a regular expression has a pattern of the form a1|a2|...|aN, made up of the N strings to be substituted, joined by vertical bars, and it can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we pass it a callback argument. re.sub then calls this object for each match, with a re.MatchObject instance as its only argument, and it expects the replacement string for that match as the call’s result. In our case, the callback just has to look up the matched text in the dictionary and return the corresponding value.
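The following sketch, written for Python 3, shows why re.escape matters when building the pattern: keys are matched literally, even when they contain characters that are special in regular expressions. The second function is our own hedged refinement, not part of the recipe: when one key is a prefix of another, alternation tries the alternatives left to right, so ordering the keys longest-first lets the longer key win. All sample data is invented:

```python
import re

def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

# '.' and '+' are escaped, so 'a.b' does not match 'axb'
special = {'a.b': 'X', 'c+': 'Y'}
literal = multiple_replace('a.b axb c+ cc', special)    # 'X axb Y cc'

def multiple_replace_longest_first(text, adict):
    # hypothetical variant: try longer keys before their prefixes
    keys = sorted(adict, key=len, reverse=True)
    rx = re.compile('|'.join(map(re.escape, keys)))
    return rx.sub(lambda match: adict[match.group(0)], text)

overlapping = {'cat': 'dog', 'catalog': 'index'}
ordered = multiple_replace_longest_first('cat catalog', overlapping)
# ordered == 'dog index'
```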

The function multiple_replace presented in the recipe recomputes the regular expression and redefines the one_xlat auxiliary function each time you call it. Often, you must perform substitutions on multiple strings based on the same, unchanging translation dictionary and would prefer to pay these setup prices only once. For such needs, you may prefer the following closure-based approach:

import re
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    def xlat(text):
        return rx.sub(one_xlat, text)
    return xlat

You can call make_xlat, passing as its argument a dictionary, or any other combination of arguments you could pass to built-in dict in order to construct a dictionary; make_xlat returns an xlat closure that takes as its only argument, text, the string on which the substitutions are desired, and returns a copy of text with all the substitutions performed.

Here’s a usage example for each half of this recipe. We would normally have such an example as a part of the same .py source file as the functions in the recipe, so it is guarded by the traditional Python idiom that runs it if and only if the module is called as a main script:

if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    adict = {
      "Larry Wall" : "Guido van Rossum",
      "creator" : "Benevolent Dictator for Life",
      "Perl" : "Python",
    }
    print multiple_replace(text, adict)
    translate = make_xlat(adict)
    print translate(text)

Substitutions such as those performed by this recipe are often intended to operate on entire words, rather than on arbitrary substrings. Regular expressions are good at picking up the beginnings and endings of words, thanks to the special sequence r'\b'. We can easily make customized versions of either multiple_replace or make_xlat by simply changing the one line in which each of them builds and assigns the regular expression object rx into a slightly different form:

  rx = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, adict)))

The rest of the code is just the same as shown earlier in this recipe. However, this sameness is not necessarily good news: it suggests that if we need many similarly customized versions, each building the regular expression in slightly different ways, we’ll end up doing a lot of copy-and-paste coding, which is the worst form of code reuse, likely to lead to high maintenance costs in the future.

A key rule of good coding is: “once, and only once!” When we notice that we are duplicating code, we should recognize this symptom as a “code smell,” and refactor our code for better reuse. In this case, for ease of customization, we need a class rather than a function or closure. For example, here’s how to write a class that works very similarly to make_xlat but can be customized by subclassing and overriding:

class make_xlat:
    def __init__(self, *args, **kwds):
        self.adict = dict(*args, **kwds)
        self.rx = self.make_rx( )
    def make_rx(self):
        return re.compile('|'.join(map(re.escape, self.adict)))
    def one_xlat(self, match):
        return self.adict[match.group(0)]
    def __call__(self, text):
        return self.rx.sub(self.one_xlat, text)

This is a “drop-in replacement” for the function of the same name: in other words, a snippet such as the one we showed, with the if __name__ == '__main__' guard, works identically when make_xlat is this class rather than the previously shown function. The function is simpler and faster, but the class’s important advantage is that it can easily be customized in the usual object-oriented way: subclassing it, and overriding some method. To translate by whole words, for example, all we need to code is:

class make_xlat_by_whole_words(make_xlat):
    def make_rx(self):
        return re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, self.adict)))

Ease of customization by subclassing and overriding helps you avoid copy-and-paste coding, and this is sometimes an excellent reason to prefer object-oriented structures over simpler functional structures, such as closures. Of course, just because some functionality is packaged as a class doesn’t magically make it customizable in just the way you want. Customizability also requires some foresight in dividing the functionality into separately overridable methods that correspond to the right pieces of overall functionality. Fortunately, you don’t have to get it right the first time; when code does not have the optimal internal structure for the task at hand (in this specific example, for reuse by subclassing and selective overriding), you can and should refactor the code so that its internal structure serves your needs. Just make sure you have a suitable battery of tests ready to run to ensure that your refactoring hasn’t broken anything, and then you can refactor to your heart’s content. See http://www.refactoring.com for more information on the important art and practice of refactoring.
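To make the benefit of the overridable make_rx method concrete, here is a sketch, written for Python 3, that puts the base class and the whole-word subclass side by side on a made-up sentence:

```python
import re

class make_xlat(object):
    def __init__(self, *args, **kwds):
        self.adict = dict(*args, **kwds)
        self.rx = self.make_rx()
    def make_rx(self):
        return re.compile('|'.join(map(re.escape, self.adict)))
    def one_xlat(self, match):
        return self.adict[match.group(0)]
    def __call__(self, text):
        return self.rx.sub(self.one_xlat, text)

class make_xlat_by_whole_words(make_xlat):
    # only the construction of the regular expression is overridden
    def make_rx(self):
        return re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, self.adict)))

text = 'cat concatenate'
anywhere = make_xlat(cat='dog')(text)                   # 'dog condogenate'
whole = make_xlat_by_whole_words(cat='dog')(text)       # 'dog concatenate'
```

The subclass inherits __init__, one_xlat, and __call__ untouched, so the only code written twice is the one line that actually differs.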

See Also

Documentation for the re module in the Library Reference and Python in a Nutshell; the Refactoring home page (http://www.refactoring.com).

1.19. Checking a String for Any of Multiple Endings

Credit: Michele Simionato

Problem

For a certain string s, you must check whether s has any of several endings; in other words, you need a handy, elegant equivalent of s.endswith(end1) or s.endswith(end2) or s.endswith(end3) and so on.

Solution

The itertools.imap function is just as handy for this task as for many of a similar nature:

import itertools
def anyTrue(predicate, sequence):
    return True in itertools.imap(predicate, sequence)
def endsWith(s, *endings):
    return anyTrue(s.endswith, endings)

Discussion

A typical use for endsWith might be to print all names of image files in the current directory:

import os
for filename in os.listdir('.'):
    if endsWith(filename, '.jpg', '.jpeg', '.gif'):
       print filename

The same general idea shown in this recipe’s Solution is easily applied to other tasks related to checking a string for any of several possibilities. The auxiliary function anyTrue is general and fast, and you can pass it as its first argument (the predicate) other bound methods, such as s.startswith or s.__contains__. Indeed, perhaps it would be better to do without the helper function endsWith; after all, directly coding

    if anyTrue(filename.endswith, (".jpg", ".gif", ".png")):

seems to be already readable enough.
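In Python 2.5 and later, two additions make the helper functions unnecessary: the built-in any, and the ability of the endswith (and startswith) string methods to accept a tuple of candidates. A short sketch, written for Python 3 with made-up filenames:

```python
names = ['a.jpg', 'b.txt', 'c.gif', 'notes.jpeg.bak']
endings = ('.jpg', '.jpeg', '.gif')

# the built-in any plays the role of this recipe's anyTrue
with_any = [n for n in names if any(n.endswith(e) for e in endings)]

# endswith accepts a tuple of endings directly
with_tuple = [n for n in names if n.endswith(endings)]

# both select ['a.jpg', 'c.gif']
```

The tuple form must really be a tuple; passing a list of endings raises a TypeError.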

This recipe originates from a discussion on news:comp.lang.python and summarizes inputs from many people, including Raymond Hettinger, Chris Perkins, Bengt Richter, and others.

See Also

Library Reference and Python in a Nutshell docs for itertools and string methods.

1.20. Handling International Text with Unicode

Credit: Holger Krekel

Problem

You need to deal with text strings that include non-ASCII characters.

Solution

Python has a first-class unicode type that you can use in place of the plain bytestring str type. It’s easy, once you accept the need to explicitly convert between a bytestring and a Unicode string:

>>> german_ae = unicode('\xc3\xa4', 'utf8')

Here german_ae is a unicode string representing the German lowercase a with umlaut (i.e., diaeresis) character “ä”. It has been constructed from interpreting the bytestring '\xc3\xa4' according to the specified UTF-8 encoding. There are many encodings, but UTF-8 is often used because it is universal (UTF-8 can encode any Unicode string) and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded string).

Once you cross this barrier, life is easy! You can manipulate this Unicode string in practically the same way as a plain str string:

>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])

Note that para is a Unicode string, because operations between a unicode string and a bytestring always result in a unicode string—unless they fail and raise an exception:

>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
               position 0: ordinal not in range(128)

The byte '0xc3' is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using Unicode strings with Python.

Discussion

Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don’t have to care much: you can just use the efficient implementation of Unicode that Python provides.

The most important issue is to fully accept the distinction between a bytestring and a unicode string. As exemplified in this recipe’s solution, you often need to explicitly construct a unicode string by providing a bytestring and an encoding. Without an encoding, a bytestring is basically meaningless, unless you happen to be lucky and can just assume that the bytestring is text in ASCII.

The most common problem with using Unicode in Python arises when you are doing some text manipulation where only some of your strings are unicode objects and others are bytestrings. Python makes a shallow attempt to implicitly convert your bytestrings to Unicode. It usually assumes an ASCII encoding, though, which gives you UnicodeDecodeError exceptions if you actually have non-ASCII bytes somewhere. UnicodeDecodeError tells you that you mixed Unicode and bytestrings in such a way that Python cannot (doesn’t even try to) guess the text your bytestring might represent.

Developers from many big Python projects have come up with simple rules of thumb to prevent such runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO barriers. To express this same concept a bit more extensively:

  • Whenever your program receives text data “from the outside” (from the network, from a file, from user input, etc.), construct unicode objects immediately. Find out the appropriate encoding, for example, from an HTTP header, or look for an appropriate convention to determine the encoding to use.

  • Whenever your program sends text data “to the outside” (to the network, to some file, to the user, etc.), determine the correct encoding, and convert your text to a bytestring with that encoding. (Otherwise, Python attempts to convert Unicode to an ASCII bytestring, likely producing UnicodeEncodeErrors, which are just the converse of the UnicodeDecodeErrors previously mentioned.)

With these two rules, you will solve most Unicode problems. If you still get UnicodeErrors of either kind, look for where you forgot to properly construct a unicode object, forgot to properly convert back to an encoded bytestring, or ended up using an inappropriate encoding due to some mistake. (It is quite possible that such encoding mistakes are due to the user, or some other program that is interacting with yours, not following the proper encoding rules or conventions.)
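The two rules can be sketched as a pair of tiny functions. This sketch is written for Python 3, where (an assumption relative to this recipe's Python 2 setting) the Unicode type is named str and the bytestring type is named bytes; the function names and the sample data are hypothetical:

```python
def read_text(raw_bytes, encoding):
    # rule 1: decode as soon as data arrives from the outside
    return raw_bytes.decode(encoding)

def write_text(text, encoding):
    # rule 2: encode only at the moment data leaves the program
    return text.encode(encoding)

incoming = b'K\xc3\xa4se'                 # UTF-8 bytes, e.g. from a socket
text = read_text(incoming, 'utf-8')       # Unicode from here on

# all processing in between deals with Unicode text only
text = text + '!'

outgoing = write_text(text, 'latin-1')    # the receiver wants Latin-1
# outgoing == b'K\xe4se!'
```

Everything between the two calls can ignore encodings entirely, which is the whole point of converting at the barriers.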

In order to convert a Unicode string back to an encoded bytestring, you usually do something like:

>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'

Now bytestring is a German ae character in the 'latin1' encoding. Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in UTF-8) represent the same German character, but in different encodings.
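The same comparison can be sketched for Python 3 (where, as an assumption relative to this recipe, bytestring literals carry a b prefix). The last line shows what happens when the wrong encoding is assumed: no exception is raised, but the text comes out garbled:

```python
german_ae = b'\xc3\xa4'.decode('utf-8')      # one Unicode character

as_latin1 = german_ae.encode('latin-1')      # b'\xe4'
as_utf8 = german_ae.encode('utf-8')          # b'\xc3\xa4'

# decoding UTF-8 bytes as Latin-1 "succeeds" but yields two wrong characters
garbled = as_utf8.decode('latin-1')
assert len(german_ae) == 1 and len(garbled) == 2
```

This is why a silently wrong encoding can be worse than a UnicodeDecodeError: the error at least tells you that something is amiss.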

By now, you can probably imagine why Python refuses to guess among the hundreds of possible encodings. It’s a crucial design choice, based on one of the Zen of Python principles: “In the face of ambiguity, refuse the temptation to guess.” At any interactive Python shell prompt, enter the statement import this to read all of the important principles that make up the Zen of Python.

See Also

Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/. Also recommended is a short but complete article from Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)!,” located at http://www.joelonsoftware.com/articles/Unicode.html. See also the Library Reference and Python in a Nutshell documentation about the built-in str and unicode types and modules unicodedata and codecs; also, Recipe 1.21 and Recipe 1.22.

1.21. Converting Between Unicode and Plain Strings

Credit: David Ascher, Paul Prescod

Problem

You need to deal with textual data that doesn’t necessarily fit in the ASCII character set.

Solution

Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:

unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode—what it is, how it works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial practical tips, and this recipe tries to offer more perspective.

You don’t need to know everything about Unicode to be able to solve real-world problems with it, but a few basic tidbits of knowledge are indispensable. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as if they were the same thing. A byte can hold only 256 different values, so these environments are limited to dealing with no more than 256 distinct characters. Unicode, on the other hand, has tens of thousands of characters, so a single byte cannot identify every Unicode character, and a character’s encoded form may take more than one byte; thus you need to make the distinction between characters and bytes.

Standard Python strings are really bytestrings, and a Python character, being such a string of length 1, is really a byte. Other terms for an instance of the standard Python string type are 8-bit string and plain string. In this recipe we call such instances bytestrings, to remind you of their byte orientation.

A Python Unicode character is an abstract object big enough to hold any character, analogous to Python’s long integers. You don’t have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method of files or the send method of network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a bytestring is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
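The encode-on-output, decode-on-input cycle described above can be sketched with an in-memory byte stream. This is a Python 3 sketch under stated assumptions: io.BytesIO stands in for a real file or socket, and the sample text and variable names are made up for illustration.

```python
import io

text = u'caf\xe9 \u20ac'      # six characters: c, a, f, e-acute, space, Euro sign
stream = io.BytesIO()         # byte-oriented: it accepts bytes only

# Encoding happens at the boundary, on the way out.
stream.write(text.encode('utf-8'))

# What the stream holds is bytes, and more of them than there were characters.
raw = stream.getvalue()
print(len(text), len(raw))

# Decoding happens at the boundary, on the way back in.
restored = raw.decode('utf-8')
assert restored == text
```

The character count and the byte count differ because the accented letter and the Euro sign each need more than one byte in UTF-8.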

Converting Unicode objects to bytestrings can be achieved in many ways, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one “right” encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few encodings you should know about:

  • The UTF-8 encoding can handle any Unicode character. It is also backwards compatible with ASCII, so that a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backwards-compatible, especially with older Unix tools. UTF-8 is by far the dominant encoding on Unix, as well as the default encoding for XML documents. UTF-8’s primary weakness is that it is fairly inefficient for eastern-language texts.

  • The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for western languages but more efficient for eastern ones. A variant of UTF-16 is sometimes known as UCS-2.

  • The ISO-8859 series of encodings are supersets of ASCII, each able to deal with 256 distinct characters. These encodings cannot support all of the Unicode characters; they support only some particular language or family of languages. ISO-8859-1, also known as “Latin-1”, covers most western European and African languages, but not Arabic. ISO-8859-2, also known as “Latin-2”, covers many eastern European languages such as Hungarian and Polish. ISO-8859-15, very popular in Europe these days, is basically the same as ISO-8859-1 with the addition of the Euro currency symbol as a character.

If you want to be able to encode all Unicode characters, you’ll probably want to use UTF-8. You will need to deal with the other encodings only when you are handed data in those encodings created by some other application or input device, or vice versa, when you need to prepare data in a specified encoding to accommodate another application downstream of yours, or an output device. In particular, Recipe 1.22 shows how to handle the case in which the downstream application or device is driven from your program’s standard output stream.
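To make the trade-offs among these encodings concrete, here is a rough Python 3 sketch that measures how many bytes a few sample strings need in each encoding. The helper encoded_length and the sample strings are assumptions introduced for illustration; 'utf-16-le' is used rather than 'utf-16' so that the two-byte byte-order mark does not distort the counts.

```python
samples = [
    (u'hello', 'pure ASCII'),
    (u'\xe9t\xe9', 'accented Latin'),
    (u'\u20ac', 'Euro sign'),
    (u'\u65e5\u672c\u8a9e', 'three CJK characters'),
]
encodings = ['utf-8', 'utf-16-le', 'iso-8859-1', 'iso-8859-15']

def encoded_length(text, encoding):
    """Return the byte count of text in encoding, or None if the
    encoding has no representation for some character in text."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

for text, label in samples:
    sizes = [encoded_length(text, enc) for enc in encodings]
    print(label, dict(zip(encodings, sizes)))
```

In this sketch the ASCII sample is smallest in UTF-8, the CJK sample is smallest in UTF-16, and the ISO-8859 encodings report None for anything outside their 256 characters, with ISO-8859-15 accepting the Euro sign that ISO-8859-1 rejects.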

See Also

Unicode is a huge topic, but a recommended book is Tony Graham, Unicode: A Primer (Hungry Minds); details are available at http://www.menteith.com/unicode/primer/. A short but complete article from Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)!” is located at http://www.joelonsoftware.com/articles/Unicode.html. See also the Library Reference and Python in a Nutshell documentation about the built-in str and unicode types, and modules unicodedata and codecs; also, Recipe 1.20 and Recipe 1.22.

1.22. Printing Unicode Characters to Standard Output

Credit: David Ascher

Problem

You want to print Unicode strings to standard output (e.g., for debugging), but they don’t fit in the default encoding.

Solution

Wrap the sys.stdout stream with a converter, using the codecs module of Python’s standard library. For example, if you know your output is going to a terminal that displays characters according to the ISO-8859-1 encoding, you can code:

import codecs, sys
sys.stdout = codecs.lookup('iso8859-1')[-1](sys.stdout)

Discussion

Unicode strings live in a large space, big enough for all of the characters in every language worldwide, but thankfully the internal representation of Unicode strings is irrelevant for users of Unicode. Alas, a file stream, such as sys.stdout, deals with bytes and has an encoding associated with it. You can change the default encoding that is used for new files by modifying the site module. That, however, requires changing your entire Python installation, which is likely to confuse other applications that may expect the encoding you originally configured Python to use (typically the Python standard encoding, which is ASCII). Therefore, this kind of modification is not to be recommended.

This recipe takes a sounder approach: it rebinds sys.stdout as a stream that expects Unicode input and outputs it in ISO-8859-1 (also known as “Latin-1”). This approach doesn’t change the encoding of any previous references to sys.stdout, as illustrated here. First, we keep a reference to the original, ASCII-encoded sys.stdout:

>>> old = sys.stdout

Then, we create a Unicode string that wouldn’t normally be able to go through sys.stdout:

>>> char = u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"
>>> print char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

If you don’t get an error from this operation, it’s because Python thinks it knows which encoding your “terminal” is using (in particular, Python is likely to use the right encoding if your “terminal” is IDLE, the free development environment that comes with Python). But, suppose you do get this error, or get no error but the output is not the character you expected, because your “terminal” uses UTF-8 encoding and Python does not know about it. When that is the case, we can just wrap sys.stdout in the codecs stream writer for UTF-8, which is a much richer encoding, then rebind sys.stdout to it and try again:

>>> sys.stdout = codecs.lookup('utf-8')[-1](sys.stdout)
>>> print char
ä

This approach works only if your “terminal”, terminal emulator, or other window in which you’re running the interactive Python interpreter supports the UTF-8 encoding, with a font rich enough to display all the characters you need to output. If you don’t have such a program or device available, you may be able to find a suitable one for your platform in the form of a free program downloadable from the Internet.

Python tries to determine which encoding your “terminal” is using and sets that encoding’s name as attribute sys.stdout.encoding. Sometimes (alas, not always) it even manages to get it right. IDLE already wraps your sys.stdout, as suggested in this recipe, so, within the environment’s interactive Python shell, you can directly print Unicode strings.
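To see exactly which bytes such a wrapped stream emits, without depending on any particular terminal, the wrapping can be tried on an in-memory stream. This is a Python 3 sketch under stated assumptions: io.BytesIO stands in for a byte-oriented sys.stdout, the helper bytes_written is hypothetical, and codecs.getwriter is used as the named equivalent of indexing the result of codecs.lookup with [-1].

```python
import codecs
import io

char = u'\N{LATIN SMALL LETTER A WITH DIAERESIS}'

def bytes_written(encoding, text):
    """Write text through a codecs stream writer for the given encoding
    and return the raw bytes that reached the underlying stream."""
    raw = io.BytesIO()                         # stand-in for a byte stream
    writer = codecs.getwriter(encoding)(raw)   # the codec's StreamWriter class
    writer.write(text)                         # encodes, then writes bytes
    return raw.getvalue()

print(repr(bytes_written('iso-8859-1', char)))
print(repr(bytes_written('utf-8', char)))
```

The same character reaches the stream as one byte through the ISO-8859-1 writer and as two bytes through the UTF-8 writer, which is why the wrapper's encoding has to match what the terminal expects.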

See Also

Documentation for the codecs and site modules, and setdefaultencoding in module sys, in the Library Reference and Python in a Nutshell; Recipe 1.20 and Recipe 1.21.

1.23. Encoding Unicode Data for XML and HTML

Credit: David Goodger, Peter Cogolo

Problem

You want to encode Unicode text for output in HTML, or some other XML application, using a limited but popular encoding such as ASCII or Latin-1.

Solution

Python provides an encoding error handler named xmlcharrefreplace, which replaces all characters outside of the chosen encoding with XML numeric character references:

def encode_for_xml(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'xmlcharrefreplace')

You could use this approach for HTML output, too, but you might prefer to use HTML’s symbolic entity references instead. For this purpose, you need to define and register a customized encoding error handler. Implementing that handler is made easier by the fact that the Python Standard Library includes a module named htmlentitydefs that holds HTML entity definitions:

import codecs
from htmlentitydefs import codepoint2name
def html_replace(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = [ u'&%s;' % codepoint2name[ord(c)]
              for c in exc.object[exc.start:exc.end] ]
        return ''.join(s), exc.end
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)
codecs.register_error('html_replace', html_replace)

After registering this error handler, you can optionally write a function to wrap its use:

def encode_for_html(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'html_replace')

Discussion

As with any good Python module, this module would normally proceed with an example of its use, guarded by an if __name__ == '__main__' test:

if __name__ == '__main__':
    # demo
    data = u'''\
<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
    print encode_for_xml(data)
    print encode_for_html(data)

If you run this module as a main script, you will then see such output as (from function encode_for_xml):

<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)

as well as (from function encode_for_html):

<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)

There is clearly a niche for each case, since encode_for_xml is more general (you can use it for any XML application, not just HTML), but encode_for_html may produce output that’s easier to read—should you ever need to look at it directly, edit it further, and so on. If you feed either form to a browser, you should view it in exactly the same way. To visualize both forms of encoding in a browser, run this recipe’s module as a main script, redirect the output to a disk file, and use a text editor to separate the two halves before you view them with a browser. (Alternatively, run the script twice, once commenting out the call to encode_for_xml, and once commenting out the call to encode_for_html.)

Remember that Unicode data must always be encoded before being printed or written out to a file. UTF-8 is an ideal encoding, since it can handle any Unicode character. But for many users and applications, ASCII or Latin-1 encodings are often preferred over UTF-8. When the Unicode data contains characters that are outside of the given encoding (e.g., accented characters and most symbols are not encodable in ASCII, and the “infinity” symbol is not encodable in Latin-1), these encodings cannot handle the data on their own. Python supports a built-in encoding error handler called xmlcharrefreplace, which replaces unencodable characters with XML numeric character references, such as &#8734; for the “infinity” symbol. This recipe shows how to write and register another similar error handler, html_replace, specifically for producing HTML output. html_replace replaces unencodable characters with more readable HTML symbolic entity references, such as &infin; for the “infinity” symbol. html_replace is less general than xmlcharrefreplace, since it does not support all Unicode characters and cannot be used with non-HTML applications; however, it can still be useful if you want HTML output that is as readable as possible in a “view page source” context.
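A short sketch of the built-in handler at work may help here. It is written for Python 3, and the sample sentence and variable names are assumptions made up for illustration; only the 'xmlcharrefreplace' handler name comes from the recipe.

```python
text = u'\xe9 costs \u20ac5, limit \u221e'   # e-acute, Euro sign, infinity

# Without an error handler, an unencodable character raises an exception.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With xmlcharrefreplace, every unencodable character becomes &#NNNN;
as_ascii = text.encode('ascii', 'xmlcharrefreplace')

# Latin-1 can hold the e-acute directly, so only the Euro sign and the
# infinity symbol are replaced by references.
as_latin1 = text.encode('iso-8859-1', 'xmlcharrefreplace')

print(strict_failed)
print(as_ascii)
print(as_latin1)
```

The handler steps in only for characters the chosen encoding cannot represent, so a richer target encoding yields fewer character references.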

Neither of these error handlers makes sense for output that is neither HTML nor some other form of XML. For example, TeX and other markup languages do not recognize XML numeric character references. However, if you know how to build an arbitrary character reference for such a markup language, you may modify the example error handler html_replace shown in this recipe’s Solution to code and register your own encoding error handler.
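As a rough sketch of such an adaptation, the handler below follows the same shape as html_replace but emits a made-up {U+XXXX} notation. The notation, the handler name codepoint_replace, and the sample text are all hypothetical; a real handler would substitute whatever reference syntax the target markup language actually defines. The sketch is written for Python 3.

```python
import codecs

def codepoint_replace(exc):
    """Encoding error handler that replaces each unencodable character
    with a {U+XXXX} placeholder (a notation invented for this sketch)."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = u''.join(u'{U+%04X}' % ord(c) for c in bad)
        # Return the replacement text and the position at which to resume.
        return replacement, exc.end
    raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('codepoint_replace', codepoint_replace)

encoded = u'x \u2264 \u221e'.encode('ascii', 'codepoint_replace')
print(encoded)
```

Only the line that builds the replacement string needs to change to target a different notation; the registration and the calling convention stay the same.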

An alternative (and very effective!) way to perform encoding of Unicode data into a file, with a given encoding and error handler of your choice, is offered by the codecs module in Python’s standard library:

outfile = codecs.open('out.html', mode='w', encoding='ascii',
                       errors='html_replace')

You can now use outfile.write(unicode_data) for any arbitrary Unicode string unicode_data, and all the encoding and error handling will be taken care of transparently. When your output is finished, of course, you should call outfile.close( ).
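The following Python 3 sketch shows the same idea end to end. Two substitutions are assumptions made so that the sketch runs on its own: it uses io.open, which accepts the same encoding and errors keywords, in place of codecs.open, and it uses the built-in xmlcharrefreplace handler instead of html_replace, which would have to be registered first. The temporary file and the sample data are likewise made up for illustration.

```python
import io
import os
import tempfile

unicode_data = u'<p>\xe9 and \u221e</p>'

# Create a scratch file so the sketch leaves nothing behind.
fd, path = tempfile.mkstemp(suffix='.html')
os.close(fd)
try:
    outfile = io.open(path, mode='w', encoding='ascii',
                      errors='xmlcharrefreplace')
    try:
        # Encoding and error handling happen inside write.
        outfile.write(unicode_data)
    finally:
        outfile.close()
    # Read the file back in binary mode to see the bytes actually stored.
    with io.open(path, 'rb') as infile:
        on_disk = infile.read()
finally:
    os.remove(path)

print(on_disk)
```

Wrapping the write in try/finally ensures the file is closed, and therefore flushed, even if encoding fails partway through.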

See Also

Library Reference and Python in a Nutshell docs for modules codecs and htmlentitydefs.

1.24. Making Some Strings Case-Insensitive

Credit: Dale Strickland-Clark, Peter Cogolo, Mark McMahon

Problem

You want to treat some strings so that all comparisons and lookups are case-insensitive, while all other uses of the strings preserve the original case.

Solution

The best solution is to wrap the specific strings in question into a suitable subclass of str:

class iStr(str):
    """
    Case insensitive string class.
    Behaves just like str, except that all comparisons and lookups
    are case insensitive.
    """
    def __init__(self, *args):
        self._lowered = str.lower(self)
    def __repr__(self):
        return '%s(%s)' % (type(self).__name__, str.__repr__(self))
    def __hash__(self):
        return hash(self._lowered)
    def lower(self):
        return self._lowered
def _make_case_insensitive(name):
    ''' wrap one method of str into an iStr one, case-insensitive '''
    str_meth = getattr(str, name)
    def x(self, other, *args):
        ''' try lowercasing 'other', which is typically a string, but
            be prepared to use it as-is if lowering gives problems,
            since strings CAN be correctly compared with non-strings.
        '''
        try: other = other.lower( )
        except (TypeError, AttributeError, ValueError): pass
        return str_meth(self._lowered, other, *args)
    # in Python 2.4, only, add the statement: x.func_name = name
    setattr(iStr, name, x)
# apply the _make_case_insensitive function to specified methods
for name in 'eq lt le gt ge ne cmp contains'.split( ):
    _make_case_insensitive('__%s__' % name)
for name in 'count endswith find index rfind rindex startswith'.split( ):
    _make_case_insensitive(name)
# note that we don't modify methods 'replace', 'split', 'strip', ...
# of course, you can add modifications to them, too, if you prefer that.
del _make_case_insensitive    # remove helper function, not needed any more

Discussion

Some implementation choices in class iStr are worthy of notice. First, we choose to generate the lowercase version once and for all, in method __init__, since we envision that in typical uses of iStr instances, this version will be required repeatedly. We hold that version in an attribute that is private, but not overly so (i.e., has a name that begins with one underscore, not two), because if iStr gets subclassed (e.g., to make a more extensive version that also offers case-insensitive splitting, replacing, etc., as the comment in the “Solution” suggests), iStr’s subclasses are quite likely to want to access this crucial “implementation detail” of superclass iStr!

We do not offer “case-insensitive” versions of such methods as replace, because it’s anything but clear what kind of input-output relation we might want to establish in the general case. Application-specific subclasses may therefore be the way to provide this functionality in ways appropriate to a given application. For example, since the replace method is not wrapped, calling replace on an instance of iStr returns an instance of str, not of iStr. If that is a problem in your application, you may want to wrap all iStr methods that return strings, simply to ensure that the results are made into instances of iStr. For that purpose, you need another, separate helper function, similar but not identical to the _make_case_insensitive one shown in the “Solution”:

def _make_return_iStr(name):
    str_meth = getattr(str, name)
    def x(*args):
        return iStr(str_meth(*args))
    setattr(iStr, name, x)

and you need to call this helper function _make_return_iStr on all the names of relevant string methods returning strings such as:

for name in 'center ljust rjust strip lstrip rstrip'.split( ):
    _make_return_iStr(name)

Strings have about 20 methods (including special methods such as __add__ and __mul__) that you should consider wrapping in this way. You can also wrap in this way some additional methods, such as split and join, which may require special handling, and others, such as encode and decode, that you cannot deal with unless you also define a case-insensitive unicode subtype. In practice, one can hope that not every single one of these methods will prove problematic in a typical application. However, as you can see, the very functional richness of Python strings makes it a bit of work to customize string subtypes fully, in a general way without depending on the needs of a specific application.

The implementation of iStr is careful to avoid the boilerplate code (meaning repetitious and therefore bug-prone code) that we’d need if we just overrode each needed method of str in the normal way, with def statements in the class body. A custom metaclass or other such advanced technique would offer no special advantage in this case, so the boilerplate avoidance is simply obtained with one helper function that generates and installs wrapper closures, and two loops using that function, one for normal methods and one for special ones. The loops need to be placed after the class statement, as we do in this recipe’s Solution, because they need to modify the class object iStr, and the class object doesn’t exist yet (and thus cannot be modified) until the class statement has completed.

In Python 2.4, you can reassign the func_name attribute of a function object, and in this case, you should do so to get clearer and more readable results when introspection (e.g., the help function in an interactive interpreter session) is applied to an iStr instance. However, Python 2.3 considers attribute func_name of function objects to be read-only; therefore, in this recipe’s Solution, we have indicated this possibility only in a comment, to avoid losing Python 2.3 compatibility over such a minor issue.

Case-insensitive (but case-preserving) strings have many uses, from more tolerant parsing of user input, to filename matching on filesystems that share this characteristic, such as all of Windows filesystems and the Macintosh default filesystem. You might easily find yourself creating a variety of “case-insensitive” container types, such as dictionaries, lists, sets, and so on—meaning containers that go out of their way to treat string-valued keys or items as if they were case-insensitive. Clearly a better architecture is to factor out the functionality of “case-insensitive” comparisons and lookups once and for all; with this recipe in your toolbox, you can just add the required wrapping of strings into iStr instances wherever you may need it, including those times when you’re making case-insensitive container types.
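A brief usage sketch may make these uses more tangible. So that it runs on its own, it restates the Solution's class in condensed form for Python 3, which is an adaptation rather than the recipe's exact code: the cmp special method is left out because Python 3 strings do not have it, and the wrapper's name is set unconditionally. The ports dictionary and the sample strings are made up for illustration.

```python
class iStr(str):
    """Condensed Python 3 restatement of the Solution's iStr."""
    def __init__(self, *args):
        self._lowered = str.lower(self)
    def __repr__(self):
        return '%s(%s)' % (type(self).__name__, str.__repr__(self))
    def __hash__(self):
        return hash(self._lowered)
    def lower(self):
        return self._lowered

def _make_case_insensitive(name):
    str_meth = getattr(str, name)
    def x(self, other, *args):
        # lowercase 'other' when possible, else use it unchanged
        try:
            other = other.lower()
        except (TypeError, AttributeError, ValueError):
            pass
        return str_meth(self._lowered, other, *args)
    x.__name__ = name
    setattr(iStr, name, x)

for name in 'eq lt le gt ge ne contains'.split():
    _make_case_insensitive('__%s__' % name)
for name in 'count endswith find index rfind rindex startswith'.split():
    _make_case_insensitive(name)

# Keys wrapped in iStr match regardless of case, yet keep their spelling.
ports = {iStr('HTTP'): 80, iStr('Ssh'): 22}
print(ports[iStr('http')])
print(sorted(ports))                          # ordered by lowercase form
print(iStr('README.txt').endswith('.TXT'))
print(iStr('Hello') == 'HELLO')
```

Note that the lookup key is itself wrapped in iStr: a plain str key hashes by its original case, so wrapping both the stored keys and the lookup keys is the dependable pattern.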

For example, a list whose items are basically strings, but are to be treated case-insensitively (for sorting purposes and in such methods as count and index), is reasonably easy to build on top of iStr:

class iList(list):
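    def __init__(self, *args):
        list.__init__(self, *args)
        # rely on slice assignment to wrap each item into iStr...
        self[:] = self
    wrap_each_item = iStr
    def __setitem__(self, i, v):
        if isinstance(i, slice): v = map(self.wrap_each_item, v)
        else: v = self.wrap_each_item(v)
        list.__setitem__(self, i, v)
    def __setslice__(self, i, j, v):
        # Python 2 sends simple slices such as self[:] here, not to __setitem__
        list.__setslice__(self, i, j, map(self.wrap_each_item, v))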
    def append(self, item):
        list.append(self, self.wrap_each_item(item))
    def extend(self, seq):
        list.extend(self, map(self.wrap_each_item, seq))

Essentially, all we’re doing is ensuring that every item that gets into an instance of iList gets wrapped by a call to iStr, and everything else takes care of itself.

Incidentally, this example class iList is accurately coded so that you can easily make customized subclasses of iList to accommodate application-specific subclasses of iStr: all such a customized subclass of iList needs to do is override the single class-level member named wrap_each_item.

See Also

Library Reference and Python in a Nutshell sections on str, string methods, and special methods used in comparisons and hashing.

1.25. Converting HTML Documents to Text on a Unix Terminal

Credit: Brent Burley, Mark Moraes

Problem

You need to visualize HTML documents as text, with support for bold and underlined display on your Unix terminal.

Solution

The simplest approach is to code a filter script, taking HTML on standard input and emitting text and terminal control sequences on standard output. Since this recipe only targets Unix, we can get the needed terminal control sequences from the “Unix” command tput, via the function popen of the Python Standard Library module os:

#!/usr/bin/env python
import sys, os, htmllib, formatter
# use Unix tput to get the escape sequences for bold, underline, reset
set_bold = os.popen('tput bold').read( )
set_underline = os.popen('tput smul').read( )
perform_reset = os.popen('tput sgr0').read( )
class TtyFormatter(formatter.AbstractFormatter):
    ''' a formatter that keeps track of bold and italic font states, and
        emits terminal control sequences accordingly.
    '''
    def __init__(self, writer):
        # first, as usual, initialize the superclass
        formatter.AbstractFormatter.__init__(self, writer)
        # start with neither bold nor italic, and no saved font state
        self.fontState = False, False
        self.fontStack = [  ]
    def push_font(self, font):
        # the `font' tuple has four items, we only track the two flags
        # about whether italic and bold are active or not
        size, is_italic, is_bold, is_tt = font
        self.fontStack.append((is_italic, is_bold))
        self._updateFontState( )
    def pop_font(self, *args):
        # go back to previous font state
        try:
            self.fontStack.pop( )
        except IndexError:
            pass
        self._updateFontState( )
    def _updateFontState(self):
        # emit appropriate terminal control sequences if the state of
        # bold and/or italic(==underline) has just changed
        try:
            newState = self.fontStack[-1]
        except IndexError:
            newState = False, False
        if self.fontState != newState:
            # relevant state change: reset terminal
            print perform_reset,
            # set underline and/or bold if needed
            if newState[0]:
                print set_underline,
            if newState[1]:
                print set_bold,
            # remember the two flags as our current font-state
            self.fontState = newState
# make writer, formatter and parser objects, connecting them as needed
myWriter = formatter.DumbWriter( )
if sys.stdout.isatty( ):
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)
# feed all of standard input to the parser, then terminate operations
myParser.feed(sys.stdin.read( ))
myParser.close( )

Discussion

The basic formatter.AbstractFormatter class, offered by the Python Standard Library, should work just about anywhere. On the other hand, the refinements in the TtyFormatter subclass that’s the focus of this recipe depend on using a Unix-like terminal, and more specifically on the availability of the tput Unix command to obtain information on the escape sequences used to get bold or underlined output and to reset the terminal to its base state.

Many systems that do not have Unix certification, such as Linux and Mac OS X, do have a perfectly workable tput command and therefore can use this recipe’s TtyFormatter subclass just fine. In other words, you can take the use of the word “Unix” in this recipe just as loosely as you can take it in just about every normal discussion: take it as meaning “*ix,” if you will.

If your “terminal” emulator supports other escape sequences for controlling output appearance, you should be able to adapt this TtyFormatter class accordingly. For example, on Windows, a cmd.exe command window should, I’m told, support standard ANSI escape sequences, so you could choose to hard-code those sequences if Windows is the platform on which you want to run your version of this script.
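A minimal sketch of that hard-coding approach follows. The three sequences are the standard ANSI (ECMA-48) SGR codes for bold, underline, and reset; whether a given console actually interprets them is an assumption that depends on the terminal and its settings. The constant names mirror the recipe's variables, and the helper function styled is hypothetical.

```python
# Hard-coded ANSI escape sequences, used in place of the tput lookups.
SET_BOLD = '\x1b[1m'
SET_UNDERLINE = '\x1b[4m'
PERFORM_RESET = '\x1b[0m'

def styled(text, bold=False, underline=False):
    """Return text wrapped in the escape sequences for the requested
    attributes, followed by a reset; plain text is returned unchanged."""
    prefix = ''
    if underline:
        prefix += SET_UNDERLINE
    if bold:
        prefix += SET_BOLD
    if not prefix:
        return text
    return prefix + text + PERFORM_RESET

print(styled('bold text', bold=True))
print(styled('underlined text', underline=True))
print(repr(styled('both', bold=True, underline=True)))
```

In a variant of this recipe's script, the three constants could replace the values read from tput, leaving the TtyFormatter logic unchanged.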

In many cases, you may prefer to use other existing Unix commands, such as lynx -dump -, to get richer formatting than this recipe provides. However, this recipe comes in quite handy when you find yourself on a system that has a Python installation but lacks such other helpful commands as lynx.

See Also

Library Reference and Python in a Nutshell docs on the formatter and htmllib modules; man tput on a Unix or Unix-like system for more information about the tput command.
