Accessing Substrings

Credit: Alex Martelli

Problem

You want to access portions of a string. For example, you’ve read a fixed-width record and want to extract the record’s fields.

Solution

Slicing is great, of course, but it only does one field at a time:

afield = theline[3:8]

If you need to think in terms of field length, struct.unpack may be appropriate. Here’s an example of getting a five-byte string, skipping three bytes, getting two eight-byte strings, and then getting the rest:

import struct

# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
numremain = len(theline)-struct.calcsize(baseformat)
format = "%s %ds" % (baseformat, numremain)
leading, s1, s2, trailing = struct.unpack(format, theline)
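
As a minimal sketch, here is the same snippet run end to end under Python 3, where struct.unpack requires a bytes object rather than a text string. The sample record is a made-up value used only for illustration, and the format variable is renamed fmt so it does not shadow the built-in format function:

```python
import struct

# Hypothetical 30-byte record, used only for illustration.
theline = b"ABCDE---12345678abcdefghTAIL.."

# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest.
baseformat = "5s 3x 8s 8s"
numremain = len(theline) - struct.calcsize(baseformat)
fmt = "%s %ds" % (baseformat, numremain)
leading, s1, s2, trailing = struct.unpack(fmt, theline)
```

The number of names on the left must match the number of non-skipped fields in the format, or the tuple unpacking fails.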

If you need to split at five-byte boundaries, here’s how you could do it:

numfives, therest = divmod(len(theline), 5)
form5 = "%s %dx" % ("5s "*numfives, therest)
fivers = struct.unpack(form5, theline)
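
The following sketch runs the five-byte split on a made-up 13-byte input (Python 3, so the data is a bytes object) and compares it with a plain slicing loop. The slicing version is not part of the recipe; it is shown only as a cross-check:

```python
import struct

theline = b"abcdefghijklm"  # hypothetical 13-byte input

numfives, therest = divmod(len(theline), 5)
# Two "5s" fields, then skip the 3 leftover bytes.
form5 = "%s %dx" % ("5s " * numfives, therest)
fivers = struct.unpack(form5, theline)

# Cross-check: the same full blocks obtained by slicing.
sliced = tuple(theline[i:i + 5] for i in range(0, numfives * 5, 5))
```

Both approaches drop the trailing partial block, because the format skips it and the range stops before it.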

Chopping a string into individual characters is of course easier:

chars = list(theline)

If you prefer to think of your data as being cut up at specific columns, slicing within list comprehensions may be handier:

cuts = [8,14,20,26,30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
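
A small worked example of the column-cut approach, on a made-up 40-character record. This sketch uses None as the final slice bound, which means "up to the end of the string"; it needs no import and also works in Python 3, where sys.maxint is not available:

```python
theline = "0123456789" * 4  # hypothetical 40-character record
cuts = [8, 14, 20, 26, 30]

# Pair each start column with the next cut; None leaves the last slice open.
pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]
```

Because the slices are contiguous and the last one is open-ended, joining the pieces reproduces the original line.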

Discussion

This recipe was inspired by Recipe 1.1 in the Perl Cookbook. Python’s slicing takes the place of Perl’s substr. Perl’s built-in unpack and Python’s struct.unpack are similar. Perl’s is slightly handier, as it accepts a field length of * for the last field to mean all the rest. In Python, we have to compute and insert the exact length for either extraction or skipping. This isn’t a major issue, because such extraction tasks will usually be encapsulated into small, probably local functions. Memoizing, or automatic caching, may help with performance if the function is called repeatedly, since it allows you to avoid redoing the preparation of the format for the struct unpacking. See also Recipe 17.8.

In a purely Python context, the point of this recipe is to remind you that struct.unpack is often viable, and sometimes preferable, as an alternative to string slicing (not quite as often as unpack versus substr in Perl, given the lack of a *-valued field length, but often enough to be worth keeping in mind).

Each of these snippets is, of course, best encapsulated in a function. Among other advantages, encapsulation ensures we don’t have to work out the computation of the last field’s length on each and every use. This function is the equivalent of the first struct.unpack snippet in the solution:

def fields(baseformat, theline, lastfield=None):
    numremain = len(theline)-struct.calcsize(baseformat)
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

If this function is called in a loop, caching the computed format under the key (baseformat, len(theline), lastfield) can offer an easy speed-up, because the format is then built only once for each distinct key.

The function equivalent of the second snippet in the solution is:

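def split_by(theline, n, lastfield=None):
    # Keep every full n-byte block; keep or skip the shorter remainder.
    numblocks, therest = divmod(len(theline), n)
    baseblock = "%ds" % n
    format = "%s %d%s" % (baseblock*numblocks, therest, lastfield and "s" or "x")
    return struct.unpack(format, theline)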

And for the third snippet:

def split_at(theline, cuts, lastfield=None):
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts) ]
    if lastfield:
        pieces.append(theline[cuts[-1]:])
    return pieces

In each of these functions, a decision worth noticing (and, perhaps, worth criticizing) is that of having a lastfield=None optional parameter. This reflects the observation that while we often want to skip the last, unknown-length subfield, sometimes we want to retain it instead. The use of lastfield in the expression lastfield and "s" or "x" (equivalent to C’s lastfield?'s':'x') saves an if/else, but it’s unclear whether the saving is worth it. "sx"[not lastfield] and other similar alternatives are roughly equivalent in this respect; see Recipe 17.6. When lastfield is false, applying struct.unpack to just a prefix of theline (specifically, theline[:struct.calcsize(format)]) is an alternative, but it’s not easy to merge with the case of lastfield being true, when the format does need a supplementary field for len(theline)-struct.calcsize(format).
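
The alternatives mentioned above can be compared side by side in a short sketch. It assumes Python 2.5 or later for the conditional expression (and bytes data, as Python 3 requires); the helper name last_code and the sample record are hypothetical:

```python
import struct

def last_code(lastfield):
    # Conditional-expression spelling of: lastfield and "s" or "x"
    return "s" if lastfield else "x"

# All three spellings agree for typical false and true values.
for flag in (None, 0, "", 1, True, "keep"):
    assert last_code(flag) == (flag and "s" or "x") == "sx"[not flag]

# Prefix alternative for the lastfield-is-false case: unpack only the
# bytes that the base format covers and ignore whatever follows.
baseformat = "5s 3x 8s 8s"
theline = b"ABCDE---12345678abcdefghTAIL.."  # hypothetical record
prefix_fields = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
```

The and/or idiom is safe here only because "s" is a true value; with a false first alternative it would silently pick the second one.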

See Also

Recipe 17.6 and Recipe 17.8; Perl Cookbook Recipe 1.1.
