Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something to every word in a file, similar to the foreach function of csh.

Solution

This is best handled by two nested loops, one on lines and one on the words in each line:

for line in open(thefilepath).xreadlines():
    for word in line.split():
        dosomethingwith(word)

This implicitly defines words as sequences of non-whitespace characters separated by runs of whitespace (just as the Unix program wc does). For other definitions of words, you can use regular expressions. For example:
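To see what the whitespace-based definition yields in practice, here is a minimal sketch written for Python 3, where file objects are iterated directly rather than through xreadlines. The helper name process_words and the sample text are made up for illustration, and io.StringIO merely stands in for an open file:

```python
import io

def process_words(lines, dosomethingwith):
    """Call dosomethingwith(word) for every whitespace-separated word."""
    for line in lines:
        # split() with no arguments splits on runs of any whitespace
        # (spaces, tabs, newlines) and never returns empty strings.
        for word in line.split():
            dosomethingwith(word)

# io.StringIO iterates line by line, just as an open text file does.
sample = io.StringIO("the quick\tbrown   fox\n\njumps over\n")
seen = []
process_words(sample, seen.append)
print(seen)  # ['the', 'quick', 'brown', 'fox', 'jumps', 'over']
```

Note that the tab, the run of three spaces, and the blank line all disappear: only the words themselves reach dosomethingwith.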

import re
re_word = re.compile(r'[\w-]+')

for line in open(thefilepath).xreadlines():
    for word in re_word.findall(line):
        dosomethingwith(word)

In this case, a word is defined as a maximal sequence of alphanumerics, underscores, and hyphens (\w matches the underscore as well as letters and digits).
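The difference between the two definitions is easiest to see side by side. The following sketch, with an invented sample line, runs both on the same text:

```python
import re

re_word = re.compile(r'[\w-]+')

line = "A well-known state_machine, re-used 3 times!"

# Whitespace splitting keeps punctuation attached to the words.
by_split = line.split()
# The regular expression keeps only runs of \w characters and hyphens.
by_regex = re_word.findall(line)

print(by_split)  # ['A', 'well-known', 'state_machine,', 're-used', '3', 'times!']
print(by_regex)  # ['A', 'well-known', 'state_machine', 're-used', '3', 'times']
```

The regular expression drops the trailing comma and exclamation mark that split leaves attached, while hyphenated words survive intact under both definitions.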

Discussion

For other definitions of words you will obviously need different regular expressions. The outer loop, on all lines in the file, can of course be done in many ways. The xreadlines method is good, but you can also use the list obtained by the readlines method, the standard library module fileinput, or, in Python 2.2, even just:

for line in open(thefilepath):

which is simplest and fastest.
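Of the alternatives just mentioned, fileinput is the one to reach for when the words are spread over several files. Here is a rough sketch in Python 3 (using fileinput.input as a context manager requires Python 3.2 or later); the function name words_from_files and the throwaway sample files are assumptions for the demonstration:

```python
import fileinput
import os
import tempfile

def words_from_files(paths):
    """Return all whitespace-separated words from the given files, in order."""
    collected = []
    # fileinput.input chains the lines of every named file into one sequence.
    with fileinput.input(files=paths) as lines:
        for line in lines:
            collected.extend(line.split())
    return collected

# Demonstrate on two temporary files that are removed afterward.
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, text in [("a.txt", "one two\n"), ("b.txt", "three\nfour five\n")]:
        path = os.path.join(tmpdir, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)
    result = words_from_files(paths)

print(result)  # ['one', 'two', 'three', 'four', 'five']
```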

In Python 2.2, it’s often a good idea to wrap iterations as iterator objects, most commonly by simple generators:

from __future__ import generators

def words_of_file(thefilepath):
    for line in open(thefilepath):
        for word in line.split():
            yield word

for word in words_of_file(thefilepath):
    dosomethingwith(word)

This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration. Once you have encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements. You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can perform it in just one place (the definition of the iterator) rather than having to hunt for all uses. The advantages are thus very similar to those you obtain, in any programming language, by appropriately defining and using functions rather than copying and pasting pieces of code all over the place. With Python 2.2’s iterators, you can get these advantages for looping control structures, too.
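As an illustration of that reuse, the sketch below feeds the same generator to two unrelated consumers. It is written for Python 3, where generators need no __future__ import, and it adds a with statement so the file is closed when the generator finishes. The sample file contents and the variable names counts and longest are invented for the example:

```python
import os
import tempfile
from collections import Counter

def words_of_file(thefilepath):
    """Yield each whitespace-separated word of the file, one at a time."""
    # The with block closes the file once the generator is exhausted.
    with open(thefilepath) as f:
        for line in f:
            for word in line.split():
                yield word

with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "sample.txt")
    with open(sample_path, "w") as f:
        f.write("spam eggs spam\nham spam eggs\n")

    # Two different "what to do with each word" tasks share one iterator
    # definition; neither needs to know how the words are produced.
    counts = Counter(words_of_file(sample_path))
    longest = max(words_of_file(sample_path), key=len)

print(counts["spam"], counts["eggs"], counts["ham"])  # 3 2 1
print(longest)  # spam
```

If the definition of a word later changes, say to the regular expression shown earlier, only words_of_file needs editing; both consumers stay as they are.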

See Also

Documentation for the fileinput module in the Library Reference; PEP 255 on simple generators (http://www.python.org/peps/pep-0255.html); Perl Cookbook Recipe 8.3.
