File Tools

External files are at the heart of much of what we do with shell utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It’s a core programming concept.

In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. Technically, open gives direct access to the stdio filesystem calls in the system’s C library—it returns a new file object that is connected to the external file and has methods that map more or less directly to file calls on your machine. The open function also provides a portable interface to the underlying filesystem—it works the same way on every platform on which Python runs.

Other file-related interfaces in Python allow us to do things such as manipulate lower-level descriptor-based files (os module), store objects away in files by key (anydbm and shelve modules), and access SQL databases. Most of these are larger topics addressed in Chapter 19.

In this chapter, we’ll take a brief tutorial look at the built-in file object and explore a handful of more advanced file-related topics. As usual, you should consult the library manual’s file object entry for further details and methods we don’t have space to cover here. Remember, for quick interactive help, you can also run dir(file) for an attributes list with methods, help(file) for general help, and help(file.read) for help on a specific method such as read. The built-in name file identifies the file datatype in recent Python releases.[*]
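For example, a quick interactive check along these lines verifies that the file type really has the methods described in this chapter (help prints a method’s documentation string):

>>> 'readlines' in dir(file)          # the file type has the expected methods
True
>>> help(file.seek)                   # prints documentation for the seek method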

Built-In File Objects

For most purposes, the open function is all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines), writing data (write, writelines), freeing system resources (close), moving about in the file (seek), forcing data to be transferred out of buffers (flush), fetching the underlying file handle (fileno), and more. Since the built-in file object is so easy to use, though, let’s jump right into a few interactive examples.

Output files

To make a new file, call open with two arguments: the external name of the file to be created and a mode string 'w' (short for write). To store data on the file, call the file object’s write method with a string containing the data to store, and then call the close method to close the file if you wish to open it again within the same program or session:

C:\temp>python
>>> file = open('data.txt', 'w')            # open output file object: creates
>>> file.write('Hello file world!\n')       # writes strings verbatim
>>> file.write('Bye   file world.\n')
>>> file.close()                            # closed on gc and exit too

And that’s it—you’ve just generated a brand-new text file on your computer, regardless of the computer on which you type this code:

C:\temp>dir data.txt /B
data.txt

C:\temp>type data.txt
Hello file world!
Bye   file world.

There is nothing unusual about the new file; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI too.

Opening

In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string. If we pass just a simple filename without a path, the file will appear in Python’s current working directory. That is, it shows up in the place where the code is run. Here, the directory C:\temp on my machine is implied by the bare filename data.txt, so this actually creates a file at C:\temp\data.txt. More accurately, the filename is relative to the current working directory if it does not include a complete absolute directory path. See the section “Current Working Directory,” in Chapter 3, for a refresher on this topic.
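For instance, you can verify the directory implied by a bare filename with os.getcwd, covered in Chapter 3 (the C:\temp result here reflects my working directory; yours will likely differ):

>>> import os
>>> os.getcwd()                        # directory implied by bare filenames
'C:\\temp'
>>> file = open('data.txt', 'w')       # really creates C:\temp\data.txt here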

Also note that when opening in w mode, Python either creates the external file if it does not yet exist or erases the file’s current contents if it is already present on your machine (so be careful out there—you’ll delete whatever was in the file before).

Writing

Notice that we added an explicit \n end-of-line character to lines written to the file; unlike the print statement, file write methods write exactly what they are passed without adding any extra formatting. The string passed to write shows up byte for byte on the external file.

Output files also sport a writelines method, which simply writes all of the strings in a list one at a time without adding any extra formatting. For example, here is a writelines equivalent to the two write calls shown earlier:

file.writelines(['Hello file world!\n', 'Bye   file world.\n'])

This call isn’t as commonly used (and can be emulated with a simple for loop), but it is convenient in scripts that save output in a list to be written later.
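In fact, writelines is roughly equivalent to the following sketch of a for loop, which writes each string in turn with no formatting added:

lines = ['Hello file world!\n', 'Bye   file world.\n']
for line in lines:                     # what writelines does for us
    file.write(line)                   # no extra formatting either way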

Closing

The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (i.e., when it is no longer referenced) and when the Python session or program exits. Because of that, close calls are often optional. In fact, it’s common to see file-processing code in Python like this:

open('somefile.txt', 'w').write("G'day Bruce\n")

Since this expression makes a temporary file object, writes to it immediately, and does not save a reference to it, the file object is reclaimed and closed right away without ever having called the close method explicitly.

Tip

But note that this auto-close-on-reclaim feature may change in future Python releases. Moreover, the Jython Java-based Python implementation discussed later does not reclaim files as immediately as the standard Python system (it uses Java’s garbage collector). If your script makes many files and your platform limits the number of open files per program, explicit close calls are a robust habit to form.

Also note that some IDEs, such as Python’s standard IDLE GUI, may hold on to your file objects longer than you expect, and thus prevent them from being garbage collected. If you write to an output file in IDLE, be sure to explicitly close (or flush) your file if you need to read it back in the same IDLE session. Otherwise, output buffers won’t be flushed to disk and your file may be incomplete when read.
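When you do want to guarantee a close regardless of exceptions raised while processing the file, a try/finally statement does the job; here is a minimal sketch of the pattern:

outfile = open('somefile.txt', 'w')
try:
    outfile.write("G'day Bruce\n")     # processing that might fail
finally:
    outfile.close()                    # runs whether an error occurs or not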

Input files

Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of r (for “read”) or no mode flag at all—it defaults to r if omitted, and it commonly is. Once opened, we can read the lines of a text file with the readlines method:

>>> file = open('data.txt', 'r')            # open input file object
>>> for line in file.readlines():           # read into line string list
...     print line,                         # lines have '\n' at end
...
Hello file world!
Bye   file world.

The readlines method loads the entire contents of the file into memory and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:

file.read()

Returns a string containing all the bytes stored in the file

file.read(N)

Returns a string containing the next N bytes from the file

file.readline()

Reads through the next \n and returns a line string

file.readlines()

Reads the entire file and returns a list of line strings

Let’s run these method calls to read files, lines, and bytes (more on the seek call, used here to rewind the file, in a moment):

>>> file.seek(0)                               # go back to the front of file
>>> file.read()                                # read entire file into string
'Hello file world!\nBye   file world.\n'

>>> file.seek(0)
>>> file.readlines()
['Hello file world!\n', 'Bye   file world.\n']

>>> file.seek(0)
>>> file.readline()                            # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()                            # empty string at end-of-file
''

>>> file.seek(0)
>>> file.read(1), file.read(8)
('H', 'ello fil')

All of these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

  • read() and readlines() load the entire file into memory all at once. That makes them handy for grabbing a file’s contents with as little code as possible. It also makes them very fast, but costly for huge files—loading a multigigabyte file into memory is not generally a good thing to do.

  • On the other hand, because the readline() and read(N) calls fetch just part of the file (the next line, or N-byte block), they are safer for potentially big files but a bit less convenient and usually much slower. Both return an empty string when they reach end-of-file. If speed matters and your files aren’t huge, read or readlines may be a better choice.

  • See also the discussion of the newer file iterators in the next section. Iterators provide the convenience of readlines() with the space efficiency of readline().

By the way, the seek(0) call used repeatedly here means “go back to the start of the file.” In our example, it is an alternative to reopening the file each time. In files, all read and write operations take place at the current position; files normally start at offset 0 when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation.

Python’s seek method also accepts an optional second argument that has one of three values—0 for absolute file positioning (the default), 1 to seek relative to the current position, and 2 to seek relative to the file’s end. When seek is passed only an offset argument of 0, as shown earlier, it’s roughly a file rewind operation.
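For example, combined with the file object’s tell method, an end-relative seek is a quick way to find a file’s size; this sketch assumes the two-line data.txt created earlier (38 bytes on Windows, where stored lines end in \r\n; 36 on Unix):

>>> f = open('data.txt', 'rb')     # binary mode: no end-of-line translations
>>> f.seek(0, 2)                   # 2 means: seek relative to the file's end
>>> f.tell()                       # offset at end == file size in bytes
38
>>> f.close()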

Reading lines with file iterators

The traditional way to read a file line by line that you saw in the prior section:

>>> file = open('data.txt')                 # open input file object
>>> for line in file.readlines():           # read into line string list
...     print line,

is actually more work than is needed today. In recent Pythons, the file object includes an iterator which is smart enough to grab just one more line per request in iteration contexts such as for loops and list comprehensions. Iterators are simply objects with next methods. The practical benefit of this extension is that you no longer need to call .readlines in a for loop to scan line by line; the iterator reads lines on request:

>>> file = open('data.txt')
>>> for line in file:                  # no need to call readlines
...     print line,                    # iterator reads next line each time
...
Hello file world!
Bye   file world.

>>> for line in open('data.txt'):      # even shorter: temporary file object
...     print line,
...
Hello file world!
Bye   file world.

Moreover, the iterator form does not load the entire file into a list of lines all at once, so it will be more space efficient for large text files. Because of that, this is the prescribed way to read line by line today; when in doubt, let Python do your work automatically. If you want to see what really happens inside the for loop, you can use the iterator manually; it’s similar to calling the readline method each time through, but read methods return an empty string at end-of-file (EOF), whereas the iterator raises an exception to end the iteration:

>>> file = open('data.txt')      # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')      # iterators: exception at EOF
>>> file.next()
'Hello file world!\n'
>>> file.next()
'Bye   file world.\n'
>>> file.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

Interestingly, iterators are automatically used in all iteration contexts, including the list constructor call, list comprehension expressions, map calls, and in membership checks:

>>> open('data.txt').readlines()
['Hello file world!\n', 'Bye   file world.\n']

>>> list(open('data.txt'))
['Hello file world!\n', 'Bye   file world.\n']

>>> lines = [line.rstrip() for line in open('data.txt')]        # or [:-1]
>>> lines
['Hello file world!', 'Bye   file world.']

>>> lines = [line.upper() for line in open('data.txt')]
>>> lines
['HELLO FILE WORLD!\n', 'BYE   FILE WORLD.\n']

>>> map(str.split, open('data.txt'))
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]

>>> line = 'Hello file world!\n'
>>> line in open('data.txt')
True

Iterators may seem somewhat implicit at first glance, but they represent the ways that Python makes developers’ lives easier over time.[*]

Other file object modes

Besides w and r, most platforms support an a open mode string, meaning “append.” In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:

>>> file = open('data.txt', 'a')          # open in append mode: doesn't erase
>>> file.write('The Life of Brian')       # added at end of existing data
>>> file.close()
>>>
>>> open('data.txt').read()               # open and read entire file
'Hello file world!\nBye   file world.\nThe Life of Brian'

Most files are opened using the sorts of calls we just ran, but open actually allows up to three arguments for more specific processing needs—the filename, the open mode, and a buffer size. All but the first of these are optional: if omitted, the open mode argument defaults to r (input), and the buffer size policy is to enable buffering on most platforms. Here are a few things you should know about all three open arguments:

Filename

As mentioned earlier, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described in the prior chapter). In general, any filename form you can type in your system shell will work in an open call. For instance, a filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory’s parent—up one, and down to directory temp.

Open mode

The open function accepts other modes too, some of which are not demonstrated in this book (e.g., r+, w+, and a+ to open for updating, and any mode string with a b to designate binary mode). For instance, mode r+ means both reads and writes are allowed on an existing file; w+ allows reads and writes but creates the file anew, erasing any prior content; and wb writes data in binary mode (more on this in the next section). Generally, whatever you could use as a mode string in the C language’s fopen call on your platform will work in the Python open function, since it really just calls fopen internally. (If you don’t know C, don’t sweat this point.) Notice that the contents of files are always strings in Python programs, regardless of mode: read methods return a string, and we pass a string to write methods.
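For instance, here is a sketch of in-place updating with mode r+, assuming data.txt holds just the two original lines; the write overlays bytes at the current position without truncating the rest of the file:

>>> f = open('data.txt', 'r+')          # read and write an existing file
>>> f.write('HELLO')                    # overwrite the first 5 bytes
>>> f.seek(0)
>>> f.read()
'HELLO file world!\nBye   file world.\n'
>>> f.close()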

Buffer size

The open call also takes an optional third buffer size argument, which lets you control stdio buffering for the file—the way that data is queued up before being transferred to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately), 1 means they are line buffered, any other positive value means to use a buffer of approximately that size, and a negative value means to use the system default (which you get if no third argument is passed and which generally means buffering is enabled). The buffer size argument works on most platforms, but it is currently ignored on platforms that don’t provide the setvbuf system call.
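For example, a simple event log that must survive a crash might disable buffering entirely, or request line buffering instead; a sketch (the filenames here are arbitrary):

log = open('events.log', 'w', 0)        # unbuffered: data hits disk at once
log.write('starting\n')                 # no flush or close needed to see this

log2 = open('audit.log', 'w', 1)        # line buffered: flushed per newline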

Binary datafiles

All of the preceding examples process simple text files. Python scripts can also open and process files containing binary data—JPEG images, audio clips, packed binary data produced by FORTRAN and C programs, and anything else that can be stored in files. The primary difference in terms of your code is the mode argument passed to the built-in open function:

>>> file = open('data.txt', 'wb')      # open binary output file
>>> file = open('data.txt', 'rb')      # open binary input file

Once you’ve opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. (readline and readlines don’t make sense here, though: binary data isn’t line oriented.)
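For example, here is a sketch of a portable binary file copy; it reads fixed-size chunks rather than lines (the filenames and the 1,024-byte block size are arbitrary choices here):

src = open('photo.jpg', 'rb')           # binary input: no translations
dst = open('photocopy.jpg', 'wb')       # binary output
while 1:
    chunk = src.read(1024)              # read up to 1K bytes at a time
    if not chunk: break                 # empty string at end-of-file
    dst.write(chunk)
src.close()
dst.close()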

In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. This works because Python string objects can always contain character bytes of any value (though some may look odd if printed). Interestingly, even a byte of value zero can be embedded in a Python string; it’s called \0 in escape-code notation and does not terminate strings in Python as it typically does in C. For instance:

>>> data = 'a\0b\0c'
>>> data
'a\x00b\x00c'
>>> len(data)
5

Instead of relying on a terminator character, Python keeps track of a string’s length explicitly. Here, data references a string of length 5 that happens to contain two zero-value bytes; they print in hexadecimal escape sequence form as \x00 (Python uses escapes to display all nonprintable characters). Because no character codes are reserved, it’s OK to read binary data with zero bytes (and other values) into a string in Python.

End-of-line translations on Windows

Strictly speaking, on some platforms you may not need the b at the end of the open mode argument to process binary files; the b is simply ignored, so modes r and w work just as well. In fact, the b in mode flag strings is usually required only for binary files on Windows. To understand why, though, you need to know how lines are terminated in text files.

For historical reasons, the end of a line of text in a file is represented by different characters on different platforms: it’s a single \n character on Unix and Linux, but the two-character sequence \r\n on Windows.[*] That’s why files moved between Linux and Windows may look odd in your text editor after transfer—they may still be stored using the original platform’s end-of-line convention. For example, most Windows editors handle text in Unix format, but Notepad is a notable exception—text files copied from Unix or Linux usually look like one long line when viewed in Notepad, with strange characters inside (\n). Similarly, transferring a file from Windows to Unix in binary mode retains the \r characters (which usually appear as ^M in text editors).

Python scripts don’t normally have to care, because the Windows port (actually, the underlying C compiler on Windows) automatically maps the DOS \r\n sequence to a single \n. It works like this—when scripts are run on Windows:

  • For files opened in text mode, \r\n is translated to \n when input.

  • For files opened in text mode, \n is translated to \r\n when output.

  • For files opened in binary mode, no translation occurs on input or output.

  • On Unix-like platforms, no translations occur, regardless of open modes.

You should keep in mind two important consequences of all of these rules. First, the end-of-line character is almost always represented as a single \n in all Python scripts, regardless of how it is stored in external files on the underlying platform. By mapping to and from \n on input and output, the Windows port hides the platform-specific difference.

The second consequence of the mapping is subtler: if you mean to process binary data files on Windows, you generally must be careful to open those files in binary mode (rb, wb), not in text mode (r, w). Otherwise, the translations listed previously could very well corrupt data as it is input or output. It’s not impossible that binary data would by chance contain bytes with values the same as the DOS end-of-line characters, \r and \n. If you process such binary files in text mode on Windows, \r bytes may be incorrectly discarded when read and \n bytes may be erroneously expanded to \r\n when written. The net effect is that your binary data will be trashed when read and written—probably not quite what you want! For example, on Windows:

>>> len('a\0b\rc\r\nd')                            # 4 escape code bytes
8
>>> open('temp.bin', 'wb').write('a\0b\rc\r\nd')   # write binary data to file

>>> open('temp.bin', 'rb').read()                  # intact if read as binary
'a\x00b\rc\r\nd'

>>> open('temp.bin', 'r').read()                   # loses a \r in text mode!
'a\x00b\rc\nd'

>>> open('temp.bin', 'w').write('a\0b\rc\r\nd')    # adds a \r in text mode!
>>> open('temp.bin', 'rb').read()
'a\x00b\rc\r\r\nd'

This is an issue only when running on Windows, but using binary open modes rb and wb for binary files everywhere won’t hurt on other platforms and will help make your scripts more portable (you never know when a Unix utility may wind up seeing action on your Windows machine).

You may want to use binary file open modes at other times as well. For instance, in Chapter 7, we’ll meet a script called fixeoln_one that translates between DOS and Unix end-of-line character conventions in text files. Such a script also has to open text files in binary mode to see what end-of-line characters are truly present on the file; in text mode, they would already be translated to \n by the time they reached the script.

Parsing packed binary data with the struct module

By using the letter b in the open call, you can open binary datafiles in a platform-neutral way and read and write their content with normal file object methods. But how do you process binary data once it has been read? It will be returned to your script as a simple string of bytes, most of which are not printable characters (that’s why Python displays them with \xNN hexadecimal escape sequences).

If you just need to pass binary data along to another file or program, your work is done. And if you just need to extract a number of bytes from a specific position, string slicing will do the job. To get at the deeper contents of binary data, though, as well as to construct its contents, the standard library struct module is more powerful.

The struct module provides calls to pack and unpack binary data, as though the data was laid out in a C-language struct declaration. It is also capable of composing and decomposing using any endian-ness you desire (endian-ness determines whether the most significant bits are on the left or on the right). Building a binary datafile, for instance, is straightforward: pack Python values into a string and write them to a file. The format string here in the pack call means big-endian (>), with an integer, four-character string, half integer, and float:

>>> import struct
>>> data = struct.pack('>i4shf', 2, 'spam', 3, 1.234)
>>> data
'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
>>> file.close()

As usual, Python displays here most of the packed binary data’s bytes with \xNN hexadecimal escape sequences, because the bytes are not printable characters. To parse data like that which we just produced, read it off the file and pass it to the struct module with the same format string; you get back a tuple containing the values parsed out of the string and converted to Python objects:

>>> import struct
>>> file   = open('data.bin', 'rb')
>>> bytes  = file.read()
>>> values = struct.unpack('>i4shf', bytes)
>>> values
(2, 'spam', 3, 1.2339999675750732)

For more details, see the struct module’s entry in the Python library manual. Also note that slicing comes in handy in this domain; to grab just the four-character string in the middle of the packed binary data we just read, we can simply slice it out. Numeric values could similarly be sliced out and then passed to struct.unpack for conversion:

>>> bytes
'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> string = bytes[4:8]
>>> string
'spam'

>>> number = bytes[8:10]
>>> number
'\x00\x03'
>>> struct.unpack('>h', number)
(3,)
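If you’d rather compute such slice offsets than hardcode them, the struct module’s calcsize function returns the number of bytes described by a format string; per the format used here, the h field begins at offset 8:

>>> struct.calcsize('>i4shf')          # total size of the packed record
14
>>> struct.calcsize('>i4s')            # bytes before the h field: offset 8
8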

File Tools in the os Module

The os module contains an additional set of file-processing functions that are distinct from the built-in file object tools demonstrated in previous examples. For instance, here is a very partial list of os file-related calls:

os.open(path, flags, mode)

Opens a file and returns its descriptor

os.read(descriptor, N)

Reads at most N bytes and returns a string

os.write(descriptor, string)

Writes bytes in string to the file

os.lseek(descriptor, position, how)

Moves to position in the file

Technically, os calls process files by their descriptors, which are integer codes or “handles” that identify files in the operating system. Because the descriptor-based file tools in os are lower level and more complex than the built-in file objects created with the built-in open function, you should generally use the latter for all but very special file-processing needs.[*]

To give you the general flavor of this tool set, though, let’s run a few interactive experiments. Although built-in file objects and os module descriptor files are processed with distinct tool sets, they are in fact related—the stdio filesystem used by file objects simply adds a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor associated with a built-in file object. For instance, the standard stream file objects have descriptors 0, 1, and 2; calling the os.write function to send data to stdout by descriptor has the same effect as calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print stream.fileno(),
...
0 1 2

>>> sys.stdout.write('Hello stdio world\n')        # write via file method
Hello stdio world

>>> import os
>>> os.write(1, 'Hello descriptor world\n')        # write via os module
Hello descriptor world
23

Because file objects we open explicitly behave the same way, it’s also possible to process a given real external file on the underlying computer through the built-in open function, tools in the os module, or both:

>>> file = open(r'C:\temp\spam.txt', 'w')          # create external file
>>> file.write('Hello stdio file\n')               # write via file method
>>>
>>> fd = file.fileno()
>>> print fd
3
>>> os.write(fd, 'Hello descriptor file\n')        # write via os module
22
>>> file.close()
>>>
C:\WINDOWS>type c:\temp\spam.txt                   # both writes show up
Hello descriptor file
Hello stdio file

Open mode flags

So why the extra file tools in os? In short, they give more low-level control over file processing. The built-in open function is easy to use but is limited by the underlying stdio filesystem that it wraps; buffering, open modes, and so on, are all per-stdio defaults.[*] The os module lets scripts be more specific—for example, the following opens a descriptor-based file in read-write and binary modes by performing a binary “or” on two mode flags exported by os:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
'Hello descriptor fil'
>>> os.lseek(fdfile, 0, 0)                        # go back to start of file
0
>>> os.read(fdfile, 100)                          # binary mode retains "\r\n"
'Hello descriptor file\r\nHello stdio file\r\n'

>>> os.lseek(fdfile, 0, 0)
0
>>> os.write(fdfile, 'HELLO')                     # overwrite first 5 bytes
5

On some systems, such open flags let us specify more advanced things like exclusive access (O_EXCL) and nonblocking modes (O_NONBLOCK) when a file is opened. Some of these flags are not portable across platforms (another reason to use built-in file objects most of the time); see the library manual or run a dir(os) call on your machine for an exhaustive list of other open flags available.

We saw earlier how to go from file object to file descriptor with the fileno file method; we can also go the other way—the os.fdopen call wraps a file descriptor in a file object. Because conversions work both ways, we can generally use either tool set—file object or os module:

>>> objfile = os.fdopen(fdfile)
>>> objfile.seek(0)
>>> objfile.read()
'HELLO descriptor file\r\nHello stdio file\r\n'

Tip

Using os.open with the O_EXCL flag is the most portable way to lock files for concurrent updates or other process synchronization in Python today. Another module, fcntl, also provides file-locking tools but is not as widely available across platforms. As of this writing, locking with os.open is supported in Windows, Unix, and Macintosh; fcntl works only on Unix.
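Here is a sketch of the exclusive-create technique in action; the lock.tmp filename is an arbitrary choice, and whichever process creates the file first is the one that acquires the lock:

import os
try:
    fd = os.open('lock.tmp', os.O_CREAT | os.O_EXCL | os.O_RDWR)
except OSError:
    print 'lock is held by another process'
else:
    # we created the file: lock acquired; do protected work here
    os.close(fd)
    os.remove('lock.tmp')              # release by deleting the lock file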

Other os file tools

The os module also includes an assortment of file tools that accept a file pathname string and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file’s owner and permission settings (os.chown, os.chmod). Let’s step through a few examples of these tools in action:

>>> os.chmod('spam.txt', 0777)           # enable all accesses

This os.chmod file permissions call passes a 9-bit value composed of three sets of three bits each. From left to right, the three sets represent the file’s owning user, the file’s group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is “1” in this value, it means that the corresponding operation is allowed for that category of accessor. For instance, octal 0777 is nine “1” bits in binary, so it enables all three kinds of accesses for all three user groups; octal 0600 means that the file can be read and written only by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).

This scheme stems from Unix file permission settings, but it works on Windows as well. If it’s puzzling, either check a Unix manpage for chmod or see the fixreadonly example in Chapter 7 for a practical application (it makes read-only files that are copied off a CD-ROM writable).
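A quick interactive check confirms the bit layout just described:

>>> int('111111111', 2) == 0777        # all three accesses for all three sets
True
>>> int('110000000', 2) == 0600        # owner read and write only
True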

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')      # (from, to)
>>>
>>> os.remove(r'C:\temp\spam.txt')                           # delete file
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: 'C:\\temp\\spam.txt'
>>>
>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file’s name; the os.remove file deletion call deletes a file from your system and is synonymous with os.unlink (the latter reflects the call’s name on Unix but was obscure to users of other platforms). The os module also exports the stat system call:

>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
(33206, 0, 2, 1, 0, 0, 41, 968133600, 968176258, 968176193)

>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]
(33206, 41)

>>> mode = info[stat.ST_MODE]
>>> stat.S_ISDIR(mode), stat.S_ISREG(mode)
(0, 1)

The os.stat call returns a tuple of values giving low-level information about the named file, and the stat module exports constants and functions for querying this information in a portable way. For instance, indexing an os.stat result on offset stat.ST_SIZE returns the file’s size, and calling stat.S_ISDIR with the mode item from an os.stat result checks whether the file is a directory. As shown earlier, though, both of these operations are available in the os.path module too, so it’s rarely necessary to use os.stat except for low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(0, 1, 41)

File Scanners

Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 4-1. PP3E\System\Filetools\scanfile.py

def scanner(name, function):
    file = open(name, 'r')               # create a file object
    while 1:
        line = file.readline()           # call file methods
        if not line: break               # until end-of-file
        function(line)                   # call a function object
    file.close()

The scanner function doesn’t care what line-processing function is passed in, and that accounts for most of its generality—it is happy to apply any single-argument function that exists now or in the future to all of the lines in a text file. If we code this module and put it in a directory on PYTHONPATH, we can use it any time we need to step through a file line by line. Example 4-2 is a client script that does simple line translations.

Example 4-2. PP3E\System\Filetools\commands.py

#!/usr/local/bin/python
from sys import argv
from scanfile import scanner
class UnknownCommand(Exception): pass

def processLine(line):                      # define a function
    if line[0] == '*':                      # applied to each line
        print "Ms.", line[1:-1]
    elif line[0] == '+':
        print "Mr.", line[1:-1]             # strip first and last char: \n
    else:
        raise UnknownCommand, line          # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]       # allow filename cmd arg
scanner(filename, processLine)              # start the scanner

The text file hillbillies.txt contains the following lines:

*Granny
+Jethro
*Elly May
+"Uncle Jed"

and our commands script could be run as follows:

C:\...\PP3E\System\Filetools>python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"

Notice that we could also code the command processor in the following way; especially if the number of command options starts to become large, such a data-driven approach may be more concise and easier to maintain than a large if statement with essentially redundant actions (if you ever have to change the way output lines print, you’ll have to change it in only one place with this form):

commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print commands[line[0]], line[1:-1]
    except KeyError:
        raise UnknownCommand, line

As a rule of thumb, we can also usually speed things up by shifting processing from Python code to built-in tools. For instance, if we’re concerned with speed (and memory space isn’t tight), we can make our file scanner faster by using the readlines method to load the file into a list all at once instead of using the manual readline loop in Example 4-1:

def scanner(name, function):
    file = open(name, 'r')               # create a file object
    for line in file.readlines():        # get all lines at once
        function(line)                   # call a function object
    file.close()

A file iterator will do the same work but will not load the entire file into memory all at once (and the temporary file object is closed for us when it is reclaimed):

def scanner(name, function):
    for line in open(name, 'r'):        # scan line by line
        function(line)                  # call a function object

And if we have a list of lines, we can work more magic with the map built-in function or list comprehension expression. Here are two minimalist versions; the for loop is replaced by map or a comprehension, and we let Python close the file for us when it is garbage collected or the script exits (both of these build a temporary list of results along the way, which is likely trivial for all but the largest of files):

def scanner(name, function):
    map(function, open(name, 'r'))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

But what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 4-3. PP3E\System\Filetools\filters.py

def filter_files(name, function):         # filter file through function
    input  = open(name, 'r')              # create file objects
    output = open(name + '.out', 'w')     # explicit output file too
    for line in input:
        output.write(function(line))      # write the modified line
    input.close()
    output.close()                        # output has a '.out' suffix

def filter_stream(function):
    import sys                            # no explicit files
    while 1:                              # use standard streams
        line = sys.stdin.readline()       # or: raw_input()
        if not line: break
        print function(line),             # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)      # copy stdin to stdout if run

Since the standard streams are preopened for us, they’re often easier to use. This module is more useful when imported as a library (clients provide the line-processing function); when run standalone it simply parrots stdin to stdout:

C:\...\PP3E\System\Filetools>python filters.py < ..\System.txt
This directory contains operating system interface examples.

Many of the examples in this unit appear elsewhere in the examples
distribution tree, because they are actually used to manage other
programs. See the README.txt files in the subdirectories here
for pointers.

Tip

Brutally observant readers may notice that this last file is named filters.py (with an s), not filter.py. I originally named it the latter but changed its name when I realized that a simple import of the filename (e.g., “import filter”) assigns the module to a local name “filter,” thereby hiding the built-in filter function. This is a built-in functional programming tool that is not used very often in typical scripts. And as mentioned earlier, redefining built-in names this way is not an issue unless you really need to use the built-in version of the name. But as a general rule of thumb, be careful to avoid picking built-in names for module files. I will if you will.



[*] Technically, you can use the name file anywhere you use open, though open is still the generally preferred call unless you are subclassing to customize files. We’ll use open in most of our examples. As for all built-in names, it’s OK to use the name file for your own variables as long as you don’t need direct access to the built-in file datatype (your file name will hide the built-in scope’s file). In fact, this is such a common practice that we’ll frequently follow it here. This is not a sin, but you should generally be careful about reusing built-in names in this way.

[*] This is so useful that I was able to remove an entire section from this chapter in this edition, which wrapped a file object in a class to allow iteration over lines in a for loop. In fact, that example became completely superfluous and no longer worked as described after the second edition of this book. Technically, its __getitem__ indexing overload method was never called anymore because for loops now look for a file object’s __iter__ iteration method first. You don’t have to know what that means, because iteration is a core feature of file objects today.

[*] Actually, it gets worse: on the classic Mac, lines in text files are terminated with a single \r (not \n or \r\n). The more modern Mac is a Unix-based machine and normally follows that platform’s conventions instead. Whoever said proprietary software was good for the consumer probably wasn’t speaking about users of multiple platforms, and certainly wasn’t talking about programmers.

[*] For instance, to process pipes, described in Chapter 5. The Python pipe call returns two file descriptors, which can be processed with os module tools or wrapped in a file object with os.fdopen.

[*] To be fair to the built-in file object, the open function accepts an rb+ mode, which is equivalent to the combined mode flags used here and can also be made nonbuffered with a buffer size argument. Whenever possible, use open, not os.open.
