Chapter 4. File and Directory Tools

“Erase Your Hard Drive in Five Easy Steps!”

This chapter continues our look at system interfaces in Python by focusing on file and directory-related tools. As you’ll see, it’s easy to process files and directory trees with Python’s built-in and standard library support. Because files are part of the core Python language, some of this chapter’s material is a review of file basics covered in books like Learning Python, Fourth Edition, and we’ll defer to such resources for more background details on some file-related concepts. For example, iteration, context managers, and the file object’s support for Unicode encodings are demonstrated along the way, but these topics are not repeated in full here. This chapter’s goal is to tell enough of the file story to get you started writing useful scripts.

File Tools

External files are at the heart of much of what we do with system utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It’s a core programming concept.

In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. When called, the open function returns a new file object that is connected to the external file; the file object has methods that transfer data to and from the file and perform a variety of file-related operations. The open function also provides a portable interface to the underlying filesystem—it works the same way on every platform on which Python runs.

Other file-related modules built into Python allow us to do things such as manipulate lower-level descriptor-based files (os); copy, remove, and move files and collections of files (os and shutil); store data and objects in files by key (dbm and shelve); and access SQL databases (sqlite3 and third-party add-ons). The last two of these categories are related to database topics, addressed in Chapter 17.

In this section, we’ll take a brief tutorial look at the built-in file object and explore a handful of more advanced file-related topics. As usual, you should consult either Python’s library manual or reference books such as Python Pocket Reference for further details and methods we don’t have space to cover here. Remember, for quick interactive help, you can also run dir(file) on an open file object to see an attributes list that includes methods; help(file) for general help; and help(file.read) for help on a specific method such as read, though the file object implementation in 3.1 provides less information for help than the library manual and other resources.
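For instance, here's a quick sketch of this sort of interactive exploration—the filename here is arbitrary, and the exact attribute list varies per Python version, so your results may differ:

>>> f = open('data.txt')
>>> [name for name in dir(f) if name.startswith('read')]    # just the read methods
['read', 'readable', 'readline', 'readlines']
>>> help(f.readline)                                        # prints one method's docs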

The File Object Model in Python 3.X

Just like the string types we noted in Chapter 2, file support in Python 3.X is a bit richer than it was in the past. As we noted earlier, in Python 3.X str strings always represent Unicode text (ASCII or wider), and bytes and bytearray strings represent raw binary data. Python 3.X draws a similar and related distinction between files containing text and binary data:

  • Text files contain Unicode text. In your script, text file content is always a str string—a sequence of characters (technically, Unicode “code points”). Text files perform the automatic line-end translations described in this chapter by default and automatically apply Unicode encodings to file content: they encode to and decode from raw binary bytes on transfers to and from the file, according to a provided or default encoding name. Encoding is trivial for ASCII text, but may be sophisticated in other cases.

  • Binary files contain raw 8-bit bytes. In your script, binary file content is always a byte string, usually a bytes object—a sequence of small integers, which supports most str operations and displays as ASCII characters whenever possible. Binary files perform no translations of data when it is transferred to and from files: no line-end translations or Unicode encodings are performed.

In practice, text files are used for all truly text-related data, and binary files store items like packed binary data, images, audio files, executables, and so on. As a programmer you distinguish between the two file types in the mode string argument you pass to open: adding a “b” (e.g., 'rb', 'wb') means the file contains binary data. For coding new file content, use normal strings for text (e.g., 'spam' or bytes.decode()) and byte strings for binary (e.g., b'spam' or str.encode()).

Unless your files are limited to ASCII text, the 3.X text/binary distinction can sometimes impact your code. Text files create and require str strings, and binary files use byte strings; because you cannot freely mix the two string types in expressions, you must choose file mode carefully. Many built-in tools we’ll use in this book make the choice for us; the struct and pickle modules, for instance, deal in byte strings in 3.X, and the xml package in Unicode str. You must even be aware of the 3.X text/binary distinction when using system tools like pipe descriptors and sockets, because they transfer data as byte strings today (though their content can be decoded and encoded as Unicode text if needed).

Moreover, because text-mode files require that content be decodable per a Unicode encoding scheme, you must read undecodable file content in binary mode, as byte strings (or catch Unicode exceptions in try statements and skip the file altogether). This may include both truly binary files as well as text files that use encodings that are nondefault and unknown. As we’ll see later in this chapter, because str strings are always Unicode in 3.X, it’s sometimes also necessary to select byte string mode for the names of files in directory tools such as os.listdir, glob.glob, and os.walk if they cannot be decoded (passing in byte strings essentially suppresses decoding).
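Because this pattern crops up repeatedly in later chapters, here is a hedged sketch of both ideas; the filename variable is hypothetical:

import os

try:
    text = open(filename, 'r').read()        # try the platform default decoding
except UnicodeDecodeError:
    data = open(filename, 'rb').read()       # fall back on raw, undecoded bytes

names = os.listdir(b'.')                     # bytes argument: names as byte strings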

In fact, we’ll see examples where the Python 3.X distinction between str text and bytes binary pops up in tools beyond basic files throughout this book—in Chapters 5 and 12 when we explore sockets; in Chapters 6 and 11 when we’ll need to ignore Unicode errors in file and directory searches; in Chapter 12, where we’ll see how client-side Internet protocol modules such as FTP and email, which run atop sockets, imply file modes and encoding requirements; and more.

But just as for string types, although we will see some of these concepts in action in this chapter, we’re going to take much of this story as a given here. File and string objects are core language material and are prerequisite to this text. As mentioned earlier, because they are addressed by a 45-page chapter in the book Learning Python, Fourth Edition, I won’t repeat their coverage in full in this book. If you find yourself confused by the Unicode and binary file and string concepts in the following sections, I encourage you to refer to that text or other resources for more background information in this domain.

Using Built-in File Objects

Despite the text/binary dichotomy in Python 3.X, files are still very straightforward to use. For most purposes, in fact, the open built-in function and its file objects are all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines); writing data (write, writelines); freeing system resources (close); moving to arbitrary positions in the file (seek); forcing data in output buffers to be transferred to disk (flush); fetching the underlying file handle (fileno); and more. Since the built-in file object is so easy to use, let’s jump right into a few interactive examples.

Output files

To make a new file, call open with two arguments: the external name of the file to be created and a mode string w (short for write). To store data on the file, call the file object’s write method with a string containing the data to store, and then call the close method to close the file. File write calls return the number of characters or bytes written (which we’ll sometimes omit in this book to save space), and as we’ll see, close calls are often optional, unless you need to open and read the file again during the same program or session:

C:\temp> python
>>> file = open('data.txt', 'w')            # open output file object: creates
>>> file.write('Hello file world!\n')       # writes strings verbatim
18
>>> file.write('Bye   file world.\n')       # returns number chars/bytes written
18
>>> file.close()                            # closed on gc and exit too

And that’s it—you’ve just generated a brand-new text file on your computer, regardless of the computer on which you type this code:

C:\temp> dir data.txt /B
data.txt

C:\temp> type data.txt
Hello file world!
Bye   file world.

There is nothing unusual about the new file; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI, too.

Opening

In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string. If we pass just a simple filename without a path, the file will appear in Python’s current working directory. That is, it shows up in the place where the code is run. Here, the directory C:\temp on my machine is implied by the bare filename data.txt, so this actually creates a file at C:\temp\data.txt. More accurately, the filename is relative to the current working directory if it does not include a complete absolute directory path. See Current Working Directory in Chapter 3 for a refresher on this topic.
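For instance, here’s a quick sketch of the mapping from bare filename to absolute path on my machine, using the os module tools introduced in Chapter 3:

>>> import os
>>> os.getcwd()                          # the directory implied by bare filenames
'C:\\temp'
>>> os.path.abspath('data.txt')          # where open('data.txt') really looks
'C:\\temp\\data.txt'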

Also note that when opening in w mode, Python either creates the external file if it does not yet exist or erases the file’s current contents if it is already present on your machine (so be careful out there—you’ll delete whatever was in the file before).

Writing

Notice that we added an explicit \n end-of-line character to lines written to the file; unlike the print built-in function, file object write methods write exactly what they are passed without adding any extra formatting. The string passed to write shows up character for character on the external file. In text files, data written may undergo line-end or Unicode translations which we’ll describe ahead, but these are undone when the data is later read back.
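By contrast, the print built-in function adds the end-of-line character for us when it is directed to a file—a quick sketch of the equivalence:

>>> file = open('data.txt', 'w')
>>> print('Hello file world!', file=file)       # print appends \n automatically
>>> file.write('Bye   file world.\n')           # write stores exactly what's passed
18
>>> file.close()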

Output files also sport a writelines method, which simply writes all of the strings in a list one at a time without adding any extra formatting. For example, here is a writelines equivalent to the two write calls shown earlier:

file.writelines(['Hello file world!\n', 'Bye   file world.\n'])

This call isn’t as commonly used (and can be emulated with a simple for loop or other iteration tool), but it is convenient in scripts that save output in a list to be written later.

Closing

The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (that is, when it is no longer referenced). This includes all remaining open files when the Python session or program exits. Because of that, close calls are often optional. In fact, it’s common to see file-processing code in Python in this idiom:

open('somefile.txt', 'w').write("G'day Bruce\n")       # write to temporary object
open('somefile.txt', 'r').read()                       # read from temporary object

Since both these expressions make a temporary file object, use it immediately, and do not save a reference to it, the file object is reclaimed right after data is transferred, and is automatically closed in the process. There is usually no need for such code to call the close method explicitly.

In some contexts, though, you may wish to explicitly close anyhow:

  • For one, because the Jython implementation relies on Java’s garbage collector, you can’t always be as sure about when files will be reclaimed as you can in standard Python. If you run your Python code with Jython, you may need to close manually if many files are created in a short amount of time (e.g., in a loop), in order to avoid running out of file resources on operating systems where this matters.

  • For another, some IDEs, such as Python’s standard IDLE GUI, may hold on to your file objects longer than you expect (in stack tracebacks of prior errors, for instance), and thus prevent them from being garbage collected as soon as you might expect. If you write to an output file in IDLE, be sure to explicitly close (or flush) your file if you need to reliably read it back during the same IDLE session. Otherwise, output buffers might not be flushed to disk and your file may be incomplete when read.

  • And while it seems very unlikely today, it’s not impossible that this auto-close-on-reclaim feature could change in the future. This is technically a feature of the file object’s implementation, which may or may not be considered part of the language definition over time.

For these reasons, manual close calls are not a bad idea in nontrivial programs, even if they are technically not required. Closing is a generally harmless, and usefully robust, habit to form.

Ensuring file closure: Exception handlers and context managers

Manual file close method calls are easy in straight-line code, but how do you ensure file closure when exceptions might kick your program beyond the point where the close call is coded? First of all, make sure that you must close at all—files close themselves when they are garbage collected, and this will happen eventually, even when exceptions occur.

If closure is required, though, there are two basic alternatives: the try statement’s finally clause is the most general, since it allows you to provide general exit actions for any type of exceptions:

myfile = open(filename, 'w')
try:
    ...process myfile...
finally:
    myfile.close()

In recent Python releases, though, the with statement provides a more concise alternative for some specific objects and exit actions, including closing files:

with open(filename, 'w') as myfile:
    ...process myfile, auto-closed on statement exit...

This statement relies on the file object’s context manager: code automatically run both on statement entry and on statement exit regardless of exception behavior. Because the file object’s exit code closes the file automatically, this guarantees file closure whether an exception occurs during the statement or not.
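If you’re curious about what this protocol looks like, here is a minimal sketch of a file-closing context manager coded in Python; the class name is hypothetical, and real file objects implement the equivalent internally:

class AutoClose:                                # hypothetical illustration only
    def __init__(self, filename, mode='r'):
        self.file = open(filename, mode)
    def __enter__(self):                        # run on with statement entry
        return self.file                        # the value assigned by "as"
    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()                       # run on exit, exception or not
        return False                            # False: don't suppress exceptions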

The with statement is notably shorter (3 lines) than the try/finally alternative, but it’s also less general—with applies only to objects that support the context manager protocol, whereas try/finally allows arbitrary exit actions for arbitrary exception contexts. While some other object types have context managers, too (e.g., thread locks), with is limited in scope. In fact, if you want to remember just one option for exit actions, try/finally is the most inclusive. Still, with yields less code for files that must be closed and can serve well in such specific roles. It can even save a line of code when no exceptions are expected (albeit at the expense of further nesting and indenting file processing logic):

myfile = open(filename, 'w')               # traditional form
...process myfile...
myfile.close()

with open(filename) as myfile:             # context manager form
    ...process myfile...

In Python 3.1 and later, this statement can also specify multiple (a.k.a. nested) context managers—any number of context manager items may be separated by commas, and multiple items work the same as nested with statements. In general terms, the 3.1 and later code:

with A() as a, B() as b:
    ...statements...

Runs the same as the following, which works in 3.1, 3.0, and 2.6:

with A() as a:
    with B() as b:
        ...statements...

For example, when the with statement block exits in the following, both files’ exit actions are automatically run to close the files, regardless of exception outcomes:

with open('data') as fin, open('results', 'w') as fout:
    for line in fin:
        fout.write(transform(line))

Context manager–dependent code like this seems to have become more common in recent years, but this is likely at least in part because newcomers are accustomed to languages that require manual close calls in all cases. In most contexts there is no need to wrap all your Python file-processing code in with statements—the file object’s auto-close-on-collection behavior often suffices, and manual close calls are enough for many other scripts. You should use the with or try options outlined here only if you must close, and only in the presence of potential exceptions. Since standard C Python automatically closes files on collection, though, neither option is required in many (and perhaps most) scripts.

Input files

Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of r (for “read”) or no mode flag at all—it defaults to r if omitted, and it commonly is. Once opened, we can read the lines of a text file with the readlines method:

C:\temp> python
>>> file = open('data.txt')                  # open input file object: 'r' default
>>> lines = file.readlines()                 # read into line string list
>>> for line in lines:                       # BUT use file line iterator! (ahead)
...     print(line, end='')                  # lines have a '\n' at end
...
Hello file world!
Bye   file world.

The readlines method loads the entire contents of the file into memory and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:

file.read()

Returns a string containing all the characters (or bytes) stored in the file

file.read(N)

Returns a string containing the next N characters (or bytes) from the file

file.readline()

Reads through the next \n and returns a line string

file.readlines()

Reads the entire file and returns a list of line strings

Let’s run these method calls to read files, lines, and characters from a text file—the seek(0) call is used here before each test to rewind the file to its beginning (more on this call in a moment):

>>> file.seek(0)                               # go back to the front of file
>>> file.read()                                # read entire file into string
'Hello file world!\nBye   file world.\n'

>>> file.seek(0)                               # read entire file into lines list
>>> file.readlines()
['Hello file world!\n', 'Bye   file world.\n']

>>> file.seek(0)
>>> file.readline()                            # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()                            # empty string at end-of-file
''

>>> file.seek(0)                               # read N (or remaining) chars/bytes
>>> file.read(1), file.read(8)                 # empty string at end-of-file
('H', 'ello fil')

All of these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

  • read() and readlines() load the entire file into memory all at once. That makes them handy for grabbing a file’s contents with as little code as possible. It also makes them generally fast, but costly in terms of memory for huge files—loading a multigigabyte file into memory is not generally a good thing to do (and might not be possible at all on a given computer).

  • On the other hand, because the readline() and read(N) calls fetch just part of the file (the next line or N-character-or-byte block), they are safer for potentially big files but a bit less convenient and sometimes slower. Both return an empty string when they reach end-of-file. If speed matters and your files aren’t huge, read or readlines may be a generally better choice.

  • See also the discussion of the newer file iterators in the next section. As we’ll see, iterators combine the convenience of readlines() with the space efficiency of readline() and are the preferred way to read text files by lines today.

The seek(0) call used repeatedly here means “go back to the start of the file.” In our example, it is an alternative to reopening the file each time. In files, all read and write operations take place at the current position; files normally start at offset 0 when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation. More on this method later when we explore random access files.

Reading lines with file iterators

In older versions of Python, the traditional way to read a file line by line in a for loop was to read the file into a list that could be stepped through as usual:

>>> file = open('data.txt')
>>> for line in file.readlines():    # DON'T DO THIS ANYMORE!
...     print(line, end='')

If you’ve already studied the core language using a first book like Learning Python, you may already know that this coding pattern is actually more work than is needed today—both for you and your computer’s memory. In recent Pythons, the file object includes an iterator which is smart enough to grab just one line per request in all iteration contexts, including for loops and list comprehensions. The practical benefit of this extension is that you no longer need to call readlines in a for loop to scan line by line—the iterator reads lines on request automatically:

>>> file = open('data.txt')
>>> for line in file:                  # no need to call readlines
...     print(line, end='')            # iterator reads next line each time
...
Hello file world!
Bye   file world.

Better still, you can open the file in the loop statement itself, as a temporary which will be automatically closed on garbage collection when the loop ends (that’s normally the file’s sole reference):

>>> for line in open('data.txt'):      # even shorter: temporary file object
...     print(line, end='')            # auto-closed when garbage collected
...
Hello file world!
Bye   file world.

Moreover, this file line-iterator form does not load the entire file into a list of lines all at once, so it will be more space efficient for large text files. Because of that, this is the prescribed way to read line by line today. If you want to see what really happens inside the for loop, you can use the iterator manually; it’s just a __next__ method (run by the next built-in function), which is similar to calling the readline method each time through, except that read methods return an empty string at end-of-file (EOF) and the iterator raises an exception to end the iteration:

>>> file = open('data.txt')      # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')      # iterators: exception at EOF
>>> file.__next__()              # no need to call iter(file) first,
'Hello file world!\n'            # since files are their own iterator
>>> file.__next__()
'Bye   file world.\n'
>>> file.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Interestingly, iterators are automatically used in all iteration contexts, including the list constructor call, list comprehension expressions, map calls, and in membership checks:

>>> open('data.txt').readlines()                             # always read lines
['Hello file world!\n', 'Bye   file world.\n']

>>> list(open('data.txt'))                                   # force line iteration
['Hello file world!\n', 'Bye   file world.\n']

>>> lines = [line.rstrip() for line in open('data.txt')]     # comprehension
>>> lines
['Hello file world!', 'Bye   file world.']

>>> lines = [line.upper() for line in open('data.txt')]      # arbitrary actions
>>> lines
['HELLO FILE WORLD!\n', 'BYE   FILE WORLD.\n']

>>> list(map(str.split, open('data.txt')))                   # apply a function
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]

>>> line = 'Hello file world!\n'
>>> line in open('data.txt')                                 # line membership
True

Iterators may seem somewhat implicit at first glance, but they’re representative of the many ways that Python makes developers’ lives easier over time.

Other open options

Besides the w and (default) r file open modes, most platforms support an a mode string, meaning “append.” In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:

>>> file = open('data.txt', 'a')          # open in append mode: doesn't erase
>>> file.write('The Life of Brian')       # added at end of existing data
>>> file.close()
>>>
>>> open('data.txt').read()               # open and read entire file
'Hello file world!\nBye   file world.\nThe Life of Brian'

In fact, although most files are opened using the sorts of calls we just ran, open actually supports additional arguments for more specific processing needs, the first three of which are the most commonly used—the filename, the open mode, and a buffering specification. All but the first of these are optional: if omitted, the open mode argument defaults to r (input), and the buffering policy is to enable full buffering. For special needs, here are a few things you should know about these three open arguments:

Filename

As mentioned earlier, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described in the prior chapter). In general, most filename forms you can type in your system shell will work in an open call. For instance, a relative filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory’s parent—up one, and down to directory temp.

Open mode

The open function accepts other modes, too, some of which we’ll see at work later in this chapter: r+, w+, and a+ to open for reads and writes, and any mode string with a b to designate binary mode. For instance, mode r+ means both reads and writes are allowed on an existing file; w+ allows reads and writes but creates the file anew, erasing any prior content; rb and wb read and write data in binary mode without any translations; and wb+ and r+b both combine binary mode and input plus output. In general, the mode string defaults to r for read but can be w for write and a for append, and you may add a + for update, as well as a b or t for binary or text mode; order is largely irrelevant.

As we’ll see later in this chapter, the + modes are often used in conjunction with the file object’s seek method to achieve random read/write access. Regardless of mode, file contents are always strings in Python programs—read methods return a string, and we pass a string to write methods. As also described later, though, the mode string implies which type of string is used: str for text mode or bytes and other byte string types for binary mode.

Buffering policy

The open call also takes an optional third buffering policy argument which lets you control buffering for the file—the way that data is queued up before being transferred, to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately, but this is allowed in binary modes only), 1 means they are line buffered, and any other positive value means full buffering is used (the default, if no buffering argument is passed).
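For instance, here’s a quick sketch of the three policies in action (the filenames here are arbitrary):

>>> file = open('data.bin', 'wb', 0)        # unbuffered: binary mode required
>>> file = open('log.txt', 'w', 1)          # line buffered: flushed on each \n
>>> file = open('data.txt', 'w')            # full buffering: the default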

As usual, Python’s library manual and reference texts have the full story on additional open arguments beyond these three. For instance, the open call supports additional arguments related to the end-of-line mapping behavior and the automatic Unicode encoding of content performed for text-mode files. Since we’ll discuss both of these concepts in the next section, let’s move ahead.

Binary and Text Files

All of the preceding examples process simple text files, but Python scripts can also open and process files containing binary data—JPEG images, audio clips, packed binary data produced by FORTRAN and C programs, encoded text, and anything else that can be stored in files as bytes. The primary difference in terms of your code is the mode argument passed to the built-in open function:

>>> file = open('data.txt', 'wb')      # open binary output file
>>> file = open('data.txt', 'rb')      # open binary input file

Once you’ve opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. The readline and readlines methods as well as the file’s line iterator still work here for text files opened in binary mode, but they don’t make sense for truly binary data that isn’t line oriented (end-of-line bytes are meaningless, if they appear at all).

In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. For binary mode files, though, file content is represented as byte strings. Continuing with our text file from preceding examples:

>>> open('data.txt').read()                                   # text mode: str
'Hello file world!\nBye   file world.\nThe Life of Brian'

>>> open('data.txt', 'rb').read()                             # binary mode: bytes
b'Hello file world!\r\nBye   file world.\r\nThe Life of Brian'

>>> file = open('data.txt', 'rb')
>>> for line in file: print(line)
...
b'Hello file world!\r\n'
b'Bye   file world.\r\n'
b'The Life of Brian'

This occurs because Python 3.X treats text-mode files as Unicode, and automatically decodes content on input and encodes it on output. Binary mode files instead give us access to file content as raw byte strings, with no translation of content—they reflect exactly what is stored on the file. Because str strings are always Unicode text in 3.X, the special bytes string is required to represent binary data as a sequence of byte-size integers which may contain any 8-bit value. Because normal and byte strings have almost identical operation sets, many programs can largely take this on faith; but keep in mind that you really must open truly binary data in binary mode for input, because it will not generally be decodable as Unicode text.

Similarly, you must also supply byte strings for binary mode output—normal strings are not raw binary data, but are decoded Unicode characters (a.k.a. code points) which are encoded to binary on text-mode output:

>>> open('data.bin', 'wb').write(b'Spam\n')
5
>>> open('data.bin', 'rb').read()
b'Spam\n'

>>> open('data.bin', 'wb').write('spam\n')
TypeError: must be bytes or buffer, not str

But notice that this file’s line ends with just \n, instead of the Windows \r\n that showed up in the preceding example for the text file in binary mode. Strictly speaking, binary mode disables Unicode encoding translation, but it also prevents the automatic end-of-line character translation performed by text-mode files by default. Before we can understand this fully, though, we need to study the two main ways in which text files differ from binary.

Unicode encodings for text files

As mentioned earlier, text-mode file objects always translate data according to a default or provided Unicode encoding type, when the data is transferred to and from the external file. Their content is encoded on files, but decoded in memory. Binary mode files don’t perform any such translation, which is what we want for truly binary data. For instance, consider the following string, which embeds a Unicode character whose binary value is outside the normal 7-bit range of the ASCII encoding standard:

>>> data = 'sp\xe4m'
>>> data
'späm'
>>> 0xe4, bin(0xe4), chr(0xe4)
(228, '0b11100100', 'ä')

It’s possible to manually encode this string according to a variety of Unicode encoding types—its raw binary byte string form is different under some encodings:

>>> data.encode('latin1')                  # 8-bit characters: ascii + extras
b'sp\xe4m'

>>> data.encode('utf8')                    # 2 bytes for special characters only
b'sp\xc3\xa4m'

>>> data.encode('ascii')                   # does not encode per ascii
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)

Python displays printable characters in these strings normally, but nonprintable bytes show as \xNN hexadecimal escapes which become more prevalent under more sophisticated encoding schemes (cp500 in the following is an EBCDIC encoding):

>>> data.encode('utf16')                   # 2 bytes per character plus preamble
b'\xff\xfes\x00p\x00\xe4\x00m\x00'

>>> data.encode('cp500')                   # an ebcdic encoding: very different
b'\xa2\x97C\x94'

The encoded results here reflect the string’s raw binary form when stored in files. Manual encoding is usually unnecessary, though, because text files handle encodings automatically on data transfers—reads decode and writes encode, according to the encoding name passed in (or a default for the underlying platform: see sys.getdefaultencoding). Continuing our interactive session:

>>> open('data.txt', 'w', encoding='latin1').write(data)
4
>>> open('data.txt', 'r', encoding='latin1').read()
'späm'
>>> open('data.txt', 'rb').read()
b'sp\xe4m'

If we open in binary mode, though, no encoding translation occurs—the last command in the preceding example shows us what’s actually stored on the file. To see how file content differs for other encodings, let’s save the same string again:

>>> open('data.txt', 'w', encoding='utf8').write(data)        # encode data per utf8
4
>>> open('data.txt', 'r', encoding='utf8').read()             # decode: undo encoding
'späm'
>>> open('data.txt', 'rb').read()                             # no data translations
b'sp\xc3\xa4m'

This time, raw file content is different, but text mode’s auto-decoding makes the string the same by the time it’s read back by our script. Really, encodings pertain only to strings while they are in files; once they are loaded into memory, strings are simply sequences of Unicode characters (“code points”). This translation step is what we want for text files, but not for binary. Because binary modes skip the translation, you’ll want to use them for truly binary data. In fact, you usually must—trying to write unencodable data and attempting to read undecodable data is an error:

>>> open('data.txt', 'w', encoding='ascii').write(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)

>>> open(r'C:\Python31\python.exe', 'r').read()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2:
character maps to <undefined>

Binary mode is also a last resort for reading text files, if they cannot be decoded per the underlying platform’s default, and the encoding type is unknown—the following recreates the original strings if encoding type is known, but fails if it is not known unless binary mode is used (such failure may occur either on inputting the data or printing it, but it fails nevertheless):

>>> open('data.txt', 'w', encoding='cp500').writelines(['spam\n', 'ham\n'])
>>> open('data.txt', 'r', encoding='cp500').readlines()
['spam\n', 'ham\n']

>>> open('data.txt', 'r').readlines()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2:
character maps to <undefined>

>>> open('data.txt', 'rb').readlines()
[b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%']

>>> open('data.txt', 'rb').read()
b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%'

If all your text is ASCII, you can generally ignore encoding altogether; data in files maps directly to characters in strings, because ASCII is a subset of most platforms’ default encodings. If you must process files created with other encodings, and possibly on different platforms (obtained from the Web, for instance), binary mode may be required if encoding type is unknown. Keep in mind, however, that text in still-encoded binary form might not work as you expect: because it is encoded per a given encoding scheme, it might not accurately compare or combine with text encoded in other schemes.
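For example, the same text encoded per two different schemes yields byte strings that are unequal, even though they represent the same characters:

>>> 'späm'.encode('latin1'), 'späm'.encode('utf8')
(b'sp\xe4m', b'sp\xc3\xa4m')
>>> 'späm'.encode('latin1') == 'späm'.encode('utf8')
False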

Again, see other resources for more on the Unicode story. We’ll revisit it at various points in this book, especially in Chapter 9, to see how it relates to the tkinter Text widget, and in Part IV, covering Internet programming, to learn what it means for data shipped over networks by protocols such as FTP, email, and the Web at large. Text files have another feature, though, which is similarly a nonfeature for binary data: line-end translations, the topic of the next section.

End-of-line translations for text files

For historical reasons, the end of a line of text in a file is represented by different characters on different platforms. It’s a single \n character on Unix-like platforms, but the two-character sequence \r\n on Windows. That’s why files moved between Linux and Windows may look odd in your text editor after transfer—they may still be stored using the original platform’s end-of-line convention.

For example, most Windows editors handle text in Unix format, but Notepad has been a notable exception—text files copied from Unix or Linux may look like one long line when viewed in Notepad, with strange characters inside (\n). Similarly, transferring a file from Windows to Unix in binary mode retains the \r characters (which often appear as ^M in text editors).

Python scripts that process text files don’t normally have to care, because the file object automatically maps the DOS \r\n sequence to a single \n. It works like this by default—when scripts are run on Windows:

  • For files opened in text mode, \r\n is translated to \n when input.

  • For files opened in text mode, \n is translated to \r\n when output.

  • For files opened in binary mode, no translation occurs on input or output.

On Unix-like platforms, no translations occur, because \n is used in files. You should keep in mind two important consequences of these rules. First, the end-of-line character for text-mode files is almost always represented as a single \n within Python scripts, regardless of how it is stored in external files on the underlying platform. By mapping to and from \n on input and output, Python hides the platform-specific difference.

The second consequence of the mapping is subtler: when processing binary files, binary open modes (e.g., rb, wb) effectively turn off line-end translations. If they did not, the translations listed previously could very well corrupt data as it is input or output—a random \r in data might be dropped on input, or added for a \n in the data on output. The net effect is that your binary data would be trashed when read and written—probably not quite what you want for your audio files and images!

This issue has become almost secondary in Python 3.X, because we generally cannot use binary data with text-mode files anyhow—because text-mode files automatically apply Unicode encodings to content, transfers will generally fail when the data cannot be decoded on input or encoded on output. Using binary mode avoids Unicode errors, and automatically disables line-end translations as well (Unicode errors can be caught in try statements, too). Still, the fact that binary mode prevents end-of-line translations to protect file content is best noted as a separate feature, especially if you work in an ASCII-only world where Unicode encoding issues are irrelevant.

Here’s the end-of-line translation at work in Python 3.1 on Windows—text mode translates to and from the platform-specific line-end sequence so our scripts are portable:

>>> open('temp.txt', 'w').write('shrubbery\n')   # text output mode: \n -> \r\n
10
>>> open('temp.txt', 'rb').read()                # binary input: actual file bytes
b'shrubbery\r\n'
>>> open('temp.txt', 'r').read()                 # text input mode: \r\n -> \n
'shrubbery\n'

By contrast, writing data in binary mode prevents all translations as expected, even if the data happens to contain bytes that are part of line-ends in text mode (byte strings print their characters as ASCII if printable, else as hexadecimal escapes):

>>> data = b'a\0b\rc\r\nd'                       # 4 escape code bytes, 4 normal
>>> len(data)
8
>>> open('temp.bin', 'wb').write(data)           # write binary data to file as is
8
>>> open('temp.bin', 'rb').read()                # read as binary: no translation
b'a\x00b\rc\r\nd'

But reading binary data in text mode, whether accidental or not, can corrupt the data when transferred because of line-end translations (assuming it passes as decodable at all; ASCII bytes like these do on this Windows platform):

>>> open('temp.bin', 'r').read()                 # text mode read: botches \r !
'a\x00b\nc\nd'

Similarly, writing binary data in text mode can have the same effect—line-end bytes may be changed or inserted (again, assuming the data is encodable per the platform’s default):

>>> open('temp.bin', 'w').write(data)            # must pass str for text mode
TypeError: must be str, not bytes                # use bytes.decode() to convert to str
>>> data.decode()
'a\x00b\rc\r\nd'

>>> open('temp.bin', 'w').write(data.decode())
8
>>> open('temp.bin', 'rb').read()                # text mode write: added \r !
b'a\x00b\rc\r\r\nd'

>>> open('temp.bin', 'r').read()                 # again drops, alters \r on input
'a\x00b\nc\n\nd'

The short story to remember here is that you should generally use \n to refer to end-line in all your text file content, and you should always open binary data in binary file modes to suppress both end-of-line translations and any Unicode encodings. A file’s content generally determines its open mode, and file open modes usually process file content exactly as we want.

Keep in mind, though, that you might also need to use binary file modes for text in special contexts. For instance, in Chapter 6’s examples, we’ll sometimes open text files in binary mode to avoid possible Unicode decoding errors, for files generated on arbitrary platforms that may have been encoded in arbitrary ways. Doing so avoids encoding errors, but also can mean that some text might not work as expected—searches might not always be accurate when applied to such raw text, since the search key must be a byte string, formatted and encoded according to a specific and possibly incompatible encoding scheme.

In Chapter 11’s PyEdit, we’ll also need to catch Unicode exceptions in a “grep” directory file search utility, and we’ll go further to allow Unicode encodings to be specified for file content across entire trees. Moreover, a script that attempts to translate between different platforms’ end-of-line character conventions explicitly may need to read text in binary mode to retain the original line-end representation truly present in the file; in text mode, they would already be translated to \n by the time they reached the script.

It’s also possible to disable or further tailor end-of-line translations in text mode with additional open arguments we will finesse here. See the newline argument in open reference documentation for details; in short, passing an empty string to this argument also prevents line-end translation but retains other text-mode behavior. For this chapter, let’s turn next to two common use cases for binary data files: packed binary data and random access.

Parsing packed binary data with the struct module

By using the letter b in the open call, you can open binary datafiles in a platform-neutral way and read and write their content with normal file object methods. But how do you process binary data once it has been read? It will be returned to your script as a simple string of bytes, most of which are probably not printable characters.

If you just need to pass binary data along to another file or program, your work is done—for instance, simply pass the byte string to another file opened in binary mode. And if you just need to extract a number of bytes from a specific position, string slicing will do the job; you can even follow up with bitwise operations if you need to. To get at the contents of binary data in a structured way, though, as well as to construct its contents, the standard library struct module is a more powerful alternative.

The struct module provides calls to pack and unpack binary data, as though the data was laid out in a C-language struct declaration. It is also capable of composing and decomposing using any endian-ness you desire (endian-ness determines whether a number’s most significant bytes are stored at the start or the end of its byte string). Building a binary datafile, for instance, is straightforward—pack Python values into a byte string and write them to a file. The format string here in the pack call means big-endian (>), with an integer, four-character string, half integer, and floating-point number:

>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()

Notice how the struct module returns a bytes string: we’re in the realm of binary data here, not text, and must use binary mode files to store. As usual, Python displays most of the packed binary data’s bytes here with \xNN hexadecimal escape sequences, because the bytes are not printable characters. To parse data like that which we just produced, read it off the file and pass it to the struct module with the same format string—you get back a tuple containing the values parsed out of the string and converted to Python objects:

>>> import struct
>>> file   = open('data.bin', 'rb')
>>> data   = file.read()
>>> values = struct.unpack('>i4shf', data)
>>> values
(2, b'spam', 3, 1.2339999675750732)

Parsed-out strings are byte strings again, and we can apply string and bitwise operations to probe deeper:

>>> bin(values[0] | 0b1)                            # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)

Also note that slicing comes in handy in this domain; to grab just the four-character string in the middle of the packed binary data we just read, we can simply slice it out. Numeric values could similarly be sliced out and then passed to struct.unpack for conversion:

>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> data[4:8]
b'spam'

>>> number = data[8:10]
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)

Packed binary data crops up in many contexts, including some networking tasks, and in data produced by other programming languages. Because it’s not part of every programming job’s description, though, we’ll defer to the struct module’s entry in the Python library manual for more details.

Random access files

Binary files also typically see action in random access processing. Earlier, we mentioned that adding a + to the open mode string allows a file to be both read and written. This mode is typically used in conjunction with the file object’s seek method to support random read/write access. Such flexible file processing modes allow us to read bytes from one location, write to another, and so on. When scripts combine this with binary file modes, they may fetch and update arbitrary bytes within a file.

We used seek earlier to rewind files instead of closing and reopening. As mentioned, read and write operations always take place at the current position in the file; files normally start at offset 0 when opened and advance as data is transferred. The seek call lets us move to a new position for the next transfer operation by passing in a byte offset.

Python’s seek method also accepts an optional second argument that has one of three values—0 for absolute file positioning (the default); 1 to seek relative to the current position; and 2 to seek relative to the file’s end. That’s why passing just an offset of 0 to seek is roughly a file rewind operation: it repositions the file to its absolute start. In general, seek supports random access on a byte-offset basis. Seeking to a multiple of a record’s size in a binary file, for instance, allows us to fetch a record by its relative position.

Although you can use seek without + modes in open (e.g., to just read from random locations), it’s most flexible when combined with input/output files. And while you can perform random access in text mode, too, the fact that text modes perform Unicode encodings and line-end translations makes them difficult to use when absolute byte offsets and lengths are required for seeks and reads—your data may look very different when stored in files. Text mode may also make your data nonportable to platforms with different default encodings, unless you’re willing to always specify an explicit encoding for opens. Except for simple unencoded ASCII text without line-ends, seek tends to work best with binary mode files.

To demonstrate, let’s create a file in w+b mode (equivalent to wb+) and write some data to it; this mode allows us to both read and write, but initializes the file to be empty if it’s already present (all w modes do). After writing some data, we seek back to file start to read its content (some integer return values are omitted in this example again for brevity):

>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']

>>> file = open('random.bin', 'w+b')
>>> for rec in records:                                   # write four records
...     size = file.write(rec)                            # bytes for binary mode
...
>>> file.flush()
>>> pos = file.seek(0)                                    # read entire file
>>> print(file.read())
b'ssssssssppppppppaaaaaaaammmmmmmm'

Now, let’s reopen our file in r+b mode; this mode allows both reads and writes again, but does not initialize the file to be empty. This time, we seek and read in multiples of the size of data items (“records”) stored, to both fetch and update them at random:

c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(file.read())                           # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'

>>> record = b'X' * 8
>>> file.seek(0)                                 # update first record
>>> file.write(record)
>>> file.seek(len(record) * 2)                   # update third record
>>> file.write(b'Y' * 8)

>>> file.seek(8)
>>> file.read(len(record))                       # fetch second record
b'pppppppp'
>>> file.read(len(record))                       # fetch next (third) record
b'YYYYYYYY'

>>> file.seek(0)                                 # read entire file
>>> file.read()
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'

c:\temp> type random.bin                         # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm

Finally, keep in mind that seek can be used to achieve random access, even if it’s just for input. The following seeks in multiples of record size to read (but not write) fixed-length records at random. Notice that it also uses r text mode: since this data is simple ASCII text bytes and has no line-ends, text and binary modes work the same on this platform:

c:\temp> python
>>> file = open('random.bin', 'r')        # text mode ok if no encoding/endlines
>>> reclen = 8
>>> file.seek(reclen * 3)                 # fetch record 4
>>> file.read(reclen)
'mmmmmmmm'
>>> file.seek(reclen * 1)                 # fetch record 2
>>> file.read(reclen)
'pppppppp'

>>> file = open('random.bin', 'rb')       # binary mode works the same here
>>> file.seek(reclen * 2)                 # fetch record 3
>>> file.read(reclen)                     # returns byte strings
b'YYYYYYYY'

But unless your file’s content is always a simple unencoded text form like ASCII and has no translated line-ends, text mode should not generally be used if you are going to seek—line-ends may be translated on Windows and Unicode encodings may make arbitrary transformations, both of which can make absolute seek offsets difficult to use. In the following, for example, the positions of characters after the first non-ASCII no longer match between the string in Python and its encoded representation on the file:

>>> data = 'sp\xe4m'                                 # data to your script
>>> data, len(data)                                  # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))    # bytes written to file
(b'sp\xc3\xa4m', 5)

>>> f = open('test', mode='w+', encoding='utf8')     # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1)                             # ascii bytes work
's'
>>> f.seek(2); f.read(1)                             # as does 2-byte nonascii
'ä'
>>> data[3]                                          # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte

As you can see, Python’s file modes provide flexible file processing for programs that require it. In fact, the os module offers even more file processing options, as the next section describes.

Lower-Level File Tools in the os Module

The os module contains an additional set of file-processing functions that are distinct from the built-in file object tools demonstrated in previous examples. For instance, here is a partial list of os file-related calls:

os.open(path, flags, mode)

Opens a file and returns its descriptor

os.read(descriptor, N)

Reads at most N bytes and returns a byte string

os.write(descriptor, string)

Writes the bytes in byte string string to the file

os.lseek(descriptor, position, how)

Moves to position in the file

Technically, os calls process files by their descriptors, which are integer codes or “handles” that identify files in the operating system. Descriptor-based files deal in raw bytes, and have no notion of the line-end or Unicode translations for text that we studied in the prior section. In fact, apart from extras like buffering, descriptor-based files generally correspond to binary mode file objects, and we similarly read and write byte strings, not str strings. However, because the descriptor-based file tools in os are lower level and more complex than the built-in file objects created with the built-in open function, you should generally use the latter for all but very special file-processing needs.[9]

Using os.open files

To give you the general flavor of this tool set, though, let’s run a few interactive experiments. Although built-in file objects and os module descriptor files are processed with distinct tool sets, they are in fact related—file objects simply add a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor associated with a built-in file object. For instance, the standard stream file objects have descriptors 0, 1, and 2; calling the os.write function to send data to stdout by descriptor has the same effect as calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2

>>> sys.stdout.write('Hello stdio world\n')        # write via file method
Hello stdio world
18
>>> import os
>>> os.write(1, b'Hello descriptor world\n')       # write via os module
Hello descriptor world
23

Because file objects we open explicitly behave the same way, it’s also possible to process a given real external file on the underlying computer through the built-in open function, tools in the os module, or both (some integer return values are omitted here for brevity):

>>> file = open(r'C:\temp\spam.txt', 'w')       # create external file, object
>>> file.write('Hello stdio file\n')            # write via file object method
>>> file.flush()                                # else os.write data may arrive first!
>>> fd = file.fileno()                          # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')    # write via os module
>>> file.close()

C:\temp> type spam.txt                          # lines from both schemes
Hello stdio file
Hello descriptor file

os.open mode flags

So why the extra file tools in os? In short, they give more low-level control over file processing. The built-in open function is easy to use, but it may be limited by the underlying filesystem that it uses, and it may add extra behavior that we don’t want. The os module lets scripts be more specific—for example, the following opens a descriptor-based file in read-write and binary modes by performing a binary “or” on two mode flags exported by os:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
b'Hello stdio file\r\nHe'

>>> os.lseek(fdfile, 0, 0)                        # go back to start of file
>>> os.read(fdfile, 100)                          # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'

>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')                    # overwrite first 5 bytes
5

C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file

In this case, binary mode strings rb+ and r+b in the basic open call are equivalent:

>>> file = open(r'C:\temp\spam.txt', 'rb+')       # same but with open/objects
>>> file.read(20)
b'HELLO stdio file\r\nHe'
>>> file.seek(0)
>>> file.read(100)
b'HELLO stdio file\r\nHello descriptor file\n'
>>> file.seek(0)
>>> file.write(b'Jello')
5
>>> file.seek(0)
>>> file.read()
b'Jello stdio file\r\nHello descriptor file\n'

But on some systems, os.open flags let us specify more advanced things like exclusive access (O_EXCL) and nonblocking modes (O_NONBLOCK) when a file is opened. Some of these flags are not portable across platforms (another reason to use built-in file objects most of the time); see the library manual or run a dir(os) call on your machine for an exhaustive list of other open flags available.
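For instance, a quick interactive filter shows which flags your platform exports (results vary per system, so none are listed here):

>>> import os
>>> [name for name in dir(os) if name.startswith('O_')]     # platform-specific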

One final note here: using os.open with the O_EXCL flag is the most portable way to lock files for concurrent updates or other process synchronization in Python today. We’ll see contexts where this can matter in the next chapter, when we begin to explore multiprocessing tools. Programs running in parallel on a server machine, for instance, may need to lock files before performing updates, if multiple threads or processes might attempt such updates at the same time.
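To make the idea concrete, here is a minimal locking sketch, assuming a lock-file name of our own choosing (spam.lock is purely hypothetical); because os.open fails if the file already exists when O_EXCL is passed, only one process at a time can create it:

import os, time, errno

def acquire_lock(lockname='spam.lock'):          # hypothetical lock-file name
    while True:
        try:                                     # atomic create-or-fail
            return os.open(lockname, os.O_CREAT | os.O_EXCL | os.O_RDWR)
        except OSError as exc:
            if exc.errno != errno.EEXIST:        # unexpected failure: reraise
                raise
            time.sleep(0.1)                      # lock held elsewhere: retry

def release_lock(fd, lockname='spam.lock'):
    os.close(fd)                                 # close our descriptor
    os.remove(lockname)                          # delete the file to release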

Wrapping descriptors in file objects

We saw earlier how to go from file object to file descriptor with the fileno file object method; given a descriptor, we can use os module tools for lower-level file access to the underlying file. We can also go the other way—the os.fdopen call wraps a file descriptor in a file object. Because conversions work both ways, we can generally use either tool set—file object or os module:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>> objfile.read()
b'Jello stdio file\r\nHello descriptor file\n'

In fact, we can wrap a file descriptor in either a binary or text-mode file object: in text mode, reads and writes perform the Unicode encodings and line-end translations we studied earlier and deal in str strings instead of bytes:

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

In Python 3.X, the built-in open call also accepts a file descriptor instead of a file name string; in this mode it works much like os.fdopen, but gives you greater control—for example, you can use additional arguments to specify a nondefault Unicode encoding for text and suppress the default descriptor close. Really, though, os.fdopen accepts the same extra-control arguments in 3.X, because it has been redefined to do little but call back to the built-in open (see os.py in the standard library):

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>> objfile.seek(0)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

We’ll make use of this file object wrapper technique to simplify text-oriented pipes and other descriptor-like objects later in this book (e.g., sockets have a makefile method which achieves similar effects).

Other os module file tools

The os module also includes an assortment of file tools that accept a file pathname string and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file’s owner and permission settings (os.chown, os.chmod). Let’s step through a few examples of these tools in action:

>>> os.chmod('spam.txt', 0o777)          # enable all accesses

This os.chmod file permissions call passes a 9-bit value composed of three sets of three bits each. From left to right, the three sets represent the file’s owning user, the file’s group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is “1” in this value, it means that the corresponding operation is allowed for the accessor. For instance, octal 0777 is a string of nine “1” bits in binary, so it enables all three kinds of accesses for all three user groups; octal 0600 means that the file can be read and written only by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).
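If octal bit strings seem too cryptic, Python’s standard stat module (demonstrated ahead) exports symbolic names for the same bits; a small sketch, assuming the spam.txt file used in these examples:

>>> import os, stat
>>> os.chmod('spam.txt', 0o600)                          # rw for owner only
>>> os.chmod('spam.txt', stat.S_IRUSR | stat.S_IWUSR)    # the same, by name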

This scheme stems from Unix file permission settings, but the call works on Windows as well. If it’s puzzling, see your system’s documentation (e.g., a Unix manpage) for chmod. Moving on:

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')      # from, to

>>> os.remove(r'C:\temp\spam.txt')                           # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'

>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file’s name; the os.remove file deletion call deletes a file from your system and is synonymous with os.unlink (the latter reflects the call’s name on Unix but was obscure to users of other platforms).[10] The os module also exports the stat system call:

>>> open('spam.txt', 'w').write('Hello stat world\n')        # +1 for \r added
17
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)

>>> info.st_mode, info.st_size                  # via named-tuple item attr names
(33206, 18)

>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]      # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)

The os.stat call returns a tuple of values (really, in 3.X a special kind of tuple with named items) giving low-level information about the named file, and the stat module exports constants and functions for querying this information in a portable way. For instance, indexing an os.stat result on offset stat.ST_SIZE returns the file’s size, and calling stat.S_ISDIR with the mode item from an os.stat result checks whether the file is a directory. As shown earlier, though, both of these operations are available in the os.path module, too, so it’s rarely necessary to use os.stat except for low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)

File Scanners

Before we leave our file tools survey, it’s time for something that performs a more tangible task and illustrates some of what we’ve learned so far. Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 4-1. PP4E\System\Filetools\scanfile.py
def scanner(name, function):
    file = open(name, 'r')               # create a file object
    while True:
        line = file.readline()           # call file methods
        if not line: break               # until end-of-file
        function(line)                   # call a function object
    file.close()

The scanner function doesn’t care what line-processing function is passed in, and that accounts for most of its generality—it is happy to apply any single-argument function that exists now or in the future to all of the lines in a text file. If we code this module and put it in a directory on the module search path, we can use it any time we need to step through a file line by line. Example 4-2 is a client script that does simple line translations.

Example 4-2. PP4E\System\Filetools\commands.py
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner
class UnknownCommand(Exception): pass

def processLine(line):                      # define a function
    if line[0] == '*':                      # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])            # strip first and last char: \n
    else:
        raise UnknownCommand(line)          # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]       # allow filename cmd arg
scanner(filename, processLine)              # start the scanner

The text file hillbillies.txt contains the following lines:

*Granny
+Jethro
*Elly May
+"Uncle Jed"

and our commands script could be run as follows:

C:\...\PP4E\System\Filetools> python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"

This works, but there are a variety of coding alternatives for both files, some of which may be better than those listed above. For instance, we could also code the command processor of Example 4-2 in the following way; especially if the number of command options starts to become large, such a data-driven approach may be more concise and easier to maintain than a large if statement with essentially redundant actions (if you ever have to change the way output lines print, you’ll have to change it in only one place with this form):

commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)

The scanner could similarly be improved. As a rule of thumb, we can also usually speed things up by shifting processing from Python code to built-in tools. For instance, if we’re concerned with speed, we can probably make our file scanner faster by using the file’s line iterator to step through the file instead of the manual readline loop in Example 4-1 (though you’d have to time this with your Python to be sure):

def scanner(name, function):
    for line in open(name, 'r'):         # scan line by line
        function(line)                   # call a function object

And we can work more magic in Example 4-1 with the iteration tools like the map built-in function, the list comprehension expression, and the generator expression. Here are three minimalist’s versions; the for loop is replaced by map or a comprehension, and we let Python close the file for us when it is garbage collected or the script exits (these all build a temporary list of results along the way to run through their iterations, but this overhead is likely trivial for all but the largest of files):

def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))

File filters

The preceding works as planned, but what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 4-3. PP4E\System\Filetools\filters.py
import sys

def filter_files(name, function):         # filter file through function
    input  = open(name, 'r')              # create file objects
    output = open(name + '.out', 'w')     # explicit output file too
    for line in input:
        output.write(function(line))      # write the modified line
    input.close()
    output.close()                        # output has a '.out' suffix

def filter_stream(function):              # no explicit files
    while True:                           # use standard streams
        line = sys.stdin.readline()       # or: input()
        if not line: break
        print(function(line), end='')     # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)      # copy stdin to stdout if run

Notice that the newer context managers feature discussed earlier could save us a few lines here in the file-based filter of Example 4-3, and also guarantee immediate file closures if the processing function fails with an exception:

def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))      # write the modified line

And again, file object line iterators could simplify the stream-based filter’s code in this example as well:

def filter_stream(function):
    for line in sys.stdin:                    # read by lines automatically
        print(function(line), end='')

Since the standard streams are preopened for us, they’re often easier to use. When run standalone, the script simply parrots stdin to stdout:

C:\...\PP4E\System\Filetools> filters.py < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"

But this module is also useful when imported as a library (clients provide the line-processing function):

>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"

We’ll see files in action often in the remainder of this book, especially in the more complete and functional system examples of Chapter 6. First though, we turn to tools for processing our files’ home.

Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory—a “folder” in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you’ve forgotten where it is used). There are many platform-specific ways to do this (e.g., the find and grep commands in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works—Windows, Unix, Linux, Macintosh, and just about any other platform commonly used today. If you simply copy your script to any machine you wish to use it on, it will work regardless of which other tools are available there; all you need is Python. Moreover, coding such tasks in Python also allows you to perform arbitrary actions along the way—replacements, deletions, and whatever else you can code in the Python language.

Walking One Directory

The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop or other iteration tool, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. For scanning directories there are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

Running shell listing commands with os.popen

How did you go about getting directory file listings before you heard of Python? If you’re new to shell tools programming, the answer may be “Well, I started a Windows file explorer and clicked on things,” but I’m thinking here in terms of less GUI-oriented command-line mechanisms.

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line that we can type in a shell, they are the most general way to grab a directory listing inside a Python program. We met os.popen in the prior chapters; it runs a shell command string and gives us a file object from which we can read the command’s output. To illustrate, let’s first assume the following directory structures—I have both the usual dir and a Unix-like ls command from Cygwin on my Windows laptop:

c:\temp> dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

c:\temp> c:\cygwin\bin\ls
PP3E  parts  random.bin  spam.txt  temp.bin  temp.txt

c:\temp> c:\cygwin\bin\ls parts
part0001  part0002  part0003  part0004

The parts and PP3E names are nested subdirectories in C:\temp here (the latter is a copy of the prior edition’s examples tree, which I used occasionally in this text). Now, as we’ve seen, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):

C:\temp> python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']

Lines read from a shell command come back with a trailing end-of-line character, but it’s easy enough to slice it off; the os.popen result also gives us a line iterator just like normal files:

>>> for line in os.popen('dir /B'):
...     print(line[:-1])
...
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']

For pipe objects, the effect of iterators may be even more useful than simply avoiding loading the entire result into memory all at once: readlines will always block the caller until the spawned program is completely finished, whereas the iterator might not.
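For instance, a minimal sketch of the incremental style, written here as a script fragment rather than an interactive session; each line is handled as soon as the spawned program produces it:

import os
for line in os.popen('dir /B'):          # 'ls -1' is the rough Unix equivalent
    print(line.rstrip('\n'))             # process each line as it arrives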

The dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed; again, we’re just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']

>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']

>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

These calls use general tools and work as advertised. As I noted earlier, though, the downsides of os.popen are that it requires using a platform-specific shell command and it incurs a performance hit to start up an independent program. In fact, different listing tools may sometimes produce different results:

>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>>
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']

The next two alternative techniques do better on both counts.

The glob module

The term globbing comes from the * wildcard character in filename patterns; per computing folklore, a * matches a “glob” of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory—files and subdirectories—whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling the glob.glob built-in—a tool that accepts a filename pattern to expand, and returns a list (not a generator) of matching file names:

>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']

>>> glob.glob('*.bin')
['random.bin', 'temp.bin']

>>> glob.glob('parts')
['parts']

>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

>>> glob.glob('parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

The glob call accepts the usual filename pattern syntax used in shells: ? means any one character, * means any number of characters, and [] is a character selection set.[11] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call is implemented without spawning a shell command (it uses os.listdir, described in the next section) and so is likely to be faster and more portable and uniform across all Python platforms than the os.popen schemes shown earlier.

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)
...
PP3E\Examples\PP3E\Lang\summer-alt.py
PP3E\Examples\PP3E\Lang\summer.py
PP3E\Examples\PP3E\PyTools\search_all.py

Here, we get back filenames from two different directories that match the s*.py pattern; because the directory name preceding this is a * wildcard, Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands achieves the same effect, but only if the underlying shell or listing command does, too, and with possibly different result formats across tools and platforms.

The os.listdir call

The os module’s listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory—both simple files and nested directories—for use in the calling script:

>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

This, too, is done without resorting to shell commands and so is both fast and portable to all major Python platforms. The result is not in any particular order across platforms (but can be sorted with the list sort method or sorted built-in function); returns base filenames without their directory path prefixes; does not include names “.” or “..” if present; and includes names of both files and directories at the listed level.
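When order matters, the sorted built-in suffices; a quick sketch, using the parts directory from above:

>>> for name in sorted(os.listdir('parts')):      # impose a platform-neutral order
...     print(name)
...
part0001
part0002
part0003
part0004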

To compare all three listing techniques, let’s run them here side by side on an explicit directory. They differ in some ways but are mostly just variations on a theme for this task—os.popen returns end-of-lines and may sort filenames on some platforms, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

Of these three, glob and listdir are generally better options if you care about script portability and result uniformity, and listdir seems fastest in recent Python releases (but gauge its performance yourself—implementations may change over time).

Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, whereas listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so that it can add different directory names on the front:

>>> dirname = r'C:\temp\parts'
>>>
>>> import os, glob
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
...
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name before passing filenames on to other tools:

>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
...
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004

When you begin writing realistic directory processing tools of the sort we’ll develop in Chapter 6, you’ll find these calls to be almost habit.

Walking Directory Trees

You may have noticed that almost all of the techniques in this section so far return the names of files in only a single directory (globbing with more involved patterns is the only exception). That’s fine for many tasks, but what if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?

For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that’s tedious, error prone, and just plain not fun.

Luckily, in Python it’s almost as easy to process a directory tree as it is to inspect a single directory. We can either write a recursive routine to traverse the tree, or use a tree-walker utility built into the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that’s just about everywhere).

The os.walk visitor

To make it easy to apply an operation to all files in a complete directory tree, Python comes with a utility that scans trees for us and runs code we provide at every directory along the way: the os.walk function is called with a directory root name and automatically walks the entire tree at root and below.

Operationally, os.walk is a generator function—at each directory in the tree, it yields a three-item tuple, containing the name of the current directory as well as lists of both all the files and all the subdirectories in the current directory. Because it’s a generator, its walk is usually run by a for loop (or other iteration tool); on each iteration, the walker advances to the next subdirectory, and the loop runs its code for the next level of the tree (for instance, opening and searching all the files at that level).

That description might sound complex the first time you hear it, but os.walk is fairly straightforward once you get the hang of it. In the following, for example, the loop body’s code is run for each directory in the tree rooted at the current working directory (.). Along the way, the loop simply prints the directory name and all the files at the current level after prepending the directory name. It’s simpler in Python than in English (I removed the PP3E subdirectory for this test to keep the output short):

>>> import os
>>> for (dirname, subshere, fileshere) in os.walk('.'):
...     print('[' + dirname + ']')
...     for fname in fileshere:
...         print(os.path.join(dirname, fname))        # handle one file
...
[.]
.\random.bin
.\spam.txt
.\temp.bin
.\temp.txt
[.\parts]
.\parts\part0001
.\parts\part0002
.\parts\part0003
.\parts\part0004

In other words, we’ve coded our own custom and easily changed recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let’s make it permanently available in a module file, as shown in Example 4-4, now that we’ve worked out the details interactively.

Example 4-4. PP4E\System\Filetools\lister_walk.py
"list file tree with os.walk"

import sys, os

def lister(root):                                           # for a root dir
    for (thisdir, subshere, fileshere) in os.walk(root):    # generate dirs in tree
        print('[' + thisdir + ']')
        for fname in fileshere:                             # print files in this dir
            path = os.path.join(thisdir, fname)             # add dir name prefix
            print(path)

if __name__ == '__main__':
    lister(sys.argv[1])                                     # dir name in cmdline

When packaged this way, the code can also be run from a shell command line. Here it is being launched with the root directory to be listed passed in as a command-line argument:

C:\...\PP4E\System\Filetools> python lister_walk.py C:\temp\test
[C:\temp\test]
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004

Here’s a more involved example of os.walk in action. Suppose you have a directory tree of files and you want to find all Python source files within it that reference the mimetypes module we’ll study in Chapter 6. The following is one (albeit hardcoded and overly specific) way to accomplish this task:

>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'mimetypes' in open(pathname).read():
...                 matches.append(pathname)
...
>>> for name in matches: print(name)
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py

This code loops through all the files at each level, looking for files with .py at the end of their names and which contain the search string. When a match is found, its full name is appended to the results list object; alternatively, we could also simply build a list of all .py files and search each in a for loop after the walk. Since we’re going to code a much more general solution to this type of problem in Chapter 6, though, we’ll let this stand for now.
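For reference, here is a sketch of the collect-first alternative just mentioned—gather the .py pathnames during the walk, and search them in a second step once the walk completes:

import os
pypaths = []
for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
    for filename in fileshere:
        if filename.endswith('.py'):
            pypaths.append(os.path.join(dirname, filename))     # collect now

matches = [path for path in pypaths if 'mimetypes' in open(path).read()]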

If you want to see what’s really going on in the os.walk generator, call its __next__ method (or equivalently, pass it to the next built-in function) manually a few times, just as the for loop does automatically; each time, you advance to the next subdirectory in the tree:

>>> gen = os.walk(r'C:\temp\test')
>>> gen.__next__()
('C:\\temp\\test', ['parts'], ['random.bin', 'spam.txt', 'temp.bin', 'temp.txt'])
>>> gen.__next__()
('C:\\temp\\test\\parts', [], ['part0001', 'part0002', 'part0003', 'part0004'])
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The library manual documents os.walk further than we will here. For instance, it supports bottom-up instead of top-down walks with its optional topdown=False argument, and callers may prune tree branches by deleting names in the subdirectories lists of the yielded tuples.

Internally, the os.walk call generates filename lists at each level with the os.listdir call we met earlier, which collects both file and directory names in no particular order and returns them without their directory paths; os.walk segregates this list into subdirectories and files (technically, nondirectories) before yielding a result. Also note that walk uses the very same subdirectories list it yields to callers in order to later descend into subdirectories. Because lists are mutable objects that can be changed in place, if your code modifies the yielded subdirectory names list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.
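Here is a brief sketch of both in-place techniques, assuming we wish to skip subdirectories whose names begin with a dot and visit everything else alphabetically; the slice assignment changes the very list that walk will use to descend next:

import os
for (dirname, subshere, fileshere) in os.walk('.'):
    subshere[:] = sorted(sub for sub in subshere if not sub.startswith('.'))
    for fname in sorted(fileshere):
        print(os.path.join(dirname, fname))       # pruned, ordered traversal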

Recursive os.listdir traversals

The os.walk tool does the work of tree traversals for us; we simply provide loop code with task-specific logic. However, it’s sometimes more flexible and hardly any more work to do the walking ourselves. The following script recodes the directory listing script with a manual recursive traversal function (a function that calls itself to repeat its actions). The mylister function in Example 4-5 is almost the same as lister in Example 4-4 but calls os.listdir to generate file paths manually and calls itself recursively to descend into subdirectories.

Example 4-5. PP4E\System\Filetools\lister_recur.py
# list files in dir tree by recursion

import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    for file in os.listdir(currdir):              # list files here
        path = os.path.join(currdir, file)        # add dir path back
        if not os.path.isdir(path):
            print(path)
        else:
            mylister(path)                        # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                         # dir name in cmdline

As usual, this file can be both imported and called or run as a script, though the fact that its result is printed text makes it less useful as an imported component unless its output stream is captured by another program.

When run as a script, this file’s output is equivalent to that of Example 4-4, but not identical—unlike the os.walk version, our recursive walker here doesn’t order the walk to visit files before stepping into subdirectories. It could do so by looping through the filenames list twice, selecting files on the first pass (see the sketch after the following listing), but as coded, the order depends on os.listdir results. For most use cases, the walk order would be irrelevant:

C:\...\PP4E\System\Filetools> python lister_recur.py C:\temp\test
[C:\temp\test]
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
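For completeness, here is a brief sketch of the two-pass variant mentioned above, which restores os.walk’s files-before-subdirectories ordering at each level; it is a drop-in replacement for the mylister of Example 4-5:

import os

def mylister(currdir):
    print('[' + currdir + ']')
    names = os.listdir(currdir)
    for name in names:                            # pass 1: files at this level
        path = os.path.join(currdir, name)
        if not os.path.isdir(path):
            print(path)
    for name in names:                            # pass 2: recur into subdirs
        path = os.path.join(currdir, name)
        if os.path.isdir(path):
            mylister(path)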

We’ll make better use of most of this section’s techniques in later examples in Chapter 6 and in this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques introduced here. Watch for these tools in action along the way. We’ll also code a find utility in Chapter 6 that combines the tree traversal of os.walk with the filename pattern expansion of glob.glob.

Handling Unicode Filenames in 3.X: listdir, walk, glob

Because all normal strings are Unicode in Python 3.X, the directory and file names generated by os.listdir, os.walk, and glob.glob so far in this chapter are technically Unicode strings. This can have some ramifications if your directories contain unusual names that might not decode properly.

Technically, because filenames may contain arbitrary text, os.listdir works in two modes in 3.X: given a bytes argument, this function will return filenames as encoded byte strings; given a normal str string argument, it instead returns filenames as Unicode strings, decoded per the filesystem’s encoding scheme:

C:\...\PP4E\System\Filetools> python
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']

>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']

The byte string version can be used if undecodable file names may be present. Because os.walk and glob.glob both work by calling os.listdir internally, they inherit this behavior by proxy. The os.walk tree walker, for example, calls os.listdir at each directory level; passing byte string arguments suppresses decoding and returns byte string results:

>>> for (dir, subs, files) in os.walk('..'): print(dir)
...
..
..\Environment
..\Filetools
..\Processes

>>> for (dir, subs, files) in os.walk(b'..'): print(dir)
...
b'..'
b'..\\Environment'
b'..\\Filetools'
b'..\\Processes'

The glob.glob tool similarly calls os.listdir internally before applying name patterns, and so also returns undecoded byte string names for byte string arguments:

>>> import glob
>>> glob.glob('.\*')[:3]
['.\\bigext-out.txt', '.\\bigext-tree.py', '.\\bigpy-dir.py']
>>>
>>> glob.glob(b'.\*')[:3]
[b'.\\bigext-out.txt', b'.\\bigext-tree.py', b'.\\bigpy-dir.py']

Given a normal string name (as a command-line argument, for example), you can force the issue by converting to byte strings with manual encoding to suppress decoding:

>>> name = '.'
>>> os.listdir(name.encode())[:4]
[b'bigext-out.txt', b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py']

The upshot is that if your directories may contain names which cannot be decoded according to the underlying platform’s Unicode encoding scheme, you may need to pass byte strings to these tools to avoid Unicode encoding errors. You’ll get byte strings back, which may be less readable if printed, but you’ll avoid errors while traversing directories and files.

This might be especially useful on systems that use simple encodings such as ASCII or Latin-1, but may contain files with arbitrarily encoded names from cross-machine copies, the Web, and so on. Depending upon context, exception handlers may be used to suppress some types of encoding errors as well.
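For instance, one hedged form such a handler might take—print the names that can be encoded for display, and fall back on an escaped representation for those that cannot (the ascii fallback is simply one choice here):

import os
for (dirname, subshere, fileshere) in os.walk('.'):
    for fname in fileshere:
        path = os.path.join(dirname, fname)
        try:
            print(path)                        # may fail for unusual names
        except UnicodeEncodeError:
            print(ascii(path))                 # escaped, but always printable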

We’ll see an example of how this can matter in the first section of Chapter 6, where an undecodable directory name generates an error if printed during a full disk scan (although that specific error seems more related to printing than to decoding in general).

Note that the basic open built-in function allows the name of the file being opened to be passed as either Unicode str or raw bytes, too, though this is used only to name the file initially; the additional mode argument determines whether the file’s content is handled in text or binary modes. Passing a byte string filename allows you to name files with arbitrarily encoded names.

Unicode policies: File content versus file names

In fact, it’s important to keep in mind that there are two different Unicode concepts related to files: the encoding of file content and the encoding of file names. Python provides your platform’s defaults for these settings in two different attributes; on Windows 7:

>>> import sys
>>> sys.getdefaultencoding()          # file content encoding, platform default
'utf-8'
>>> sys.getfilesystemencoding()       # file name encoding, platform scheme
'mbcs'

These settings allow you to be explicit when needed—the content encoding is used when data is read and written to the file, and the name encoding is used when dealing with names prior to transferring data. In addition, using bytes for file name tools may work around incompatibilities with the underlying file system’s scheme, and opening files in binary mode can suppress Unicode decoding errors for content.
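A short sketch of being explicit on both fronts, assuming UTF-8 text files whose names are fetched in bytes mode to sidestep name decoding altogether:

import os
for name in os.listdir(b'.'):                        # bytes: names not decoded
    if name.endswith(b'.txt'):
        text = open(name, encoding='utf8').read()    # explicit content encoding
        print(text)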

As we’ve seen, though, opening text files in binary mode may also mean that the raw and still-encoded text will not match search strings as expected: search strings must also be byte strings encoded per a specific and possibly incompatible encoding scheme. In fact, this approach essentially mimics the behavior of text files in Python 2.X, and underscores why elevating Unicode in 3.X is generally desirable—such text files sometimes may appear to work even though they probably shouldn’t. On the other hand, opening text in binary mode to suppress Unicode content decoding and avoid decoding errors might still be useful if you do not wish to skip undecodable files and content is largely irrelevant.

As a rule of thumb, you should try to always provide an encoding name for text content if it might be outside the platform default, and you should rely on the default Unicode API for file names in most cases. Again, see Python’s manuals for more on the Unicode file name story than we have space to cover fully here, and see Learning Python, Fourth Edition, for more on Unicode in general.

In Chapter 6, we’re going to put the tools we met in this chapter to realistic use. For example, we’ll apply file and directory tools to implement file splitters, testing systems, directory copies and compares, and a variety of utilities based on tree walking. We’ll find that Python’s directory tools we met here have an enabling quality that allows us to automate a large set of real-world tasks. First, though, Chapter 5 concludes our basic tool survey, by exploring another system topic that tends to weave its way into a wide variety of application domains—parallel processing in Python.



[9] For instance, to process pipes, described in Chapter 5. The Python os.pipe call returns two file descriptors, which can be processed with os module file tools or wrapped in a file object with os.fdopen. When used with descriptor-based file tools in os, pipes deal in byte strings, not text. Some device files may require lower-level control as well.

[10] For related tools, see also the shutil module in Python’s standard library; it has higher-level tools for copying and removing files and more. We’ll also write directory compare, copy, and search tools of our own in Chapter 6, after we’ve had a chance to study the directory tools presented later in this chapter.

[11] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description in Chapter 6’s find module example for more details.
