Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory—a “folder” in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might otherwise have to run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you’ve forgotten where it is used). There are many platform-specific ways to do this (e.g., the find and grep commands in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works—Windows, Unix, Linux, Macintosh, and just about any other platform commonly used today. If you simply copy your script to any machine you wish to use it on, it will work regardless of which other tools are available there; all you need is Python. Moreover, coding such tasks in Python also allows you to perform arbitrary actions along the way—replacements, deletions, and whatever else you can code in the Python language.

Walking One Directory

The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop or other iteration tool, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. For scanning directories there are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

Running shell listing commands with os.popen

How did you go about getting directory file listings before you heard of Python? If you’re new to shell tools programming, the answer may be “Well, I started a Windows file explorer and clicked on things,” but I’m thinking here in terms of less GUI-oriented command-line mechanisms.

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line that we can type in a shell, this is the most general way to grab a directory listing inside a Python program. We met os.popen in the prior chapters; it runs a shell command string and gives us a file object from which we can read the command’s output. To illustrate, let’s first assume the following directory structure—I have both the usual dir and a Unix-like ls command from Cygwin on my Windows laptop:

c:\temp> dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

c:\temp> c:\cygwin\bin\ls
PP3E  parts  random.bin  spam.txt  temp.bin  temp.txt

c:\temp> c:\cygwin\bin\ls parts
part0001  part0002  part0003  part0004

The parts and PP3E names are nested subdirectories in C:\temp here (the latter is a copy of the prior edition’s examples tree, which I used occasionally in this text). Now, as we’ve seen, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):

C:\temp> python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']

Lines read from a shell command come back with a trailing end-of-line character, but it’s easy enough to slice it off; the os.popen result also gives us a line iterator just like normal files:

>>> for line in os.popen('dir /B'):
...     print(line[:-1])
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']

For pipe objects, the effect of iterators may be even more useful than simply avoiding loading the entire result into memory all at once: readlines will always block the caller until the spawned program is completely finished, whereas the iterator might not.
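
To see the difference in action, here is a minimal sketch that consumes a command’s output one line at a time as it becomes available, rather than waiting for readlines to gather everything; it simply reuses the dir and ls command lines shown above, picking one based on the platform:

import os, sys

# choose a platform-specific listing command (the same ones used in this section)
listcmd = 'dir /B' if sys.platform.startswith('win') else 'ls'

for line in os.popen(listcmd):      # the pipe object is its own line iterator
    print(line.rstrip('\n'))        # strip the trailing end-of-line character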

The dir and ls commands let us use name patterns to be specific about both the filenames to be matched and the directory to be listed; again, we’re just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']

>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']

>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

These calls use general tools and work as advertised. As I noted earlier, though, the downsides of os.popen are that it requires using a platform-specific shell command and it incurs a performance hit to start up an independent program. In fact, different listing tools may sometimes produce different results:

>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']

The next two alternative techniques do better on both counts.

The glob module

The term globbing comes from the * wildcard character in filename patterns; per computing folklore, a * matches a “glob” of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory—files and subdirectories—whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling the standard library’s glob.glob—a tool that accepts a filename pattern to expand and returns a list (not a generator) of matching file names:

>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']

>>> glob.glob('*.bin')
['random.bin', 'temp.bin']

>>> glob.glob('parts')
['parts']

>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

>>> glob.glob(r'parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

The glob call accepts the usual filename pattern syntax used in shells: ? means any one character, * means any number of characters, and [] is a character selection set.[11] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call is implemented without spawning a shell command (it uses os.listdir, described in the next section) and so is likely to be faster and more portable and uniform across all Python platforms than the os.popen schemes shown earlier.
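
For reference, the other pattern forms mentioned here work the same way; in the C:\temp directory shown earlier, results along the following lines come back (the exact order may vary by platform and Python version):

>>> glob.glob('temp.???')                     # ? matches any single character
['temp.bin', 'temp.txt']
>>> glob.glob('*.[bt]??')                     # [] selects from a set of characters
['random.bin', 'spam.txt', 'temp.bin', 'temp.txt']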

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)

Here, we get back filenames from two different directories that match the s*.py pattern; because the directory name preceding this is a * wildcard, Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands achieves the same effect, but only if the underlying shell or listing command does, too, and with possibly different result formats across tools and platforms.
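
If you don’t happen to have a tree like this on your machine, the following throwaway session builds a tiny one and runs a cross-directory pattern over it; the globtest, pkg1, pkg2, and spam.py names are invented purely for illustration, and the result order may vary:

>>> import os, glob
>>> for sub in ('pkg1', 'pkg2'):                               # build a two-branch tree
...     os.makedirs(os.path.join('globtest', sub))
...     open(os.path.join('globtest', sub, 'spam.py'), 'w').close()
>>> glob.glob(r'globtest\*\s*.py')                             # * at the directory level too
['globtest\\pkg1\\spam.py', 'globtest\\pkg2\\spam.py']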

The os.listdir call

The os module’s listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory—both simple files and nested directories—for use in the calling script:

>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

This, too, is done without resorting to shell commands and so is both fast and portable to all major Python platforms. The result comes back in no particular order across platforms (but can be sorted with the list sort method or the sorted built-in function); it contains base filenames without their directory path prefixes; it does not include the names “.” or “..” even if present; and it includes the names of both files and directories at the listed level.
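
When a fixed order matters, simply sort the result, either in place with the list sort method or to a new list with the sorted built-in; for instance, in the C:\temp directory used above (note that the default string sort is case sensitive):

>>> sorted(os.listdir('.'))                       # uppercase sorts before lowercase
['PP3E', 'parts', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> sorted(os.listdir('.'), key=str.lower)        # case-insensitive ordering
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']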

To compare all three listing techniques, let’s run them here side by side on an explicit directory. They differ in some ways but are mostly just variations on a theme for this task—os.popen returns end-of-lines and may sort filenames on some platforms, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

Of these three, glob and listdir are generally better options if you care about script portability and result uniformity, and listdir seems fastest in recent Python releases (but gauge its performance yourself—implementations may change over time).
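
If you’d like to gauge this on your own machine, the standard library’s timeit module gives a quick, if rough, comparison; the directory name here is just the parts folder used in this section, the dir command assumes a Windows shell, and the numbers printed will vary widely by platform and Python version:

# rough timing sketch: compare the three listing techniques on one directory
import timeit

print(timeit.timeit("os.listdir('parts')", setup='import os', number=1000))
print(timeit.timeit("glob.glob('parts/*')", setup='import glob', number=1000))
print(timeit.timeit("os.popen('dir parts /B').read()", setup='import os', number=100))   # Windows shell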

Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, whereas listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so that it can add different directory names on the front:

>>> dirname = r'C:\temp\parts'
>>> import glob, os
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name before passing filenames on to other tools:

>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004

When you begin writing realistic directory processing tools of the sort we’ll develop in Chapter 6, you’ll find these calls to be almost habit.

Walking Directory Trees

You may have noticed that almost all of the techniques in this section so far return the names of files in only a single directory (globbing with more involved patterns is the only exception). That’s fine for many tasks, but what if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?

For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that’s tedious, error prone, and just plain not fun.

Luckily, in Python it’s almost as easy to process a directory tree as it is to inspect a single directory. We can either write a recursive routine to traverse the tree, or use a tree-walker utility built into the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that’s just about everywhere).

The os.walk visitor

To make it easy to apply an operation to all files in a complete directory tree, Python comes with a utility that scans trees for us and runs code we provide at every directory along the way: the os.walk function is called with a directory root name and automatically walks the entire tree at root and below.

Operationally, os.walk is a generator function—at each directory in the tree, it yields a three-item tuple, containing the name of the current directory as well as lists of both all the files and all the subdirectories in the current directory. Because it’s a generator, its walk is usually run by a for loop (or other iteration tool); on each iteration, the walker advances to the next subdirectory, and the loop runs its code for the next level of the tree (for instance, opening and searching all the files at that level).

That description might sound complex the first time you hear it, but os.walk is fairly straightforward once you get the hang of it. In the following, for example, the loop body’s code is run for each directory in the tree rooted at the current working directory (.). Along the way, the loop simply prints the directory name and all the files at the current level after prepending the directory name. It’s simpler in Python than in English (I removed the PP3E subdirectory for this test to keep the output short):

>>> import os
>>> for (dirname, subshere, fileshere) in os.walk('.'):
...     print('[' + dirname + ']')
...     for fname in fileshere:
...         print(os.path.join(dirname, fname))        # handle one file
[.]
.\random.bin
.\spam.txt
.\temp.bin
.\temp.txt
[.\parts]
.\parts\part0001
.\parts\part0002
.\parts\part0003
.\parts\part0004

In other words, we’ve coded our own custom and easily changed recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let’s make it permanently available in a module file, as shown in Example 4-4, now that we’ve worked out the details interactively.

Example 4-4. PP4E\System\Filetools\

"list file tree with os.walk"

import sys, os

def lister(root):                                           # for a root dir
    for (thisdir, subshere, fileshere) in os.walk(root):    # generate dirs in tree
        print('[' + thisdir + ']')
        for fname in fileshere:                             # print files in this dir
            path = os.path.join(thisdir, fname)             # add dir name prefix
            print(path)

if __name__ == '__main__':
    lister(sys.argv[1])                                     # dir name in cmdline

When packaged this way, the code can also be run from a shell command line. Here it is being launched with the root directory to be listed passed in as a command-line argument:

C:\...\PP4E\System\Filetools> python C:\temp\test

Here’s a more involved example of os.walk in action. Suppose you have a directory tree of files and you want to find all Python source files within it that reference the mimetypes module we’ll study in Chapter 6. The following is one (albeit hardcoded and overly specific) way to accomplish this task:

>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'mimetypes' in open(pathname).read():
...                 matches.append(pathname)
>>> for name in matches: print(name)

This code loops through all of the files at each level, looking for any whose names end in .py and whose content contains the search string. When a match is found, its full name is appended to the results list object; alternatively, we could also simply build a list of all .py files and search each in a for loop after the walk, as sketched below. Since we’re going to code a much more general solution to this type of problem in Chapter 6, though, we’ll let this stand for now.
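
The two-pass alternative just mentioned gathers the matching filenames first and scans their content afterward; a quick sketch, using the same hardcoded root directory:

>>> pyfiles = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pyfiles.append(os.path.join(dirname, filename))
>>> matches = [name for name in pyfiles if 'mimetypes' in open(name).read()]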

If you want to see what’s really going on in the os.walk generator, call its __next__ method (or equivalently, pass it to the next built-in function) manually a few times, just as the for loop does automatically; each time, you advance to the next subdirectory in the tree:

>>> gen = os.walk(r'C:\temp\test')
>>> gen.__next__()
('C:\\temp\\test', ['parts'], ['random.bin', 'spam.txt', 'temp.bin', 'temp.txt'])
>>> gen.__next__()
('C:\\temp\\test\\parts', [], ['part0001', 'part0002', 'part0003', 'part0004'])
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

The library manual documents os.walk further than we will here. For instance, it supports bottom-up instead of top-down walks with its optional topdown=False argument, and callers may prune tree branches by deleting names in the subdirectories lists of the yielded tuples.

Internally, the os.walk call generates filename lists at each level with the os.listdir call we met earlier, which collects both file and directory names in no particular order and returns them without their directory paths; os.walk segregates this list into subdirectories and files (technically, nondirectories) before yielding a result. Also note that walk uses the very same subdirectories list it yields to callers in order to later descend into subdirectories. Because lists are mutable objects that can be changed in place, if your code modifies the yielded subdirectory names list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.
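
Here is a brief sketch of both ideas at work: ordering each level’s visit and pruning branches we don’t care about by changing the yielded subdirectories list in place (the directory names to skip here are made up for this example, and pruning works only in the default top-down mode):

# prune and order an os.walk traversal by changing the subdirs list in place
import os

skipdirs = ('.svn', '__pycache__')                        # hypothetical branches to prune
for (dirname, subshere, fileshere) in os.walk('.'):
    subshere[:] = sorted(d for d in subshere if d not in skipdirs)   # in-place change
    print(dirname)                                        # pruned/sorted before descent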

Recursive os.listdir traversals

The os.walk tool does the work of tree traversals for us; we simply provide loop code with task-specific logic. However, it’s sometimes more flexible and hardly any more work to do the walking ourselves. The following script recodes the directory listing script with a manual recursive traversal function (a function that calls itself to repeat its actions). The mylister function in Example 4-5 is almost the same as lister in Example 4-4 but calls os.listdir to generate file paths manually and calls itself recursively to descend into subdirectories.

Example 4-5. PP4E\System\Filetools\

# list files in dir tree by recursion

import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    for file in os.listdir(currdir):              # list files here
        path = os.path.join(currdir, file)        # add dir path back
        if not os.path.isdir(path):
            print(path)                           # print simple file names
        else:
            mylister(path)                        # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                         # dir name in cmdline

As usual, this file can be both imported and called or run as a script, though the fact that its result is printed text makes it less useful as an imported component unless its output stream is captured by another program.

When run as a script, this file’s output is equivalent to that of Example 4-4, but not identical—unlike the os.walk version, our recursive walker here doesn’t order the walk to visit files before stepping into subdirectories. It could do so by looping through the filenames list twice, selecting files first (a variant is sketched after the following command line), but as coded, the order depends on os.listdir results. For most use cases, the walk order is irrelevant:

C:\...\PP4E\System\Filetools> python C:\temp\test
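
For comparison, here is one way to code the two-pass variant just described, so that each directory’s files are reported before the walker descends into its subdirectories; this is a sketch of the idea, not part of the book’s examples package:

# variant of the recursive lister: visit files first, then recur into subdirectories
import os

def mylister2(currdir):
    print('[' + currdir + ']')
    paths = [os.path.join(currdir, name) for name in os.listdir(currdir)]
    for path in paths:                                # pass 1: simple files only
        if not os.path.isdir(path):
            print(path)
    for path in paths:                                # pass 2: descend into subdirs
        if os.path.isdir(path):
            mylister2(path)

Called on the same root directory, this version reports each level’s files before any of that level’s subdirectories, much like the os.walk-based lister.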

We’ll make better use of most of this section’s techniques in later examples in Chapter 6 and in this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques introduced here. Watch for these tools in action along the way. We’ll also code a find utility in Chapter 6 that combines the tree traversal of os.walk with the filename pattern expansion of glob.glob.

Handling Unicode Filenames in 3.X: listdir, walk, glob

Because all normal strings are Unicode in Python 3.X, the directory and file names generated by os.listdir, os.walk, and glob.glob so far in this chapter are technically Unicode strings. This can have some ramifications if your directories contain unusual names that might not decode properly.

Technically, because filenames may contain arbitrary text, os.listdir works in two modes in 3.X: given a bytes argument, this function returns filenames as encoded byte strings; given a normal str argument, it instead returns filenames as Unicode strings, decoded per the filesystem’s encoding scheme:

C:\...\PP4E\System\Filetools> python
>>> import os
>>> os.listdir('.')[:4]
['', '', '', '']

>>> os.listdir(b'.')[:4]
[b'', b'', b'', b'']

The byte string version can be used if undecodable file names may be present. Because os.walk and glob.glob both work by calling os.listdir internally, they inherit this behavior by proxy. The os.walk tree walker, for example, calls os.listdir at each directory level; passing byte string arguments suppresses decoding and returns byte string results:

>>> for (dir, subs, files) in os.walk('..'): print(dir)

>>> for (dir, subs, files) in os.walk(b'..'): print(dir)

The glob.glob tool similarly calls os.listdir internally before applying name patterns, and so also returns undecoded byte string names for byte string arguments:

>>> glob.glob('.\*')[:3]
['.\\bigext-out.txt', '.\\', '.\\']
>>> glob.glob(b'.\*')[:3]
[b'.\\bigext-out.txt', b'.\\', b'.\\']

Given a normal string name (as a command-line argument, for example), you can force the issue by converting to byte strings with manual encoding to suppress decoding:

>>> name = '.'
>>> os.listdir(name.encode())[:4]
[b'bigext-out.txt', b'', b'', b'']

The upshot is that if your directories may contain names which cannot be decoded according to the underlying platform’s Unicode encoding scheme, you may need to pass byte strings to these tools to avoid Unicode encoding errors. You’ll get byte strings back, which may be less readable if printed, but you’ll avoid errors while traversing directories and files.

This might be especially useful on systems that use simple encodings such as ASCII or Latin-1, but may contain files with arbitrarily encoded names from cross-machine copies, the Web, and so on. Depending upon context, exception handlers may be used to suppress some types of encoding errors as well.
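
As a brief illustration of that last point, the following sketch walks with byte strings and falls back to the raw bytes form for any name that fails to decode, rather than letting an exception end the traversal; the fallback policy here is just one of many possibilities:

import os, sys

fsenc = sys.getfilesystemencoding()                     # platform's file name scheme
for (dirname, subshere, fileshere) in os.walk(b'.'):    # bytes input: names stay bytes
    for fname in fileshere:
        path = os.path.join(dirname, fname)
        try:
            print(path.decode(fsenc))                   # show decodable names as text
        except UnicodeDecodeError:
            print(path)                                 # keep undecodable names as bytes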

We’ll see an example of how this can matter in the first section of Chapter 6, where an undecodable directory name generates an error if printed during a full disk scan (although that specific error seems more related to printing than to decoding in general).

Note that the basic open built-in function allows the name of the file being opened to be passed as either Unicode str or raw bytes, too, though this is used only to name the file initially; the additional mode argument determines whether the file’s content is handled in text or binary modes. Passing a byte string filename allows you to name files with arbitrarily encoded names.
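
A short sketch of that distinction, using the spam.txt file from the listings earlier in this section; the name argument and the mode argument are independent choices:

>>> data = open(b'spam.txt', 'rb').read()      # bytes name, binary mode: content is bytes
>>> text = open('spam.txt', 'r').read()        # str name, text mode: content is decoded str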

Unicode policies: File content versus file names

In fact, it’s important to keep in mind that there are two different Unicode concepts related to files: the encoding of file content and the encoding of file names. Python provides your platform’s defaults for these settings in two different attributes; on Windows 7:

>>> import sys
>>> sys.getdefaultencoding()          # file content encoding, platform default
'utf-8'
>>> sys.getfilesystemencoding()       # file name encoding, platform scheme
'mbcs'

These settings allow you to be explicit when needed—the content encoding is used when data is read and written to the file, and the name encoding is used when dealing with names prior to transferring data. In addition, using bytes for file name tools may work around incompatibilities with the underlying file system’s scheme, and opening files in binary mode can suppress Unicode decoding errors for content.

As we’ve seen, though, opening text files in binary mode may also mean that the raw and still-encoded text will not match search strings as expected: search strings must also be byte strings encoded per a specific and possibly incompatible encoding scheme. In fact, this approach essentially mimics the behavior of text files in Python 2.X, and underscores why elevating Unicode in 3.X is generally desirable—such text files sometimes may appear to work even though they probably shouldn’t. On the other hand, opening text in binary mode to suppress Unicode content decoding and avoid decoding errors might still be useful if you do not wish to skip undecodable files and content is largely irrelevant.

As a rule of thumb, you should try to always provide an encoding name for text content if it might be outside the platform default, and you should rely on the default Unicode API for file names in most cases. Again, see Python’s manuals for more on the Unicode file name story than we have space to cover fully here, and see Learning Python, Fourth Edition, for more on Unicode in general.
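
In code form, that rule of thumb looks like the following sketch, which repeats the earlier tree search but names the content encoding explicitly while leaving file names to the default str-based interface; the utf-8 choice is an assumption about the files being scanned:

# search a tree, naming the content encoding explicitly (assumes UTF-8 sources)
import os

matches = []
for (dirname, subshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
    for fname in fileshere:
        if fname.endswith('.py'):
            path = os.path.join(dirname, fname)              # default str file names
            text = open(path, encoding='utf-8').read()       # explicit content encoding
            if 'mimetypes' in text:
                matches.append(path)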

In Chapter 6, we’re going to put the tools we met in this chapter to realistic use. For example, we’ll apply file and directory tools to implement file splitters, testing systems, directory copies and compares, and a variety of utilities based on tree walking. We’ll find that the Python directory tools we met here have an enabling quality that allows us to automate a large set of real-world tasks. First, though, Chapter 5 concludes our basic tool survey by exploring another system topic that tends to weave its way into a wide variety of application domains—parallel processing in Python.

[11] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description in Chapter 6’s find module example for more details.
