File Tools

External files are at the heart of much of what we do with system utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It’s a core programming concept.

In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. When called, the open function returns a new file object that is connected to the external file; the file object has methods that transfer data to and from the file and perform a variety of file-related operations. The open function also provides a portable interface to the underlying filesystem—it works the same way on every platform on which Python runs.

Other file-related modules built into Python allow us to do things such as manipulate lower-level descriptor-based files (os); copy, remove, and move files and collections of files (os and shutil); store data and objects in files by key (dbm and shelve); and access SQL databases (sqlite3 and third-party add-ons). The last two of these categories are related to database topics, addressed in Chapter 17.
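
As a brief, hedged preview of a few of these modules at work (the filenames here are illustrative only, and the key-based stores are covered in Chapter 17):

import os, shutil, dbm

shutil.copy('data.txt', 'backup.txt')        # copy a file with shutil
os.remove('backup.txt')                      # delete it again with os

db ='keyfile', 'c')                  # store data in a file by key
db['spam'] = 'ham'
db.close()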

In this section, we’ll take a brief tutorial look at the built-in file object and explore a handful of more advanced file-related topics. As usual, you should consult either Python’s library manual or reference books such as Python Pocket Reference for further details and methods we don’t have space to cover here. Remember, for quick interactive help, you can also run dir(file) on an open file object to see an attributes list that includes methods; help(file) for general help; and help( for help on a specific method such as read, though the file object implementation in 3.1 provides less information for help than the library manual and other resources.

The File Object Model in Python 3.X

Just like the string types we noted in Chapter 2, file support in Python 3.X is a bit richer than it was in the past. As we noted earlier, in Python 3.X str strings always represent Unicode text (ASCII or wider), and bytes and bytearray strings represent raw binary data. Python 3.X draws a similar and related distinction between files containing text and binary data:

  • Text files contain Unicode text. In your script, text file content is always a str string—a sequence of characters (technically, Unicode “code points”). Text files perform the automatic line-end translations described in this chapter by default and automatically apply Unicode encodings to file content: they encode to and decode from raw binary bytes on transfers to and from the file, according to a provided or default encoding name. Encoding is trivial for ASCII text, but may be sophisticated in other cases.

  • Binary files contain raw 8-bit bytes. In your script, binary file content is always a byte string, usually a bytes object—a sequence of small integers, which supports most str operations and displays as ASCII characters whenever possible. Binary files perform no translations of data when it is transferred to and from files: no line-end translations or Unicode encodings are performed.

In practice, text files are used for all truly text-related data, and binary files store items like packed binary data, images, audio files, executables, and so on. As a programmer you distinguish between the two file types in the mode string argument you pass to open: adding a “b” (e.g., 'rb', 'wb') means the file contains binary data. For coding new file content, use normal strings for text (e.g., 'spam' or bytes.decode()) and byte strings for binary (e.g., b'spam' or str.encode()).

Unless your file scope is limited to ASCII text, the 3.X text/binary distinction can sometimes impact your code. Text files create and require str strings, and binary files use byte strings; because you cannot freely mix the two string types in expressions, you must choose file mode carefully. Many built-in tools we’ll use in this book make the choice for us; the struct and pickle modules, for instance, deal in byte strings in 3.X, and the xml package in Unicode str. You must even be aware of the 3.X text/binary distinction when using system tools like pipe descriptors and sockets, because they transfer data as byte strings today (though their content can be decoded and encoded as Unicode text if needed).

Moreover, because text-mode files require that content be decodable per a Unicode encoding scheme, you must read undecodable file content in binary mode, as byte strings (or catch Unicode exceptions in try statements and skip the file altogether). This may include both truly binary files as well as text files that use encodings that are nondefault and unknown. As we’ll see later in this chapter, because str strings are always Unicode in 3.X, it’s sometimes also necessary to select byte string mode for the names of files in directory tools such as os.listdir, glob.glob, and os.walk if they cannot be decoded (passing in byte strings essentially suppresses decoding).
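
For instance, here is a hedged sketch of the fallback pattern just described, using a hypothetical filename:

try:
    text = open('unknown.txt').read()            # try the platform default encoding
except UnicodeDecodeError:
    data = open('unknown.txt', 'rb').read()      # fall back to raw byte strings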

In fact, we’ll see examples where the Python 3.X distinction between str text and bytes binary pops up in tools beyond basic files throughout this book—in Chapters 5 and 12 when we explore sockets; in Chapters 6 and 11 when we’ll need to ignore Unicode errors in file and directory searches; in Chapter 12, where we’ll see how client-side Internet protocol modules such as FTP and email, which run atop sockets, imply file modes and encoding requirements; and more.

But just as for string types, although we will see some of these concepts in action in this chapter, we’re going to take much of this story as a given here. File and string objects are core language material and are prerequisite to this text. As mentioned earlier, because they are addressed by a 45-page chapter in the book Learning Python, Fourth Edition, I won’t repeat their coverage in full in this book. If you find yourself confused by the Unicode and binary file and string concepts in the following sections, I encourage you to refer to that text or other resources for more background information in this domain.

Using Built-in File Objects

Despite the text/binary dichotomy in Python 3.X, files are still very straightforward to use. For most purposes, in fact, the open built-in function and its file objects are all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines); writing data (write, writelines); freeing system resources (close); moving to arbitrary positions in the file (seek); forcing data in output buffers to be transferred to disk (flush); fetching the underlying file handle (fileno); and more. Since the built-in file object is so easy to use, let’s jump right into a few interactive examples.

Output files

To make a new file, call open with two arguments: the external name of the file to be created and a mode string w (short for write). To store data on the file, call the file object’s write method with a string containing the data to store, and then call the close method to close the file. File write calls return the number of characters or bytes written (which we’ll sometimes omit in this book to save space), and as we’ll see, close calls are often optional, unless you need to open and read the file again during the same program or session:

C:\temp> python
>>> file = open('data.txt', 'w')            # open output file object: creates
>>> file.write('Hello file world!\n')       # writes strings verbatim
>>> file.write('Bye   file world.\n')       # returns number chars/bytes written
>>> file.close()                            # closed on gc and exit too

And that’s it—you’ve just generated a brand-new text file on your computer, regardless of the computer on which you type this code:

C:\temp> dir data.txt /B
data.txt

C:\temp> type data.txt
Hello file world!
Bye   file world.

There is nothing unusual about the new file; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI, too.


In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string. If we pass just a simple filename without a path, the file will appear in Python’s current working directory. That is, it shows up in the place where the code is run. Here, the directory C:\temp on my machine is implied by the bare filename data.txt, so this actually creates a file at C:\temp\data.txt. More accurately, the filename is relative to the current working directory if it does not include a complete absolute directory path. See Current Working Directory (Chapter 3), for a refresher on this topic.

Also note that when opening in w mode, Python either creates the external file if it does not yet exist or erases the file’s current contents if it is already present on your machine (so be careful out there—you’ll delete whatever was in the file before).


Notice that we added an explicit \n end-of-line character to lines written to the file; unlike the print built-in function, file object write methods write exactly what they are passed without adding any extra formatting. The string passed to write shows up character for character on the external file. In text files, data written may undergo line-end or Unicode translations which we’ll describe ahead, but these are undone when the data is later read back.

Output files also sport a writelines method, which simply writes all of the strings in a list one at a time without adding any extra formatting. For example, here is a writelines equivalent to the two write calls shown earlier:

file.writelines(['Hello file world!\n', 'Bye   file world.\n'])

This call isn’t as commonly used (and can be emulated with a simple for loop or other iteration tool), but it is convenient in scripts that save output in a list to be written later.
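
For illustration, the simple loop emulation just mentioned might look like this:

lines = ['Hello file world!\n', 'Bye   file world.\n']
file  = open('data.txt', 'w')
for line in lines:
    file.write(line)                   # same net effect as file.writelines(lines)
file.close()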


The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (that is, when it is no longer referenced). This includes all remaining open files when the Python session or program exits. Because of that, close calls are often optional. In fact, it’s common to see file-processing code in Python in this idiom:

open('somefile.txt', 'w').write("G'day Bruce\n")       # write to temporary object
open('somefile.txt', 'r').read()                       # read from temporary object

Since both these expressions make a temporary file object, use it immediately, and do not save a reference to it, the file object is reclaimed right after data is transferred, and is automatically closed in the process. There is usually no need for such code to call the close method explicitly.

In some contexts, though, you may wish to explicitly close anyhow:

  • For one, because the Jython implementation relies on Java’s garbage collector, you can’t always be as sure about when files will be reclaimed as you can in standard Python. If you run your Python code with Jython, you may need to close manually if many files are created in a short amount of time (e.g., in a loop), in order to avoid running out of file resources on operating systems where this matters.

  • For another, some IDEs, such as Python’s standard IDLE GUI, may hold on to your file objects longer than you expect (in stack tracebacks of prior errors, for instance), and thus prevent them from being garbage collected as soon as you might expect. If you write to an output file in IDLE, be sure to explicitly close (or flush) your file if you need to reliably read it back during the same IDLE session. Otherwise, output buffers might not be flushed to disk and your file may be incomplete when read.

  • And while it seems very unlikely today, it’s not impossible that this auto-close-on-reclaim feature could change in the future. This is technically a feature of the file object’s implementation, which may or may not be considered part of the language definition over time.

For these reasons, manual close calls are not a bad idea in nontrivial programs, even if they are technically not required. Closing is a generally harmless but robust habit to form.

Ensuring file closure: Exception handlers and context managers

Manual file close method calls are easy in straight-line code, but how do you ensure file closure when exceptions might kick your program beyond the point where the close call is coded? First of all, make sure you must—files close themselves when they are collected, and this will happen eventually, even when exceptions occur.

If closure is required, though, there are two basic alternatives: the try statement’s finally clause is the most general, since it allows you to provide general exit actions for any type of exceptions:

myfile = open(filename, 'w')
try:
    ...process myfile...
finally:
    myfile.close()

In recent Python releases, though, the with statement provides a more concise alternative for some specific objects and exit actions, including closing files:

with open(filename, 'w') as myfile:
    ...process myfile, auto-closed on statement exit...

This statement relies on the file object’s context manager: code automatically run both on statement entry and on statement exit regardless of exception behavior. Because the file object’s exit code closes the file automatically, this guarantees file closure whether an exception occurs during the statement or not.
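
To make the protocol more concrete, here is a minimal sketch of a user-defined context manager; the class name is invented for illustration, but built-in file objects implement the same two methods internally:

class AutoClose:
    def __init__(self, name, mode='r'):
        self.file = open(name, mode)
    def __enter__(self):
        return self.file                         # run on statement entry
    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()                        # run on exit, exception or not
        return False                             # don't suppress exceptions

with AutoClose('data.txt', 'w') as myfile:
    myfile.write('Hello file world!\n')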

The with statement is notably shorter (3 lines) than the try/finally alternative, but it’s also less general—with applies only to objects that support the context manager protocol, whereas try/finally allows arbitrary exit actions for arbitrary exception contexts. While some other object types have context managers, too (e.g., thread locks), with is limited in scope. In fact, if you want to remember just one option for exit actions, try/finally is the most inclusive. Still, with yields less code for files that must be closed and can serve well in such specific roles. It can even save a line of code when no exceptions are expected (albeit at the expense of further nesting and indenting file processing logic):

myfile = open(filename, 'w')               # traditional form
...process myfile...
myfile.close()

with open(filename) as myfile:             # context manager form
    ...process myfile...

In Python 3.1 and later, this statement can also specify multiple (a.k.a. nested) context managers—any number of context manager items may be separated by commas, and multiple items work the same as nested with statements. In general terms, the 3.1 and later code:

with A() as a, B() as b:
    ...statements...

Runs the same as the following, which works in 3.1, 3.0, and 2.6:

with A() as a:
    with B() as b:
        ...statements...

For example, when the with statement block exits in the following, both files’ exit actions are automatically run to close the files, regardless of exception outcomes:

with open('data') as fin, open('results', 'w') as fout:
    for line in fin:
        fout.write(line.upper())           # process lines; both auto-closed on exit

Context manager–dependent code like this seems to have become more common in recent years, but this is likely at least in part because newcomers are accustomed to languages that require manual close calls in all cases. In most contexts there is no need to wrap all your Python file-processing code in with statements—the file object’s auto-close-on-collection behavior often suffices, and manual close calls are enough for many other scripts. You should use the with or try options outlined here only if you must close, and only in the presence of potential exceptions. Since standard C Python automatically closes files on collection, though, neither option is required in many (and perhaps most) scripts.

Input files

Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of r (for “read”) or no mode flag at all—it defaults to r if omitted, and it commonly is. Once opened, we can read the lines of a text file with the readlines method:

C:\temp> python
>>> file = open('data.txt')                  # open input file object: 'r' default
>>> lines = file.readlines()                 # read into line string list
>>> for line in lines:                       # BUT use file line iterator! (ahead)
...     print(line, end='')                  # lines have a '\n' at end
Hello file world!
Bye   file world.

The readlines method loads the entire contents of the file into memory and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:


Returns a string containing all the characters (or bytes) stored in the file

Returns a string containing the next N characters (or bytes) from the file

Reads through the next \n and returns a line string

Reads the entire file and returns a list of line strings

Let’s run these method calls to read files, lines, and characters from a text file—the seek(0) call is used here before each test to rewind the file to its beginning (more on this call in a moment):

>>>                               # go back to the front of file
>>>                                # read entire file into string
'Hello file world!\nBye   file world.\n'

>>>                               # read entire file into lines list
>>> file.readlines()
['Hello file world!\n', 'Bye   file world.\n']

>>>
>>> file.readline()                            # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()                            # empty string at end-of-file
''

>>>                               # read N (or remaining) chars/bytes
>>>,
('H', 'ello fil')

All of these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

  • read() and readlines() load the entire file into memory all at once. That makes them handy for grabbing a file’s contents with as little code as possible. It also makes them generally fast, but costly in terms of memory for huge files—loading a multigigabyte file into memory is not generally a good thing to do (and might not be possible at all on a given computer).

  • On the other hand, because the readline() and read(N) calls fetch just part of the file (the next line or N-character-or-byte block), they are safer for potentially big files but a bit less convenient and sometimes slower. Both return an empty string when they reach end-of-file. If speed matters and your files aren’t huge, read or readlines may be a generally better choice; for big files, see the chunk-reading sketch after this list.

  • See also the discussion of the newer file iterators in the next section. As we’ll see, iterators combine the convenience of readlines() with the space efficiency of readline() and are the preferred way to read text files by lines today.
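
Here is a hedged sketch of the chunk-by-chunk pattern alluded to in the second point; the filename and the process function are hypothetical stand-ins:

file = open('big.bin', 'rb')
while True:
    chunk = * 1024)          # up to 64K bytes per read
    if not chunk:
        break                              # empty at end-of-file
    process(chunk)                         # hypothetical per-chunk handler
file.close()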

The seek(0) call used repeatedly here means “go back to the start of the file.” In our example, it is an alternative to reopening the file each time. In files, all read and write operations take place at the current position; files normally start at offset 0 when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation. More on this method later when we explore random access files.

Reading lines with file iterators

In older versions of Python, the traditional way to read a file line by line in a for loop was to read the file into a list that could be stepped through as usual:

>>> file = open('data.txt')
>>> for line in file.readlines():    # DON'T DO THIS ANYMORE!
...     print(line, end='')

If you’ve already studied the core language using a first book like Learning Python, you may already know that this coding pattern is actually more work than is needed today—both for you and your computer’s memory. In recent Pythons, the file object includes an iterator which is smart enough to grab just one line per request in all iteration contexts, including for loops and list comprehensions. The practical benefit of this extension is that you no longer need to call readlines in a for loop to scan line by line—the iterator reads lines on request automatically:

>>> file = open('data.txt')
>>> for line in file:                  # no need to call readlines
...     print(line, end='')            # iterator reads next line each time
Hello file world!
Bye   file world.

Better still, you can open the file in the loop statement itself, as a temporary which will be automatically closed on garbage collection when the loop ends (that’s normally the file’s sole reference):

>>> for line in open('data.txt'):      # even shorter: temporary file object
...     print(line, end='')            # auto-closed when garbage collected
Hello file world!
Bye   file world.

Moreover, this file line-iterator form does not load the entire file into a lines list all at once, so it will be more space efficient for large text files. Because of that, this is the prescribed way to read line by line today. If you want to see what really happens inside the for loop, you can use the iterator manually; it’s just a __next__ method (run by the next built-in function), which is similar to calling the readline method each time through, except that read methods return an empty string at end-of-file (EOF) and the iterator raises an exception to end the iteration:

>>> file = open('data.txt')      # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')      # iterators: exception at EOF
>>> file.__next__()              # no need to call iter(file) first,
'Hello file world!\n'            # since files are their own iterator
>>> file.__next__()
'Bye   file world.\n'
>>> file.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Interestingly, iterators are automatically used in all iteration contexts, including the list constructor call, list comprehension expressions, map calls, and membership checks:

>>> open('data.txt').readlines()                             # always read lines
['Hello file world!\n', 'Bye   file world.\n']

>>> list(open('data.txt'))                                   # force line iteration
['Hello file world!\n', 'Bye   file world.\n']

>>> lines = [line.rstrip() for line in open('data.txt')]     # comprehension
>>> lines
['Hello file world!', 'Bye   file world.']

>>> lines = [line.upper() for line in open('data.txt')]      # arbitrary actions
>>> lines
['HELLO FILE WORLD!', 'BYE   FILE WORLD.']

>>> list(map(str.split, open('data.txt')))                   # apply a function
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]

>>> line = 'Hello file world!\n'
>>> line in open('data.txt')                                 # line membership
True

Iterators may seem somewhat implicit at first glance, but they’re representative of the many ways that Python makes developers’ lives easier over time.

Other open options

Besides the w and (default) r file open modes, most platforms support an a mode string, meaning “append.” In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:

>>> file = open('data.txt', 'a')          # open in append mode: doesn't erase
>>> file.write('The Life of Brian')       # added at end of existing data
>>> file.close()
>>> open('data.txt').read()               # open and read entire file
'Hello file world!\nBye   file world.\nThe Life of Brian'

In fact, although most files are opened using the sorts of calls we just ran, open actually supports additional arguments for more specific processing needs, the first three of which are the most commonly used—the filename, the open mode, and a buffering specification. All but the first of these are optional: if omitted, the open mode argument defaults to r (input), and the buffering policy is to enable full buffering. For special needs, here are a few things you should know about these three open arguments:

Filename


As mentioned earlier, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described in the prior chapter). In general, most filename forms you can type in your system shell will work in an open call. For instance, a relative filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory’s parent—up one, and down to directory temp.

Open mode

The open function accepts other modes, too, some of which we’ll see at work later in this chapter: r+, w+, and a+ to open for reads and writes, and any mode string with a b to designate binary mode. For instance, mode r+ means both reads and writes are allowed on an existing file; w+ allows reads and writes but creates the file anew, erasing any prior content; rb and wb read and write data in binary mode without any translations; and wb+ and r+b both combine binary mode and input plus output. In general, the mode string defaults to r for read but can be w for write and a for append, and you may add a + for update, as well as a b or t for binary or text mode; order is largely irrelevant.

As we’ll see later in this chapter, the + modes are often used in conjunction with the file object’s seek method to achieve random read/write access. Regardless of mode, file contents are always strings in Python programs—read methods return a string, and we pass a string to write methods. As also described later, though, the mode string implies which type of string is used: str for text mode or bytes and other byte string types for binary mode.

Buffering policy

The open call also takes an optional third buffering policy argument which lets you control buffering for the file—the way that data is queued up before being transferred, to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately, but allowed in binary modes only), 1 means they are line buffered, and any other positive value means to use full buffering (which is the default, if no buffering argument is passed).
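
For illustration, here is a hedged sketch of the three buffering forms (filenames invented):

raw  = open('log.bin', 'wb', 0)        # unbuffered: allowed in binary mode only
line = open('log.txt', 'w', 1)         # line buffered: flushed at each newline
full = open('notes.txt', 'w')          # full buffering: the default policy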

As usual, Python’s library manual and reference texts have the full story on additional open arguments beyond these three. For instance, the open call supports additional arguments related to the end-of-line mapping behavior and the automatic Unicode encoding of content performed for text-mode files. Since we’ll discuss both of these concepts in the next section, let’s move ahead.

Binary and Text Files

All of the preceding examples process simple text files, but Python scripts can also open and process files containing binary data—JPEG images, audio clips, packed binary data produced by FORTRAN and C programs, encoded text, and anything else that can be stored in files as bytes. The primary difference in terms of your code is the mode argument passed to the built-in open function:

>>> file = open('data.txt', 'wb')      # open binary output file
>>> file = open('data.txt', 'rb')      # open binary input file

Once you’ve opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. The readline and readlines methods as well as the file’s line iterator still work here for text files opened in binary mode, but they don’t make sense for truly binary data that isn’t line oriented (end-of-line bytes are meaningless, if they appear at all).

In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. For binary mode files, though, file content is represented as byte strings. Continuing with our text file from preceding examples:

>>> open('data.txt').read()                                   # text mode: str
'Hello file world!\nBye   file world.\nThe Life of Brian'

>>> open('data.txt', 'rb').read()                             # binary mode: bytes
b'Hello file world!\r\nBye   file world.\r\nThe Life of Brian'

>>> file = open('data.txt', 'rb')
>>> for line in file: print(line)
b'Hello file world!\r\n'
b'Bye   file world.\r\n'
b'The Life of Brian'

This occurs because Python 3.X treats text-mode files as Unicode, and automatically decodes content on input and encodes it on output. Binary mode files instead give us access to file content as raw byte strings, with no translation of content—they reflect exactly what is stored on the file. Because str strings are always Unicode text in 3.X, the special bytes string is required to represent binary data as a sequence of byte-size integers which may contain any 8-bit value. Because normal and byte strings have almost identical operation sets, many programs can largely take this on faith; but keep in mind that you really must open truly binary data in binary mode for input, because it will not generally be decodable as Unicode text.

Similarly, you must also supply byte strings for binary mode output—normal strings are not raw binary data, but are decoded Unicode characters (a.k.a. code points) which are encoded to binary on text-mode output:

>>> open('data.bin', 'wb').write(b'Spam\n')
5
>>> open('data.bin', 'rb').read()
b'Spam\n'

>>> open('data.bin', 'wb').write('spam\n')
TypeError: must be bytes or buffer, not str

But notice that this file’s line ends with just \n, instead of the Windows \r\n that showed up in the preceding example for the text file in binary mode. Strictly speaking, binary mode disables Unicode encoding translation, but it also prevents the automatic end-of-line character translation performed by text-mode files by default. Before we can understand this fully, though, we need to study the two main ways in which text files differ from binary.

Unicode encodings for text files

As mentioned earlier, text-mode file objects always translate data according to a default or provided Unicode encoding type, when the data is transferred to and from the external file. Their content is encoded on files, but decoded in memory. Binary mode files don’t perform any such translation, which is what we want for truly binary data. For instance, consider the following string, which embeds a Unicode character whose binary value is outside the normal 7-bit range of the ASCII encoding standard:

>>> data = 'sp\xe4m'
>>> data
'späm'
>>> 0xe4, bin(0xe4), chr(0xe4)
(228, '0b11100100', 'ä')

It’s possible to manually encode this string according to a variety of Unicode encoding types—its raw binary byte string form is different under some encodings:

>>> data.encode('latin1')                  # 8-bit characters: ascii + extras
b'sp\xe4m'

>>> data.encode('utf8')                    # 2 bytes for special characters only
b'sp\xc3\xa4m'

>>> data.encode('ascii')                   # does not encode per ascii
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)

Python displays printable characters in these strings normally, but nonprintable bytes show as \xNN hexadecimal escapes which become more prevalent under more sophisticated encoding schemes (cp500 in the following is an EBCDIC encoding):

>>> data.encode('utf16')                   # 2 bytes per character plus preamble
b'\xff\xfes\x00p\x00\xe4\x00m\x00'

>>> data.encode('cp500')                   # an ebcdic encoding: very different
b'\xa2\x97C\x94'

The encoded results here reflect the string’s raw binary form when stored in files. Manual encoding is usually unnecessary, though, because text files handle encodings automatically on data transfers—reads decode and writes encode, according to the encoding name passed in (or a default for the underlying platform: see sys.getdefaultencoding). Continuing our interactive session:

>>> open('data.txt', 'w', encoding='latin1').write(data)
4
>>> open('data.txt', 'r', encoding='latin1').read()
'späm'
>>> open('data.txt', 'rb').read()
b'sp\xe4m'

If we open in binary mode, though, no encoding translation occurs—the last command in the preceding example shows us what’s actually stored on the file. To see how file content differs for other encodings, let’s save the same string again:

>>> open('data.txt', 'w', encoding='utf8').write(data)        # encode data per utf8
4
>>> open('data.txt', 'r', encoding='utf8').read()             # decode: undo encoding
'späm'
>>> open('data.txt', 'rb').read()                             # no data translations
b'sp\xc3\xa4m'

This time, raw file content is different, but text mode’s auto-decoding makes the string the same by the time it’s read back by our script. Really, encodings pertain only to strings while they are in files; once they are loaded into memory, strings are simply sequences of Unicode characters (“code points”). This translation step is what we want for text files, but not for binary. Because binary modes skip the translation, you’ll want to use them for truly binary data. In fact, you usually must—trying to write unencodable data or read undecodable data is an error:

>>> open('data.txt', 'w', encoding='ascii').write(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)

>>> open(r'C:\Python31\python.exe', 'r').read()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2:
character maps to <undefined>

Binary mode is also a last resort for reading text files, if they cannot be decoded per the underlying platform’s default, and the encoding type is unknown—the following recreates the original strings if encoding type is known, but fails if it is not known unless binary mode is used (such failure may occur either on inputting the data or printing it, but it fails nevertheless):

>>> open('data.txt', 'w', encoding='cp500').writelines(['spam\n', 'ham\n'])
>>> open('data.txt', 'r', encoding='cp500').readlines()
['spam\n', 'ham\n']

>>> open('data.txt', 'r').readlines()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2:
character maps to <undefined>

>>> open('data.txt', 'rb').readlines()
[b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%']

>>> open('data.txt', 'rb').read()
b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%'

If all your text is ASCII you generally can ignore encoding altogether; data in files maps directly to characters in strings, because ASCII is a subset of most platforms’ default encodings. If you must process files created with other encodings, and possibly on different platforms (obtained from the Web, for instance), binary mode may be required if encoding type is unknown. Keep in mind, however, that text in still-encoded binary form might not work as you expect: because it is encoded per a given encoding scheme, it might not accurately compare or combine with text encoded in other schemes.
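
For example, the same text encoded under two schemes yields unequal byte strings, though decoding each recovers matching str text:

>>> data = 'sp\xe4m'
>>> data.encode('latin1') == data.encode('utf8')      # same text, different bytes
False
>>> data.encode('latin1').decode('latin1') == data.encode('utf8').decode('utf8')
True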

Again, see other resources for more on Unicode. We’ll revisit the Unicode story at various points in this book, especially in Chapter 9, to see how it relates to the tkinter Text widget, and in Part IV, covering Internet programming, to learn what it means for data shipped over networks by protocols such as FTP, email, and the Web at large. Text files have another feature, though, which is similarly a nonfeature for binary data: line-end translations, the topic of the next section.

End-of-line translations for text files

For historical reasons, the end of a line of text in a file is represented by different characters on different platforms. It’s a single \n character on Unix-like platforms, but the two-character sequence \r\n on Windows. That’s why files moved between Linux and Windows may look odd in your text editor after transfer—they may still be stored using the original platform’s end-of-line convention.

For example, most Windows editors handle text in Unix format, but Notepad has been a notable exception—text files copied from Unix or Linux may look like one long line when viewed in Notepad, with strange characters inside (\n). Similarly, transferring a file from Windows to Unix in binary mode retains the \r characters (which often appear as ^M in text editors).

Python scripts that process text files don’t normally have to care, because the file object automatically maps the DOS \r\n sequence to a single \n. It works like this by default—when scripts are run on Windows:

  • For files opened in text mode, \r\n is translated to \n when input.

  • For files opened in text mode, \n is translated to \r\n when output.

  • For files opened in binary mode, no translation occurs on input or output.

On Unix-like platforms, no translations occur, because \n is used in files. You should keep in mind two important consequences of these rules. First, the end-of-line character for text-mode files is almost always represented as a single \n within Python scripts, regardless of how it is stored in external files on the underlying platform. By mapping to and from \n on input and output, Python hides the platform-specific difference.

The second consequence of the mapping is subtler: when processing binary files, binary open modes (e.g., rb, wb) effectively turn off line-end translations. If they did not, the translations listed previously could very well corrupt data as it is input or output—a random \r in data might be dropped on input, or added for a \n in the data on output. The net effect is that your binary data would be trashed when read and written—probably not quite what you want for your audio files and images!

This issue has become almost secondary in Python 3.X, because we generally cannot use binary data with text-mode files anyhow—because text-mode files automatically apply Unicode encodings to content, transfers will generally fail when the data cannot be decoded on input or encoded on output. Using binary mode avoids Unicode errors, and automatically disables line-end translations as well (Unicode errors can be caught in try statements as well). Still, the fact that binary mode prevents end-of-line translations to protect file content is best noted as a separate feature, especially if you work in an ASCII-only world where Unicode encoding issues are irrelevant.

Here’s the end-of-line translation at work in Python 3.1 on Windows—text mode translates to and from the platform-specific line-end sequence so our scripts are portable:

>>> open('temp.txt', 'w').write('shrubbery\n')   # text output mode: \n -> \r\n
10
>>> open('temp.txt', 'rb').read()                # binary input: actual file bytes
b'shrubbery\r\n'
>>> open('temp.txt', 'r').read()                 # text input mode: \r\n -> \n
'shrubbery\n'

By contrast, writing data in binary mode prevents all translations as expected, even if the data happens to contain bytes that are part of line-ends in text mode (byte strings print their characters as ASCII if printable, else as hexadecimal escapes):

>>> data = b'a\0b\rc\r\nd'                       # 4 escape code bytes, 4 normal
>>> len(data)
8
>>> open('temp.bin', 'wb').write(data)           # write binary data to file as is
8
>>> open('temp.bin', 'rb').read()                # read as binary: no translation
b'a\x00b\rc\r\nd'

But reading binary data in text mode, whether accidental or not, can corrupt the data when transferred because of line-end translations (assuming it passes as decodable at all; ASCII bytes like these do on this Windows platform):

>>> open('temp.bin', 'r').read()                 # text mode read: botches \r !
'a\x00b\nc\nd'

Similarly, writing binary data in text mode can have the same effect—line-end bytes may be changed or inserted (again, assuming the data is encodable per the platform’s default):

>>> open('temp.bin', 'w').write(data)            # must pass str for text mode
TypeError: must be str, not bytes
>>> data.decode()                                # use bytes.decode() for to-str
'a\x00b\rc\r\nd'

>>> open('temp.bin', 'w').write(data.decode())
8
>>> open('temp.bin', 'rb').read()                # text mode write: added \r !
b'a\x00b\rc\r\r\nd'

>>> open('temp.bin', 'r').read()                 # again drops, alters \r on input
'a\x00b\nc\n\nd'

The short story to remember here is that you should generally use \n to refer to end-line in all your text file content, and you should always open binary data in binary file modes to suppress both end-of-line translations and any Unicode encodings. A file’s content generally determines its open mode, and file open modes usually process file content exactly as we want.

Keep in mind, though, that you might also need to use binary file modes for text in special contexts. For instance, in Chapter 6’s examples, we’ll sometimes open text files in binary mode to avoid possible Unicode decoding errors, for files generated on arbitrary platforms that may have been encoded in arbitrary ways. Doing so avoids encoding errors, but also can mean that some text might not work as expected—searches might not always be accurate when applied to such raw text, since the search key must be a byte string, formatted and encoded according to a specific and possibly incompatible encoding scheme.

In Chapter 11’s PyEdit, we’ll also need to catch Unicode exceptions in a “grep” directory file search utility, and we’ll go further to allow Unicode encodings to be specified for file content across entire trees. Moreover, a script that attempts to translate between different platforms’ end-of-line character conventions explicitly may need to read text in binary mode to retain the original line-end representation truly present in the file; in text mode, they would already be translated to \n by the time they reached the script.

It’s also possible to disable or further tailor end-of-line translations in text mode with additional open arguments we will finesse here. See the newline argument in open reference documentation for details; in short, passing an empty string to this argument also prevents line-end translation but retains other text-mode behavior.
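
For instance, here is a brief hedged sketch of the newline argument at work on Windows, with a throwaway file:

>>> open('temp.txt', 'w', newline='').write('shrubbery\r\n')    # no translation out
11
>>> open('temp.txt', 'rb').read()                               # stored verbatim
b'shrubbery\r\n'
>>> open('temp.txt', 'r', newline='').read()                    # none back in either
'shrubbery\r\n'

For this chapter, let’s turn next to two common use cases for binary data files: packed binary data and random access.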

Parsing packed binary data with the struct module

By using the letter b in the open call, you can open binary datafiles in a platform-neutral way and read and write their content with normal file object methods. But how do you process binary data once it has been read? It will be returned to your script as a simple string of bytes, most of which are probably not printable characters.

If you just need to pass binary data along to another file or program, your work is done—for instance, simply pass the byte string to another file opened in binary mode. And if you just need to extract a number of bytes from a specific position, string slicing will do the job; you can even follow up with bitwise operations if you need to. To get at the contents of binary data in a structured way, though, as well as to construct its contents, the standard library struct module is a more powerful alternative.

The struct module provides calls to pack and unpack binary data, as though the data was laid out in a C-language struct declaration. It is also capable of composing and decomposing using any endian-ness you desire (endian-ness determines whether the most significant bytes of binary numbers are stored first or last). Building a binary datafile, for instance, is straightforward—pack Python values into a byte string and write them to a file. The format string here in the pack call means big-endian (>), with an integer, four-character string, half integer, and floating-point number:

>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()

Notice how the struct module returns a bytes string: we’re in the realm of binary data here, not text, and must use binary mode files to store. As usual, Python displays most of the packed binary data’s bytes here with \xNN hexadecimal escape sequences, because the bytes are not printable characters. To parse data like that which we just produced, read it off the file and pass it to the struct module with the same format string—you get back a tuple containing the values parsed out of the string and converted to Python objects:

>>> import struct
>>> file   = open('data.bin', 'rb')
>>> data   =
>>> values = struct.unpack('>i4shf', data)
>>> values
(2, b'spam', 3, 1.2339999675750732)

Parsed-out strings are byte strings again, and we can apply string and bitwise operations to probe deeper:

>>> bin(values[0] | 0b1)                            # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)

Also note that slicing comes in handy in this domain; to grab just the four-character string in the middle of the packed binary data we just read, we can simply slice it out. Numeric values could similarly be sliced out and then passed to struct.unpack for conversion:

>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> data[4:8]                                       # grab the 4-char string
b'spam'

>>> number = data[8:10]                             # grab the 2-byte integer
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)

Packed binary data crops up in many contexts, including some networking tasks, and in data produced by other programming languages. Because it’s not part of every programming job’s description, though, we’ll defer to the struct module’s entry in the Python library manual for more details.

Random access files

Binary files also typically see action in random access processing. Earlier, we mentioned that adding a + to the open mode string allows a file to be both read and written. This mode is typically used in conjunction with the file object’s seek method to support random read/write access. Such flexible file processing modes allow us to read bytes from one location, write to another, and so on. When scripts combine this with binary file modes, they may fetch and update arbitrary bytes within a file.

We used seek earlier to rewind files instead of closing and reopening. As mentioned, read and write operations always take place at the current position in the file; files normally start at offset 0 when opened and advance as data is transferred. The seek call lets us move to a new position for the next transfer operation by passing in a byte offset.

Python’s seek method also accepts an optional second argument that has one of three values—0 for absolute file positioning (the default); 1 to seek relative to the current position; and 2 to seek relative to the file’s end. That’s why passing just an offset of 0 to seek is roughly a file rewind operation: it repositions the file to its absolute start. In general, seek supports random access on a byte-offset basis. Seeking to a multiple of a record’s size in a binary file, for instance, allows us to fetch a record by its relative position.
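
As a quick hedged sketch of the three forms, using an invented 8-byte demo file in binary mode:

>>> f = open('demo.bin', 'wb')
>>> f.write(b'XXXXYYYY')
8
>>> f.close()
>>> f = open('demo.bin', 'rb')
>>> f.seek(0, 2)                  # 2 = from the end: returns new position (size)
8
>>> f.seek(-4, 2)                 # four bytes back from the end
4
>>>                    # the last four bytes
b'YYYY'
>>> f.seek(-4, 1)                 # 1 = relative to the current position
4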

Although you can use seek without + modes in open (e.g., to just read from random locations), it’s most flexible when combined with input/output files. And while you can perform random access in text mode, too, the fact that text modes perform Unicode encodings and line-end translations makes them difficult to use when absolute byte offsets and lengths are required for seeks and reads—your data may look very different when stored in files. Text mode may also make your data nonportable to platforms with different default encodings, unless you’re willing to always specify an explicit encoding for opens. Except for simple unencoded ASCII text without line-ends, seek tends to work best with binary mode files.

To demonstrate, let’s create a file in w+b mode (equivalent to wb+) and write some data to it; this mode allows us to both read and write, but initializes the file to be empty if it’s already present (all w modes do). After writing some data, we seek back to file start to read its content (some integer return values are omitted in this example again for brevity):

>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']

>>> file = open('random.bin', 'w+b')
>>> for rec in records:                                   # write four records
...     size = file.write(rec)                            # bytes for binary mode
>>> file.flush()
>>> pos =                                    # rewind to start of file
>>> print(                                   # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'

Now, let’s reopen our file in r+b mode; this mode allows both reads and writes again, but does not initialize the file to be empty. This time, we seek and read in multiples of the size of data items (“records”) stored, to both fetch and update them at random:

c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(                                    # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'

>>> record = b'X' * 8
>>>                                            # update first record
>>> file.write(record)
>>> * 2)                             # update third record
>>> file.write(b'Y' * 8)

>>>;               # fetch second record
b'pppppppp'
>>>                                   # fetch next (third) record
b'YYYYYYYY'

>>>;                         # read entire file
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'
>>> file.close()

c:\temp> type random.bin                         # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm

Finally, keep in mind that seek can be used to achieve random access, even if it’s just for input. The following seeks in multiples of record size to read (but not write) fixed-length records at random. Notice that it also uses r text mode: since this data is simple ASCII text bytes and has no line-ends, text and binary modes work the same on this platform:

c:\temp> python
>>> file = open('random.bin', 'r')        # text mode ok if no encoding/endlines
>>> reclen = 8
>>> * 3)                 # fetch record 4
>>>
'mmmmmmmm'
>>> * 1)                 # fetch record 2
>>>
'pppppppp'

>>> file = open('random.bin', 'rb')       # binary mode works the same here
>>> * 2)                 # fetch record 3
>>>                       # returns byte strings
b'YYYYYYYY'

But unless your file’s content is always a simple unencoded text form like ASCII and has no translated line-ends, text mode should not generally be used if you are going to seek—line-ends may be translated on Windows and Unicode encodings may make arbitrary transformations, both of which can make absolute seek offsets difficult to use. In the following, for example, the positions of characters after the first non-ASCII character no longer match between the string in Python and its encoded representation in the file:

>>> data = 'sp\xe4m'                                 # data to your script
>>> data, len(data)                                  # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))    # bytes written to file
(b'sp\xc3\xa4m', 5)

>>> f = open('test', mode='w+', encoding='utf8')     # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0);                           # ascii bytes work
'sp'
>>> f.seek(2);                           # as does 2-byte nonascii
'ä'
>>> data[3]                                          # but offset 3 is not 'm' !
'm'
>>> f.seek(3);
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte

As you can see, Python’s file modes provide flexible file processing for programs that require it. In fact, the os module offers even more file processing options, as the next section describes.

Lower-Level File Tools in the os Module

The os module contains an additional set of file-processing functions that are distinct from the built-in file object tools demonstrated in previous examples. For instance, here is a partial list of os file-related calls:

os.open( path, flags, mode )

Opens a file and returns its descriptor

os.read( descriptor, N )

Reads at most N bytes and returns a byte string

os.write( descriptor, string )

Writes bytes in byte string string to the file

os.lseek( descriptor, position, how )

Moves to position in the file

Technically, os calls process files by their descriptors, which are integer codes or “handles” that identify files in the operating system. Descriptor-based files deal in raw bytes, and have no notion of the line-end or Unicode translations for text that we studied in the prior section. In fact, apart from extras like buffering, descriptor-based files generally correspond to binary mode file objects, and we similarly read and write bytes strings, not str strings. However, because the descriptor-based file tools in os are lower level and more complex than the built-in file objects created with the built-in open function, you should generally use the latter for all but very special file-processing needs.[9]

Using files

To give you the general flavor of this tool set, though, let’s run a few interactive experiments. Although built-in file objects and os module descriptor files are processed with distinct tool sets, they are in fact related—the file system used by file objects simply adds a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor associated with a built-in file object. For instance, the standard stream file objects have descriptors 0, 1, and 2; calling the os.write function to send data to stdout by descriptor has the same effect as calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2

>>> sys.stdout.write('Hello stdio world\n')        # write via file method
Hello stdio world
>>> import os
>>> os.write(1, b'Hello descriptor world\n')       # write via os module
Hello descriptor world

Because file objects we open explicitly behave the same way, it’s also possible to process a given real external file on the underlying computer through the built-in open function, tools in the os module, or both (some integer return values are omitted here for brevity):

>>> file = open(r'C:\temp\spam.txt', 'w')       # create external file, object
>>> file.write('Hello stdio file\n')            # write via file object method
>>> file.flush()                                # else os.write to disk first!
>>> fd = file.fileno()                          # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')    # write via os module
>>> file.close()

C:\temp> type spam.txt                          # lines from both schemes
Hello stdio file
Hello descriptor file

 mode flags

So why the extra file tools in os? In short, they give more low-level control over file processing. The built-in open function is easy to use, but it may be limited by the underlying filesystem that it uses, and it adds extra behavior that we do not want. The os module lets scripts be more specific—for example, the following opens a descriptor-based file in read-write and binary modes by performing a binary “or” on two mode flags exported by os:

>>> fdfile ='C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>>, 20)
b'Hello stdio file\r\nHe'

>>> os.lseek(fdfile, 0, 0)                        # go back to start of file
>>>, 100)                 # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'

>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')                    # overwrite first 5 bytes

C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file

In this case, binary mode strings rb+ and r+b in the basic open call are equivalent:

>>> file = open(r'C:\temp\spam.txt', 'rb+')       # same but with open/objects
>>>
b'HELLO stdio file\r\nHe'
>>>;
b'HELLO stdio file\r\nHello descriptor file\n'
>>>
>>> file.write(b'Jello')
>>>;
b'Jello stdio file\r\nHello descriptor file\n'

But on some systems, flags let us specify more advanced things like exclusive access (O_EXCL) and nonblocking modes (O_NONBLOCK) when a file is opened. Some of these flags are not portable across platforms (another reason to use built-in file objects most of the time); see the library manual or run a dir(os) call on your machine for an exhaustive list of other open flags available.

One final note here: using with the O_EXCL flag is the most portable way to lock files for concurrent updates or other process synchronization in Python today. We’ll see contexts where this can matter in the next chapter, when we begin to explore multiprocessing tools. Programs running in parallel on a server machine, for instance, may need to lock files before performing updates, if multiple threads or processes might attempt such updates at the same time.
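Here is a rough sketch of the technique; the lock-file name and the retry policy are assumptions for illustration, not a canonical recipe:

import os, time, errno

def acquire_lock(name='app.lock'):                # hypothetical lock-file name
    while True:
        try:                                      # atomic create-or-fail
            return, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except OSError as exc:
            if exc.errno != errno.EEXIST:         # real error: propagate it
                raise
            time.sleep(0.1)                       # lock held: wait and retry

def release_lock(fd, name='app.lock'):
    os.close(fd)                                  # close the descriptor, then
    os.remove(name)                               # remove so others may proceed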

Wrapping descriptors in file objects

We saw earlier how to go from file object to file descriptor with the fileno file object method; given a descriptor, we can use os module tools for lower-level file access to the underlying file. We can also go the other way—the os.fdopen call wraps a file descriptor in a file object. Because conversions work both ways, we can generally use either tool set—file object or os module:

>>> fdfile ='C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>>
b'Jello stdio file\r\nHello descriptor file\n'

In fact, we can wrap a file descriptor in either a binary or text-mode file object: in text mode, reads and writes perform the Unicode encodings and line-end translations we studied earlier and deal in str strings instead of bytes:

C:\...\PP4E\System> python
>>> import os
>>> fdfile ='C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>>
'Jello stdio file\nHello descriptor file\n'

In Python 3.X, the built-in open call also accepts a file descriptor instead of a file name string; in this mode it works much like os.fdopen, but gives you greater control—for example, you can use additional arguments to specify a nondefault Unicode encoding for text and suppress the default descriptor close. Really, though, os.fdopen accepts the same extra-control arguments in 3.X, because it has been redefined to do little but call back to the built-in open (see in the standard library):

C:\...\PP4E\System> python
>>> import os
>>> fdfile ='C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>>
'Jello stdio file\nHello descriptor file\n'

>>> os.lseek(fdfile, 0, 0)                           # rewind the descriptor
>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>>
'Jello stdio file\nHello descriptor file\n'

We’ll make use of this file object wrapper technique to simplify text-oriented pipes and other descriptor-like objects later in this book (e.g., sockets have a makefile method which achieves similar effects).
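To give you a taste of what this enables now, here is a minimal sketch that wraps one end of a pipe; pipes themselves are covered properly in the next chapter:

import os

readfd, writefd = os.pipe()              # two descriptors: read and write ends
os.write(writefd, b'spam\n')             # raw bytes via the os module
os.close(writefd)                        # signal end-of-data to the reader
readobj = os.fdopen(readfd, 'r')         # wrap read end in text-mode file object
print(readobj.readline())                # prints 'spam': readline gives 'spam\n'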

Other os module file tools

The os module also includes an assortment of file tools that accept a file pathname string and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file’s owner and permission settings (os.chown, os.chmod). Let’s step through a few examples of these tools in action:

>>> os.chmod('spam.txt', 0o777)          # enable all accesses

This os.chmod file permissions call passes a 9-bit string composed of three sets of three bits each. From left to right, the three sets represent the file’s owning user, the file’s group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is “1” in this string, it means that the corresponding operation is allowed for the accessor. For instance, octal 0777 is a string of nine “1” bits in binary, so it enables all three kinds of accesses for all three user groups; octal 0600 means that the file can be read and written only by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).
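If raw octal is hard on the eyes, the stat module’s symbolic constants spell out the same bits; a quick sketch, reusing the file from the preceding call:

>>> import os, stat
>>> oct(stat.S_IRUSR | stat.S_IWUSR)                     # owner read+write bits
'0o600'
>>> os.chmod('spam.txt', stat.S_IRUSR | stat.S_IWUSR)    # same as 0o600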

This scheme stems from Unix file permission settings, but the call works on Windows as well. If it’s puzzling, see your system’s documentation (e.g., a Unix manpage) for chmod. Moving on:

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')      # from, to

>>> os.remove(r'C:\temp\spam.txt')                           # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'

>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file’s name; the os.remove file deletion call deletes a file from your system and is synonymous with os.unlink (the latter reflects the call’s name on Unix but was obscure to users of other platforms).[10] The os module also exports the stat system call:

>>> open('spam.txt', 'w').write('Hello stat world\n')        # +1 for \r added
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)

>>> info.st_mode, info.st_size                  # via named-tuple item attr names
(33206, 18)

>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]      # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)

The os.stat call returns a tuple of values (really, in 3.X a special kind of tuple with named items) giving low-level information about the named file, and the stat module exports constants and functions for querying this information in a portable way. For instance, indexing an os.stat result on offset stat.ST_SIZE returns the file’s size, and calling stat.S_ISDIR with the mode item from an os.stat result checks whether the file is a directory. As shown earlier, though, both of these operations are available in the os.path module, too, so it’s rarely necessary to use os.stat except for low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)
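One place where the raw stat result still earns its keep is timestamps; the following sketch formats the file’s modification time with the time module (the date shown is illustrative and will reflect your own file and time zone):

>>> import os, time
>>> mtime = os.stat(r'C:\temp\spam.txt').st_mtime      # seconds since the epoch
>>> time.asctime(time.localtime(mtime))                # formatted for local time
'Wed Mar  3 13:54:32 2010'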

File Scanners

Before we leave our file tools survey, it’s time for something that performs a more tangible task and illustrates some of what we’ve learned so far. Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 4-1. PP4E\System\Filetools\

def scanner(name, function):
    file = open(name, 'r')               # create a file object
    while True:
        line = file.readline()           # call file methods
        if not line: break               # until end-of-file
        function(line)                   # call a function object
    file.close()                         # close file on the way out

The scanner function doesn’t care what line-processing function is passed in, and that accounts for most of its generality—it is happy to apply any single-argument function that exists now or in the future to all of the lines in a text file. If we code this module and put it in a directory on the module search path, we can use it any time we need to step through a file line by line. Example 4-2 is a client script that does simple line translations.

Example 4-2. PP4E\System\Filetools\

from sys import argv
from scanfile import scanner
class UnknownCommand(Exception): pass

def processLine(line):                      # define a function
    if line[0] == '*':                      # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])            # strip first and last char: \n
    else:
        raise UnknownCommand(line)          # unknown prefix: raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]       # allow filename cmd arg
scanner(filename, processLine)              # start the scanner

The text file hillbillies.txt contains the following lines:

*Granny
+Jethro
*Elly May
+"Uncle Jed"

and our commands script could be run as follows:

C:\...\PP4E\System\Filetools> python hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"

This works, but there are a variety of coding alternatives for both files, some of which may be better than those listed above. For instance, we could also code the command processor of Example 4-2 in the following way; especially if the number of command options starts to become large, such a data-driven approach may be more concise and easier to maintain than a large if statement with essentially redundant actions (if you ever have to change the way output lines print, you’ll have to change it in only one place with this form):

commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)

The scanner could similarly be improved. As a rule of thumb, we can also usually speed things up by shifting processing from Python code to built-in tools. For instance, if we’re concerned with speed, we can probably make our file scanner faster by using the file’s line iterator to step through the file instead of the manual readline loop in Example 4-1 (though you’d have to time this with your Python to be sure):

def scanner(name, function):
    for line in open(name, 'r'):         # scan line by line
        function(line)                   # call a function object

And we can work more magic in Example 4-1 with iteration tools like the map built-in function, the list comprehension expression, and the generator expression. Here are three minimalist’s versions; the for loop is replaced by map or a comprehension, and we let Python close the file for us when it is garbage collected or the script exits (these all build a temporary list of results along the way to run through their iterations, but this overhead is likely trivial for all but the largest of files):

def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))
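If scanner speed really matters for your files and Python, the timeit module can arbitrate; this is a minimal sketch, with an assumed test file and an arbitrary repetition count:

import timeit

setup = 'from scanfile import scanner'                  # Example 4-1's module
stmt  = 'scanner("data.txt", lambda line: None)'        # assumed test file
print(timeit.timeit(stmt, setup, number=1000))          # seconds for 1,000 scans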

File filters

The preceding works as planned, but what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 4-3. PP4E\System\Filetools\

import sys

def filter_files(name, function):         # filter file through function
    input  = open(name, 'r')              # create file objects
    output = open(name + '.out', 'w')     # explicit output file too
    for line in input:
        output.write(function(line))      # write the modified line
    output.close()                        # output has a '.out' suffix

def filter_stream(function):              # no explicit files
    while True:                           # use standard streams
        line = sys.stdin.readline()       # or: input()
        if not line: break
        print(function(line), end='')     # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)      # copy stdin to stdout if run

Notice that the newer context managers feature discussed earlier could save us a few lines here in the file-based filter of Example 4-3, and also guarantee immediate file closures if the processing function fails with an exception:

def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))      # write the modified line

And again, file object line iterators could simplify the stream-based filter’s code in this example as well:

def filter_stream(function):
    for line in sys.stdin:                    # read by lines automatically
        print(function(line), end='')

Since the standard streams are preopened for us, they’re often easier to use. When run standalone, the module simply parrots stdin to stdout:

C:\...\PP4E\System\Filetools> < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"

But this module is also useful when imported as a library (clients provide the line-processing function):

>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"

We’ll see files in action often in the remainder of this book, especially in the more complete and functional system examples of Chapter 6. First though, we turn to tools for processing our files’ home.

[9] For instance, to process pipes, described in Chapter 5. The Python os.pipe call returns two file descriptors, which can be processed with os module file tools or wrapped in a file object with os.fdopen. When used with descriptor-based file tools in os, pipes deal in byte strings, not text. Some device files may require lower-level control as well.

[10] For related tools, see also the shutil module in Python’s standard library; it has higher-level tools for copying and removing files and more. We’ll also write directory compare, copy, and search tools of our own in Chapter 6, after we’ve had a chance to study the directory tools presented later in this chapter.
