Chapter 4. Persistence: Saving data to files

It is truly great to be able to process your file-based data. But what happens to your data when you’re done? Of course, it’s best to save your data to a disk file, which allows you to use it again at some later date and time. Taking your memory-based data and storing it to disk is what persistence is all about. Python supports all the usual tools for writing to files and also provides some cool facilities for efficiently storing Python data. So...flip the page and let’s get started learning them.

Programs produce data

It’s a rare program that reads data from a disk file, processes the data, and then throws away the processed data. Typically, programs save the data they process, display their output on screen, or transfer data over a network.

Before you learn what’s involved in writing data to disk, let’s process the data from the previous chapter to work out who said what to whom.

When that’s done, you’ll have something worth saving.

Code Magnets

Add the code magnets at the bottom of this page to your existing code to satisfy the following requirements:

Create an empty list called man.
Create an empty list called other.
Add a line of code to remove unwanted whitespace from the line_spoken variable.
Provide the conditions and code to add line_spoken to the correct list based on the value of role.
Print each of the lists (man and other) to the screen.

____________
____________
try:
    data = open('sketch.txt')
    for each_line in data:
        try:
            (role, line_spoken) = each_line.split(':', 1)
            _____________________________________________
            _____________________
                   __________________________________
            _____________________
                   __________________________________
        except ValueError:
            pass
    data.close()
except IOError:
    print('The datafile is missing!')

____________
____________
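
Here's one way the finished code might look once the magnets are in place. Treat it as a sketch: the role names 'Man' and 'Other Man' are assumptions about what appears before the colon in sketch.txt, and the first few lines only build a tiny stand-in sketch.txt in a scratch folder (using the file argument to print(), which is covered in the next section) so that the code runs anywhere.

```python
import os
import tempfile

# Stand-in data (hypothetical): a tiny sketch.txt in a scratch folder.
sketch_path = os.path.join(tempfile.mkdtemp(), 'sketch.txt')
sample = open(sketch_path, 'w')
print('Man: Is this the right room for an argument?', file=sample)
print("Other Man: I've told you once.", file=sample)
print('(pause)', file=sample)    # no colon, so the split below raises ValueError
print("Man: No you haven't!", file=sample)
sample.close()

man = []
other = []

try:
    data = open(sketch_path)
    for each_line in data:
        try:
            (role, line_spoken) = each_line.split(':', 1)
            line_spoken = line_spoken.strip()   # remove unwanted whitespace
            if role == 'Man':
                man.append(line_spoken)
            elif role == 'Other Man':
                other.append(line_spoken)
        except ValueError:
            pass
    data.close()
except IOError:
    print('The datafile is missing!')

print(man)
print(other)
```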

Open your file in write mode

When you use the open() BIF to work with a disk file, you can specify an access mode to use. By default, open() uses mode r for reading, so you don’t need to specify it. To open a file for writing, use mode w:

By default, the print() BIF uses standard output (usually the screen) when displaying data. To write data to a file instead, use the file argument to specify the data file object to use:

When you’re done, be sure to close the file to ensure all of your data is written to disk. This is known as flushing and is very important:
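
Putting those three steps together, a minimal sketch might look like this. The filename data.out and the scratch folder are assumptions made so that the example is safe to run; any writable path works.

```python
import os
import tempfile

# Hypothetical output location: a scratch folder, so nothing real is clobbered.
out_path = os.path.join(tempfile.mkdtemp(), 'data.out')

out = open(out_path, 'w')                         # mode 'w' opens for writing
print('Norwegian Blues stun easily.', file=out)   # file= sends the text to the file
out.close()                                       # flush and close

# Read the file back to confirm that the data made it to disk.
check = open(out_path)
saved_text = check.read()
check.close()
print(saved_text, end='')
```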

Geek Bits

When you use access mode w, Python opens your named file for writing. If the file already exists, it is cleared of its contents, or clobbered. To append to a file, use access mode a. To open an existing file for both reading and writing without clobbering it, use r+; mode w+ also allows reading and writing, but it clobbers the file first. If you try to open a file for writing (with w or a) that does not already exist, it is first created for you, and then opened for writing.
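
If you'd like to see clobbering and appending side by side, here's a small experiment. The filename modes.txt is made up for the example, and it lives in a scratch folder so none of your real files are touched.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')   # hypothetical file

out = open(path, 'w')        # creates the file
print('first', file=out)
out.close()

out = open(path, 'w')        # 'w' again: the earlier contents are clobbered
print('second', file=out)
out.close()

out = open(path, 'a')        # 'a' keeps what is there and adds to the end
print('third', file=out)
out.close()

check = open(path)
contents = check.read()
check.close()
print(contents, end='')
```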

Brain Power

Consider the following carefully: what happens to your data files if the second call to print() in your code causes an IOError?

Files are left open after an exception!

When all you ever do is read data from files, getting an IOError is annoying, but rarely dangerous, because your data is still in your file, even though you might be having trouble getting at it.

It’s a different story when writing data to files: if you need to handle an IOError before a file is closed, your written data might become corrupted and there’s no way of telling until after it has happened.

Your exception-handling code is doing its job, but you now have a situation where your data could potentially be corrupted, which can’t be good.

What’s needed here is something that lets you run some code regardless of whether an IOError has occurred. In the context of your code, you’ll want to make sure the files are closed no matter what.

Extend try with finally

When you have a situation where code must always run no matter what errors occur, add that code to your try statement’s finally suite:

If no runtime errors occur, any code in the finally suite executes. Equally, if an IOError occurs, the except suite executes and then the finally suite runs.

No matter what, the code in the finally suite always runs.
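
As a rough sketch of the pattern, here is how the file-writing code could be arranged. The sample lists and the scratch folder are assumptions that stand in for your real data and your working folder.

```python
import os
import tempfile

# Sample data standing in for the lists built earlier in the chapter.
man = ['Is this the right room for an argument?']
other = ["I've told you once."]
folder = tempfile.mkdtemp()          # scratch folder (an assumption)

try:
    man_file = open(os.path.join(folder, 'man_data.txt'), 'w')
    other_file = open(os.path.join(folder, 'other_data.txt'), 'w')
    print(man, file=man_file)
    print(other, file=other_file)
except IOError:
    print('File error.')
finally:
    # Runs whether or not an IOError occurred.
    man_file.close()
    other_file.close()
```

Note that this version still assumes both calls to open() succeed; if one fails, there is no file object to close.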

By moving your file closing code into your finally suite, you are reducing the possibility of data corruption errors.

This is a big improvement, because you’re now ensuring that files are closed properly (even when write errors occur).

But what about those errors?

How do you find out the specifics of the error?

There are no Dumb Questions

Q: I’m intrigued. When you stripped the line_spoken data of unwanted whitespace, you assigned the result back to the line_spoken variable. Surely invoking the strip() method on line_spoken changed the string it refers to?
A: No, that’s not what happens. Strings in Python are immutable, which means that once a string is created, it cannot be changed.
Q: But you did change the line_spoken string by removing any unwanted whitespace, right?
A: Yes and no. What actually happens is that invoking the strip() method on the line_spoken string creates a new string with leading and trailing whitespace removed. The new string is then assigned to line_spoken, replacing the data that was referred to before. In effect, it is as if you changed line_spoken, when you’ve actually completely replaced the data it refers to.
Q: So what happens to the replaced data?
A: Python’s built-in memory management technology reclaims the RAM it was using and makes it available to your program. That is, unless some other Python data object is also referring to the string.
Q: What? I don’t get it.
A: It is conceivable that another data object is referring to the string referred to by line_spoken. For example, let’s assume you have some code that contains two variables that refer to the same string, namely “Flying Circus.” You then decide that one of the variables needs to be in all UPPERCASE, so you invoke the upper() method on it. The Python interpreter takes a copy of the string, converts it to uppercase, and returns it to you. You can then assign the uppercase data back to the variable that used to refer to the original data.
Q: And the original data cannot change, because there’s another variable referring to it?
A: Precisely. That’s why strings are immutable, because you never know what other variables are referring to any particular string.
Q: But surely Python can work out how many variables are referring to any one particular string?
A: It does, but only for the purposes of garbage collection. If you have a line of code like print('Flying Circus'), the string is not referred to by a variable (so any variable reference counting that’s going on isn’t going to count it) but is still a valid string object (which might be referred to by a variable) and it cannot have its data changed under any circumstances.
Q: So Python variables don’t actually contain the data assigned to them?
A: That’s correct. Python variables contain a reference to a data object. The data object contains the data and, because you can conceivably have a string object used in many different places throughout your code, it is safest to make all strings immutable so that no nasty side effects occur.
Q: Isn’t it a huge pain not being able to adjust strings “in place”?
A: No, not really. Once you get used to how strings work, it becomes less of an issue. In practice, you’ll find that this issue rarely trips you up.
Q: Are any other Python data types immutable?
A: Yes, a few. There’s the tuple, which is an immutable list. Also, all of the number types are immutable.
Q: Other than learning which is which, how will I know when something is immutable?
A: Don’t worry: you’ll know. If you try to change an immutable value, Python raises a TypeError exception.
Q: Of course: an exception occurs. They’re everywhere in Python, aren’t they?
A: Yes. Exceptions make the world go ’round.
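
The answers above can be tried out in a few lines. This is only an illustration; the names also_refers, changed_in_place, and upper_copy are invented for the example.

```python
line_spoken = '  Flying Circus  '
also_refers = line_spoken             # a second name for the same string object

line_spoken = line_spoken.strip()     # strip() builds a NEW string

print(repr(line_spoken))              # the new, stripped string
print(repr(also_refers))              # the original string, untouched

changed_in_place = True
try:
    also_refers[0] = 'f'              # try to change the string itself
except TypeError:
    changed_in_place = False          # Python refuses: strings are immutable

upper_copy = also_refers.upper()      # upper() also returns a new string
```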

Knowing the type of error is not enough

When a file I/O error occurs, your code displays a generic “File Error” message. This is too generic. How do you know what actually happened?

Who knows?

It turns out that the Python interpreter knows...and it will give up the details if only you’d ask.

When an error occurs at runtime, Python raises an exception of the specific type (such as IOError, ValueError, and so on). Additionally, Python creates an exception object that is passed as an argument to your except suite.

Let’s use IDLE to see how this works.

An IDLE Session

Let’s see what happens when you try to open a file that doesn’t exist, such as a disk file called missing.txt. Enter the following code at IDLE’s shell:

As the file doesn’t exist, the data file object wasn’t created, which subsequently makes it impossible to call the close() method on it, so you end up with a NameError. A quick fix is to add a small test to the finally suite to see if the data name exists before you try to call close(). The locals() BIF returns a collection of names defined in the current scope. Let’s exploit this BIF to only invoke close() when it is safe to do so:

Here you’re searching the collection returned by the locals() BIF for the string data. If you find it, you can assume the file was opened successfully and safely call the close() method.
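
A sketch of that session might look like the following. It assumes there is no file called missing.txt in your current folder; the outcome variable is an extra, added only to record what happened.

```python
outcome = 'ok'        # hypothetical helper, not part of the pattern itself

try:
    data = open('missing.txt')           # assumed NOT to exist
    print(data.readline(), end='')
except IOError:
    outcome = 'File error'
    print(outcome)
finally:
    # Only close the file if the name "data" was actually created.
    if 'data' in locals():
        data.close()
```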

If some other error occurs (perhaps something awful happens when your code calls the print() BIF), your exception-handling code catches the error, displays your “File error” message and, finally, closes any opened file.

But you still are none the wiser as to what actually caused the error.

When an exception is raised and handled by your except suite, the Python interpreter passes an exception object into the suite. A small change makes this exception object available to your code as an identifier:

But when you try to run your code with this change made, another exception is raised:

This time your error message didn’t appear at all. It turns out exception objects and strings are not compatible types, so trying to concatenate one with the other leads to problems. You can convert (or cast) one to the other using the str() BIF:

Now, with this final change, your code is behaving exactly as expected:
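
Pulling the last few steps together, the final version could be sketched like this. As before, it assumes missing.txt does not exist in your current folder; the message variable is there only so the text can be inspected afterward.

```python
message = ''

try:
    data = open('missing.txt')           # assumed NOT to exist
    print(data.readline(), end='')
except IOError as err:                   # "as err" names the exception object
    # str() converts the exception object so it can be joined to a string.
    message = 'File error: ' + str(err)
    print(message)
finally:
    if 'data' in locals():
        data.close()
```

Leaving out the str() call is what triggers the TypeError described above, because + cannot join a string and an exception object.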

Of course, all this extra logic is starting to obscure the real meaning of your code.

Use with to work with files

Because the use of the try/except/finally pattern is so common when it comes to working with files, Python includes a statement that abstracts away some of the details. The with statement, when used with files, can dramatically reduce the amount of code you have to write, because it negates the need to include a finally suite to handle the closing of a potentially opened data file. Take a look:

When you use with, you no longer have to worry about closing any opened files, as the Python interpreter automatically takes care of this for you. The with code on the right is identical in function to that on the left. At Head First Labs, we know which approach we prefer.
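
For comparison with the try/except/finally version, here's a sketch of the same file-writing code using with. The sample lists and the scratch folder are assumptions so that the example can run on its own.

```python
import os
import tempfile

# Sample data standing in for the lists built earlier in the chapter.
man = ['Is this the right room for an argument?']
other = ["I've told you once."]

folder = tempfile.mkdtemp()                        # scratch folder (an assumption)
man_path = os.path.join(folder, 'man_data.txt')
other_path = os.path.join(folder, 'other_data.txt')

try:
    with open(man_path, 'w') as man_file, open(other_path, 'w') as other_file:
        print(man, file=man_file)
        print(other, file=other_file)
except IOError as err:
    print('File error: ' + str(err))

# No finally suite needed: both files are closed once the with suite ends.
print(man_file.closed, other_file.closed)
```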

Geek Bits

The with statement takes advantage of a Python technology called the context management protocol.

Test Drive

Add your with code to your program, and let’s confirm that it continues to function as expected. Delete the two data files you created with the previous version of your program and then load your newest code into IDLE and give it a spin.

If you check your folder, your two data files should’ve reappeared. Let’s take a closer look at the data file’s contents by opening them in your favorite text editor (or use IDLE).

You’ve saved the lists in two files containing what the Man said and what the Other man said. Your code is smart enough to handle any exceptions that Python or your operating system might throw at it.

Well done. This is really coming along.

Default formats are unsuitable for files

Although your data is now stored in a file, it’s not really in a useful format. Let’s experiment in the IDLE shell to see what impact this can have.

Yikes! It would appear your list is converted to a large string by print() when it is saved. Your experimental code reads a single line of data from the file and gets all of the data as one large chunk of text...so much for your code saving your list data.
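
You can reproduce the problem with a short experiment like this one. The two-item list and the scratch file are stand-ins for your real data, and the name mdf is just a label for the opened file.

```python
import os
import tempfile

man = ['Is this the right room for an argument?', "No you haven't!"]
path = os.path.join(tempfile.mkdtemp(), 'man_data.txt')   # scratch file

with open(path, 'w') as man_file:
    print(man, file=man_file)          # saves the list's printed form

with open(path) as mdf:
    first_line = mdf.readline()        # read a single line back

print(type(first_line))                # a str, not a list
print(first_line, end='')
```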

What are your options for dealing with this problem?

Geek Bits

By default, print() displays your data in a format that mimics how your list data is actually stored by the Python interpreter. The resulting output is not really meant to be processed further... its primary purpose is to show you, the Python programmer, what your list data “looks like” in memory.

Parsing the data in the file is a possibility...although it’s complicated by all those square brackets, quotes, and commas. Writing the required code is doable, but it is a lot of code just to read back in your saved data.

Of course, if the data is in a more easily parseable format, the task would likely be easier, so maybe the second option is worth considering, too?

Brain Power

Can you think of a function you created from earlier in this book that might help here?

Why not modify print_lol()?

Recall your print_lol() function from Chapter 2, which takes any list (or list of lists) and displays it on screen, one line at a time. And nested lists can be indented, if necessary.

This functionality sounds perfect! Here’s your code from the nester.py module (last seen at the end of Chapter 2):
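
As a reminder, the function should look roughly like this. It is reconstructed from the exercise later in this chapter with the blanks left out, so check it against your own copy of nester.py. The movie list at the bottom is sample data only.

```python
def print_lol(the_list, indent=False, level=0):
    """Print each item in a (possibly nested) list on its own line."""
    for each_item in the_list:
        if isinstance(each_item, list):
            # Nested list: recurse, one level deeper.
            print_lol(each_item, indent, level+1)
        else:
            if indent:
                for tab_stop in range(level):
                    print("\t", end='')
            print(each_item)

# Sample data for a quick check.
print_lol(['The Holy Grail', ['Graham Chapman', ['Michael Palin']]], indent=True)
```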

Amending this code to print to a disk file instead of the screen (known as standard output) should be relatively straightforward. You can then save your data in a more usable format.

The Scholar’s Corner

Standard Output: The default place where your code writes its data when the “print()” BIF is used. This is typically the screen. In Python, standard output is referred to as “sys.stdout” and is importable from the Standard Library’s “sys” module.

Exercise

Let’s add a fourth argument to your print_lol() function to identify a place to write your data to. Be sure to give your argument a default value of sys.stdout, so that it continues to write to the screen if no file object is specified when the function is invoked.
Fill in the blanks with the details of your new argument. (Note: to save on space, the comments have been removed from this code, but be sure to update your comments in your nester.py module after you’ve amended your code.)
```
def print_lol(the_list, indent=False, level=0,____________________):

    for each_item in the_list:
        if isinstance(each_item, list):
            print_lol(each_item, indent, level+1,________)
        else:
            if indent:
                for tab_stop in range(level):
                    print("\t", end='',_________________)
            print(each_item,____________________)
```
What needs to happen to the code in your with statement now that your amended print_lol() function is available to you?
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
List the name of the module(s) that you now need to import into your program in order to support your amendments to print_lol().
___________________________________________________________________________
___________________________________________________________________________

Exercise Solution

You were to add a fourth argument to your print_lol() function to identify a place to write your data to, being sure to give your argument a default value of sys.stdout so that it continues to write to the screen if no file object is specified when the function is invoked.
You were to fill in the blanks with the details of your new argument. (Note: to save on space, the comments have been removed from this code, but be sure to update those in your nester.py module after you’ve amended your code).
What needs to happen to the code in your with statement now that your amended print_lol() function is available to you?
List the name of the module(s) that you now need to import into your program in order to support your amendments to print_lol().
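
One way to fill in the blanks is sketched below. The argument name fh matches the invocations used elsewhere in this chapter. The io.StringIO buffer at the end is an in-memory stand-in for an opened file, used here only to check the result.

```python
import io
import sys

def print_lol(the_list, indent=False, level=0, fh=sys.stdout):
    for each_item in the_list:
        if isinstance(each_item, list):
            print_lol(each_item, indent, level+1, fh)   # pass fh along
        else:
            if indent:
                for tab_stop in range(level):
                    print("\t", end='', file=fh)
            print(each_item, file=fh)

# Quick check with an in-memory "file".
buffer = io.StringIO()
print_lol(['a', ['b']], indent=True, fh=buffer)
print(buffer.getvalue(), end='')
```

With that in place, the code in your with statement swaps each print(man, file=man_file) style call for a print_lol() call that passes the file object as fh. Your program needs to import nester, and nester.py itself needs to import sys.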

Test Drive

Before taking your code for a test drive, you need to do the following:

Make the necessary changes to nester and install the amended module into your Python environment (see Chapter 2 for a refresher on this). You might want to upload to PyPI, too.
Amend your program so that it imports nester and uses print_lol() instead of print() within your with statement. Note: your print_lol() invocation should look something like this:
```
nester.print_lol(man, fh=man_file)
```

When you are ready, take your latest program for a test drive and let’s see what happens:

Let’s check the contents of the files to see what they look like now.

This is looking good. By amending your nester module, you’ve provided a facility to save your list data in a legible format. It’s now way easier on the eye.

But does this make it any easier to read the data back in?

That’s a good point.

This problem is not unlike the problem from the beginning of the chapter, in that you’ve got lines of text in a disk file that you need to process, only now you have two files instead of one.

You know how to write the code to process your new files, but writing custom code like this is specific to the format that you’ve created for this problem. This is brittle: if the data format changes, your custom code will have to change, too.

Ask yourself: is it worth it?

Custom Code Exposed

This week’s interview: When is custom code appropriate?

Head First: Hello, CC, how are you today?

Custom Code: Hi, I’m great! And when I’m not great, there’s always something I can do to fix things. Nothing’s too much trouble for me. Here: have a seat.

Head First: Why, thanks.

Custom Code: Let me get that for you. It’s my new custom SlideBack&Groove™, the 2011 model, with added cushions and lumbar support...and it automatically adjusts to your body shape, too. How does that feel?

Head First: Actually [relaxes], that feels kinda groovy.

Custom Code: See? Nothing’s too much trouble for me. I’m your “go-to guy.” Just ask; absolutely anything’s possible when it’s a custom job.

Head First: Which brings me to why I’m here. I have a “delicate” question to ask you.

Custom Code: Go ahead, shoot. I can take it.

Head First: When is custom code appropriate?

Custom Code: Isn’t it obvious? It’s always appropriate.

Head First: Even when it leads to problems down the road?

Custom Code: Problems?!? But I’ve already told you: nothing’s too much trouble for me. I live to customize. If it’s broken, I fix it.

Head First: Even when a readymade solution might be a better fit?

Custom Code: Readymade? You mean (I hate to say it): off the shelf?

Head First: Yes. Especially when it comes to writing complex programs, right?

Custom Code: What?!? That’s where I excel: creating beautifully crafted custom solutions for folks with complex computing problems.

Head First: But if something’s been done before, why reinvent the wheel?

Custom Code: But everything I do is custom-made; that’s why people come to me...

Head First: Yes, but if you take advantage of other coders’ work, you can build your own stuff in half the time with less code. You can’t beat that, can you?

Custom Code: “Take advantage”...isn’t that like exploitation?

Head First: More like collaboration, sharing, participation, and working together.

Custom Code: [shocked] You want me to give my code...away?

Head First: Well...more like share and share alike. I’ll scratch your back if you scratch mine. How does that sound?

Custom Code: That sounds disgusting.

Head First: Very droll [laughs]. All I’m saying is that it is not always a good idea to create everything from scratch with custom code when a good enough solution to the problem might already exist.

Custom Code: I guess so...although it won’t be as perfect a fit as that chair.

Head First: But I will be able to sit on it!

Custom Code: [laughs] You should talk to my buddy Pickle...he’s forever going on about stuff like this. And to make matters worse, he lives in a library.

Head First: I think I’ll give him a shout. Thanks!

Custom Code: Just remember: you know where to find me if you need any custom work done.

Pickle your data

Python ships with a standard library module called pickle, which can save and load almost any Python data object, including lists.

Once you pickle your data to a file, it is persistent and ready to be read into another program at some later date/time:

You can, for example, store your pickled data on disk, put it in a database, or transfer it over a network to another computer.

When you are ready, reversing this process unpickles your persistent pickled data and recreates your data in its original form within Python’s memory:

Save with dump and restore with load

Using pickle is straightforward: import the required module, then use dump() to save your data and, some time later, load() to restore it. The only requirement when working with pickled files is that they have to be opened in binary access mode:
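
Here's a minimal sketch of the round trip. The filename mydata.pickle, the scratch folder, and the sample list are all assumptions chosen for the example.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'mydata.pickle')   # scratch file

# Save: note the 'wb' (write, binary) access mode.
with open(path, 'wb') as mysavedata:
    pickle.dump([1, 2, 'three'], mysavedata)

# Restore: note the 'rb' (read, binary) access mode.
with open(path, 'rb') as myrestoredata:
    a_list = pickle.load(myrestoredata)

print(a_list)
```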

What if something goes wrong?

If something goes wrong when pickling or unpickling your data, the pickle module raises an exception of type PickleError.

Sharpen your pencil

Here’s a snippet of your code as it currently stands. Grab your pencil and strike out the code you no longer need, and then replace it with code that uses the facilities of pickle instead. Add any additional code that you think you might need, too.

try:
    with open('man_data.txt', 'w') as man_file, open('other_data.txt', 'w') as other_file:
        nester.print_lol(man, fh=man_file)
        nester.print_lol(other, fh=other_file)
except IOError as err:
    print('File error: ' + str(err))
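
One possible answer, offered as a sketch rather than the official solution: swap the nester calls for pickle.dump(), open the files in binary mode, and add a second except suite for pickling problems. The sample lists and the scratch folder are assumptions so that the code runs on its own.

```python
import os
import pickle
import tempfile

# Sample data standing in for the lists built earlier in the chapter.
man = ['Is this the right room for an argument?', "No you haven't!"]
other = ["I've told you once."]

folder = tempfile.mkdtemp()                        # scratch folder (an assumption)
man_path = os.path.join(folder, 'man_data.txt')
other_path = os.path.join(folder, 'other_data.txt')

try:
    with open(man_path, 'wb') as man_file, open(other_path, 'wb') as other_file:
        pickle.dump(man, man_file)
        pickle.dump(other, other_file)
except IOError as err:
    print('File error: ' + str(err))
except pickle.PickleError as perr:
    print('Pickling error: ' + str(perr))
```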

There are no Dumb Questions

Q: When you invoked print_lol() earlier, you provided only two arguments, even though the function signature lists four. How is this possible?
A: When you invoke a Python function in your code, you have options, especially when the function provides default values for some arguments. If you use positional arguments, the position of the argument in your function invocation dictates what data is assigned to which argument. When the function has arguments that also provide default values, you do not always need to supply a value for every position.
Q: OK, you’ve completely lost me. Can you explain?
A: Consider print(), which has this signature: print(value, sep=' ', end='\n', file=sys.stdout). By default, this BIF displays to standard output (the screen), because it has an argument called file with a default value of sys.stdout. The file argument comes fourth in the signature. However, when you want to send data to something other than the screen, you do not need to (nor want to have to) include values for sep and end. They have default values anyway, so you need to provide values for them only if the defaults are not what you want. If all you want to do is to send data to a file, you name the argument when you invoke the print() BIF, like this: print("Dead Parrot Sketch", file=myfavmonty), where myfavmonty is a file object that you have already opened for writing (a filename string won’t work here). The file argument uses the value you specify, while the other arguments use their defaults. In Python, not only do the BIFs work this way, but your custom functions support this mechanism, too.
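
To see default values and named arguments at work, try a sketch like this one. The greet() function is invented purely for illustration, and io.StringIO is an in-memory stand-in for a file object opened for writing.

```python
import io

def greet(name, greeting='Hello', end_mark='!'):   # hypothetical function
    return greeting + ', ' + name + end_mark

first = greet('Brian')                  # both defaults used
second = greet('Brian', end_mark='?')   # skip greeting, name the one you want

fake_file = io.StringIO()               # behaves like an opened file
print("Dead Parrot Sketch", file=fake_file)   # sep and end keep their defaults

print(first)
print(second)
```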

Test Drive

Let’s see what happens now that your code has been amended to use the standard pickle module instead of your custom nester module. Load your amended code into IDLE and press F5 to run it.

So, once again, let’s check the contents of the files to see what they look like now:

It appears to have worked...but these files look like gobbledygook! What gives?

Recall that Python, not you, is pickling your data. To do so efficiently, Python’s pickle module uses a custom binary format (known as its protocol). As you can see, viewing this format in your editor looks decidedly weird.

Don’t worry: it is supposed to look like this.

An IDLE Session

pickle really shines when you load some previously pickled data into another program. And, of course, there’s nothing to stop you from using pickle with nester. After all, each module is designed to serve different purposes. Let’s demonstrate with a handful of lines of code within IDLE’s shell. Start by importing any required modules:

>>> import pickle
>>> import nester

No surprises there, eh?

Next up: create a new identifier to hold the data that you plan to unpickle. Create an empty list called new_man:

>>> new_man = []

Yes, almost too exciting for words, isn’t it? With your list created, let’s load your pickled data into it. As you are working with external data files, it’s best if you enclose your code with try/except:

>>> try:
        with open('man_data.txt', 'rb') as man_file:
                new_man = pickle.load(man_file)
except IOError as err:
        print('File error: ' + str(err))
except pickle.PickleError as perr:
        print('Pickling error: ' + str(perr))

This code is not news to you either. However, at this point, your data has been unpickled and assigned to the new_man list. It’s time for nester to do its stuff:

And to finish off, let’s display the first line spoken as well as the last:
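
The whole session can be approximated in one self-contained sketch. Two things here are assumptions: the short sample list, which is pickled first so that there is something to load, and the plain for loop, which stands in for nester.print_lol(new_man) in case nester is not installed where you run this.

```python
import os
import pickle
import tempfile

# Setup (an assumption): pickle a short sample list to a scratch file.
man = ['Is this the right room for an argument?', "No you haven't!", 'Yes I have.']
path = os.path.join(tempfile.mkdtemp(), 'man_data.txt')
with open(path, 'wb') as man_file:
    pickle.dump(man, man_file)

new_man = []
try:
    with open(path, 'rb') as man_file:
        new_man = pickle.load(man_file)
except IOError as err:
    print('File error: ' + str(err))
except pickle.PickleError as perr:
    print('Pickling error: ' + str(perr))

for each_line in new_man:      # stand-in for nester.print_lol(new_man)
    print(each_line)

print(new_man[0])              # first line spoken
print(new_man[-1])             # last line spoken
```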

Generic file I/O with pickle is the way to go!

Python takes care of your file I/O details, so you can concentrate on what your code actually does or needs to do.

As you’ve seen, being able to work with, save, and restore data in lists is a breeze, thanks to Python. But what other data structures does Python support out of the box?

Let’s dive into Chapter 5 to find out.

Your Python Toolbox

You’ve got Chapter 4 under your belt and you’ve added some key Python techniques to your toolbox.

Python Lingo

“Immutable types” - data types in Python that, once assigned a value, cannot have that value changed.
“Pickling” - the process of saving a data object to persistent storage.
“Unpickling” - the process of restoring a saved data object from persistent storage.

Bullet Points

The strip() method removes unwanted whitespace from strings.
The file argument to the print() BIF controls where data is sent/saved.
The finally suite is always executed no matter what exceptions occur within a try/except statement.
An exception object is passed into the except suite and can be assigned to an identifier using the as keyword.
The str() BIF can be used to access the stringed representation of any data object that supports the conversion.
The locals() BIF returns a collection of variables within the current scope.
The in operator tests for membership.
The “+” operator concatenates two strings when used with strings but adds two numbers together when used with numbers.
The with statement automatically arranges to close all opened files, even when exceptions occur. The with statement uses the as keyword, too.
sys.stdout is what Python calls “standard output” and is available from the standard library’s sys module.
The standard library’s pickle module lets you easily and efficiently save and restore Python data objects to disk.
The pickle.dump() function saves data to disk.
The pickle.load() function restores data from disk.
