Comparing Directory Trees

Engineers can be a paranoid sort (but you didn’t hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there’s still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it’s tough to have a lot of faith in tools that occasionally trash files and seem to crash my Windows machine every third Tuesday of the month. When push comes to shove, it’s nice to be able to verify that data copied to a backup CD is the same as the original—or at least to spot deviations from the original—as soon as possible. If a backup is ever needed, it will be really needed.

Because data CDs are accessible as simple directory trees, we are once again in the realm of tree walkers—to verify a backup CD, we simply need to walk its top-level directory. If our script is general enough, we will also be able to use it to verify other copy operations as well—e.g., downloaded tar files, hard-drive backups, and so on. In fact, the combination of the cpall script we saw earlier and a general tree comparison would provide a portable and scriptable way to copy and verify data sets.

We’ve already written a generic walker class (the visitor module), but it won’t help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won’t ...

Get Programming Python, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.