Disk cache

To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:

Operating system   File system   Invalid filename characters      Maximum filename length
Linux              Ext3/Ext4     / and \0                         255 bytes
OS X               HFS Plus      : and \0                         255 UTF-16 code units
Windows            NTFS          \, /, ?, :, *, ", >, <, and |    255 characters

To keep our file path safe across these filesystems, we restrict it to numbers, letters, and basic punctuation, and replace every other character with an underscore, as shown in the following code:

>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
...
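As the listing above is cut off, here is a minimal sketch of the kind of mapping described, not necessarily the book's own code. The helper name `url_to_path`, the constant `MAX_SEGMENT_LENGTH`, and the exact set of allowed characters are assumptions chosen for illustration. The 255 limit comes from the table above and is applied per path segment, counting characters rather than bytes.

```python
import re

# Assumed limit, taken from the filesystem table above (applied per path segment).
MAX_SEGMENT_LENGTH = 255


def url_to_path(url):
    """Map a URL to a conservative, filesystem-safe relative path.

    Hypothetical helper for illustration only.
    """
    # Keep '/', digits, ASCII letters, and a few punctuation characters;
    # replace every other character with an underscore.
    safe = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
    # Truncate each segment so no single file or directory name is too long.
    return '/'.join(segment[:MAX_SEGMENT_LENGTH] for segment in safe.split('/'))


print(url_to_path('http://example.webscraping.com/default/view/Australia-1'))
# prints: http_//example.webscraping.com/default/view/Australia-1
```

Keeping `/` in the allowed set lets the URL path map onto directories, while characters such as `:`, `?`, and `*` that NTFS rejects become underscores. Because distinct URLs can collapse to the same path after substitution and truncation, a real cache might need an extra step to tell them apart.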
