Disk cache
To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:
Operating system |
File system |
Invalid filename characters |
Maximum filename length |
---|---|---|---|
Linux |
Ext3/Ext4 |
/ and \0 |
255 bytes |
OS X |
HFS Plus |
: and \0 |
255 UTF-16 code units |
Windows |
NTFS |
\, /, ?, :, *, ", >, <, and | |
255 characters |
To keep our file path safe across these filesystems, it needs to be restricted to numbers, letters, basic punctuation, and replace all other characters with an underscore, as shown in the following code:
>>> import re >>> url = 'http://example.webscraping.com/default/view/Australia-1' ...
Get Web Scraping with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.