How to do it

There is a working example in the 06/10_file_cache.py script. In Scrapy, caching middleware is disabled by default. To enable this cache, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (using a relative path will create the directory in the project's data folder). To demonstrate, this script runs a crawl of the NASA site, and caches the content. It is configured using the following:

if __name__ == "__main__":    process = CrawlerProcess({        'LOG_LEVEL': 'CRITICAL',        'CLOSESPIDER_PAGECOUNT': 50,        'HTTPCACHE_ENABLED': True,        'HTTPCACHE_DIR': "."    })    process.crawl(Spider)    process.start()

We ask Scrapy to cache using files and to create a sub-directory in the current folder. We also instruct it to limit ...

Get Python Web Scraping Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.