There is a working example in the 06/10_file_cache.py script. In Scrapy, caching middleware is disabled by default. To enable this cache, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (using a relative path will create the directory in the project's data folder). To demonstrate, this script runs a crawl of the NASA site, and caches the content. It is configured using the following:
if __name__ == "__main__": process = CrawlerProcess({ 'LOG_LEVEL': 'CRITICAL', 'CLOSESPIDER_PAGECOUNT': 50, 'HTTPCACHE_ENABLED': True, 'HTTPCACHE_DIR': "." }) process.crawl(Spider) process.start()
We ask Scrapy to cache using files and to create a sub-directory in the current folder. We also instruct it to limit ...