Sitemap files may also be gzip-compressed, ending in a .gz extension. This is because a sitemap often contains many URLs, and compression saves a great deal of space. While the code we used does not process gzipped sitemap files, it is easy to add this capability using functions in the gzip library.
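A minimal sketch of that extension might look like the following. The get_sitemap_content helper and the use of requests here are illustrative assumptions, not part of the earlier code:

import gzip
import requests

def get_sitemap_content(url):
    # hypothetical helper: fetch a sitemap and return its XML as text
    response = requests.get(url)
    content = response.content
    # a .gz sitemap arrives as compressed bytes; decompress before parsing
    if url.endswith('.gz'):
        content = gzip.decompress(content)
    return content.decode('utf-8')

The only real change from a plain sitemap fetch is the call to gzip.decompress() when the URL ends in .gz; the decompressed bytes can then be parsed exactly as before.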
Scrapy also provides a facility for starting crawls from a sitemap: SitemapSpider, a specialization of the Spider class. This class knows how to parse the sitemap for you and then start following its URLs. To demonstrate, the script 05/03_sitemap_scrapy.py starts the crawl at the nasa.gov top-level sitemap index:
import scrapy
from scrapy.crawler import CrawlerProcess

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'