How it works

The spider is defined as follows:

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print("Parsing: ", response)
        print(response.request.meta.get('redirect_urls'))

This is identical to our previous NASA sitemap-based crawler, with the addition of one line that prints redirect_urls. In any call to parse, this metadata key holds the list of URLs that were redirected on the way to the current page. If the request was not redirected, the key is absent and get returns None.
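To make the shape of this metadata concrete, here is a minimal sketch in plain Python. The helper describe_redirects is hypothetical (it is not part of Scrapy) and assumes only that meta behaves like a dict whose redirect_urls entry, when present, is a list of the URLs that were redirected, in order.

```python
def describe_redirects(meta, final_url):
    """Return the redirect chain as a readable string.

    Hypothetical helper, not a Scrapy API. `meta` stands in for
    response.request.meta and `final_url` for response.url.
    """
    # The key is missing when no redirect happened, so fall back to [].
    chain = list(meta.get("redirect_urls") or [])
    # The page that was finally downloaded ends the chain.
    chain.append(final_url)
    return " -> ".join(chain)
```

Inside parse, this could be called as describe_redirects(response.request.meta, response.url) in place of the bare print.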

The crawling process is configured with the following code:

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
    },
    'REDIRECT_ENABLED': True,
    'REDIRECT_MAX_TIMES': 2
})
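The effect of REDIRECT_MAX_TIMES can be illustrated with a small simulation. This is a simplified model written for illustration, not Scrapy's implementation: follow_redirects and redirect_map are hypothetical names, and the model assumes that a request is dropped as soon as it would need more redirects than the limit allows. Unlike Scrapy, it returns an empty list rather than omitting the value when no redirect occurs.

```python
from typing import Dict, List, Optional, Tuple


def follow_redirects(
    start_url: str,
    redirect_map: Dict[str, str],
    max_times: int,
) -> Tuple[Optional[str], List[str]]:
    """Follow redirects from start_url, allowing at most max_times hops.

    redirect_map is a hypothetical stand-in for the server: each key
    redirects to its value. Returns (final_url, redirect_urls), where
    final_url is None if the limit was exceeded.
    """
    redirect_urls: List[str] = []
    url = start_url
    while url in redirect_map:
        # One more hop would exceed the limit, so give up on the request.
        if len(redirect_urls) + 1 > max_times:
            return None, redirect_urls
        redirect_urls.append(url)
        url = redirect_map[url]
    return url, redirect_urls


# Three chained redirects: a -> b -> c -> d
redirect_map = {"a": "b", "b": "c", "c": "d"}
```

With a limit of 2, starting from "b" reaches "d" after two hops, while starting from "a" would need three hops and is abandoned.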
