The Sitemap spider
If the site provides a sitemap.xml file, then a better way to crawl the site is to use SitemapSpider instead.
Here, given sitemap.xml, the spider parses the URLs provided by the site itself. This is a more polite way of crawling and is good practice:
>>> # Scrapy 1.0 and later; older releases used scrapy.contrib.spiders
>>> from scrapy.spiders import SitemapSpider
>>> class MySpider(SitemapSpider):
>>>     sitemap_urls = ['http://www.example.com/sitemap.xml']
>>>     sitemap_rules = [('/electronics/', 'parse_electronics'), ('/apparel/', 'parse_apparel')]
>>>     def parse_electronics(self, response):
>>>         # you need to create an item for electronics
>>>         return
>>>     def parse_apparel(self, response):
>>>         # you need to create an item for apparel
>>>         return
In the preceding code, we wrote one parse method for ...
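To make the routing idea concrete, the following is a minimal standalone sketch, not Scrapy's actual implementation. It assumes, as the Scrapy documentation describes, that each pattern in sitemap_rules is treated as a regular expression, that rules are tried in order, and that the first match decides the callback. The sample sitemap content, the extra /about URL, and the helper names extract_urls and route are hypothetical and exist only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sitemap content standing in for http://www.example.com/sitemap.xml
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/electronics/tv-1</loc></url>
  <url><loc>http://www.example.com/apparel/shirt-7</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>
"""

# XML namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The same (pattern, callback name) pairs used in the spider's sitemap_rules
SITEMAP_RULES = [
    ("/electronics/", "parse_electronics"),
    ("/apparel/", "parse_apparel"),
]


def extract_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def route(url, rules):
    """Return the callback name of the first rule whose regex matches, else None."""
    for pattern, callback_name in rules:
        if re.search(pattern, url):
            return callback_name
    return None  # no rule matched, so the URL would be skipped


# Map each URL found in the sitemap to the callback that would handle it
routes = {url: route(url, SITEMAP_RULES) for url in extract_urls(SITEMAP_XML)}

for url, callback_name in routes.items():
    print(url, "->", callback_name)
```

In this sketch, the /about URL matches no rule and maps to None, which mirrors the idea that only the URLs of interest reach a parse method.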