Gap
Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:
Sitemap: http://www.gap.com/products/sitemap_index.xml
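As a minimal sketch of how such a link could be picked out of a robots.txt file, the following function scans the file body for Sitemap lines. The function name find_sitemaps and the sample body are assumptions made for illustration; only the Sitemap line itself comes from the text above, and the sample is an inline string so that no network access is needed.

```python
def find_sitemaps(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop any trailing comment, then split on the first colon only,
        # because the URL itself also contains colons.
        line = line.split('#', 1)[0].strip()
        field, _, value = line.partition(':')
        if field.strip().lower() == 'sitemap' and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Illustrative robots.txt body: the User-agent and Disallow lines are
# hypothetical; only the Sitemap line is taken from the text.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /hypothetical-path/
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""

print(find_sitemaps(SAMPLE_ROBOTS))
```

The field name is compared case-insensitively here, since robots.txt files in the wild vary in how they capitalize it.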
Here are the contents of the linked Sitemap file:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>
As shown here, ...
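The Sitemap index shown earlier can be read programmatically. The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree is acceptable for the job; the function name parse_sitemap_index and the SITEMAP_INDEX constant are illustrative names, and the XML is embedded as a string rather than downloaded.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> element; every child tag
# lives in it, so lookups must be namespace-qualified.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Copy of the Sitemap index from the text, embedded to avoid a download.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>"""


def parse_sitemap_index(xml_text):
    """Return a list of (loc, lastmod) pairs from a Sitemap index document."""
    root = ET.fromstring(xml_text)
    entries = []
    for sitemap in root.findall('sm:sitemap', NS):
        loc = sitemap.findtext('sm:loc', default='', namespaces=NS).strip()
        lastmod = sitemap.findtext('sm:lastmod', default='', namespaces=NS).strip()
        entries.append((loc, lastmod))
    return entries


for loc, lastmod in parse_sitemap_index(SITEMAP_INDEX):
    print(loc, lastmod)
```

Each pair names one child Sitemap file and the date it was last modified, which a crawler could use to decide which files are worth fetching again.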