Gap

Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:

Sitemap: http://www.gap.com/products/sitemap_index.xml
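
As a minimal sketch of how a crawler might pick this line out of a robots.txt file, the following Python 3 function scans the file body for Sitemap entries. The function name extract_sitemap_urls and the sample robots.txt body are assumptions made for illustration; only the Sitemap line itself comes from the site, and the rest of Gap's actual robots.txt is not reproduced here.

```python
def extract_sitemap_urls(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace
        line = line.split('#', 1)[0].strip()
        # Split only on the first colon so the URL's own colons survive
        field, sep, value = line.partition(':')
        # The field name is matched case-insensitively
        if sep and field.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


# Illustrative robots.txt body (hypothetical apart from the Sitemap line)
sample = """User-agent: *
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""

print(extract_sitemap_urls(sample))
```

Keeping the parsing separate from the download means the function can be checked against a saved copy of the file before any requests are sent to the live site.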

Here are the contents of the linked Sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>

As shown here, ...
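
The Sitemap index listed earlier can be read with Python's standard library. The sketch below, written for Python 3, is one possible approach rather than the book's own code: the function name parse_sitemap_index is an assumption, and the sample bytes simply repeat the index contents shown above instead of downloading them.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> root element
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def parse_sitemap_index(xml_bytes):
    """Return (loc, lastmod) pairs for each <sitemap> entry in a Sitemap index."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for sitemap in root.findall('sm:sitemap', SITEMAP_NS):
        loc = sitemap.findtext('sm:loc', default='', namespaces=SITEMAP_NS).strip()
        # lastmod is optional in the Sitemap protocol, so it may be None
        lastmod = sitemap.findtext('sm:lastmod', default=None, namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


# The index contents shown above, embedded as bytes for a self-contained run
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>"""

for loc, lastmod in parse_sitemap_index(sample):
    print(loc, lastmod)
```

Each loc value is itself the URL of another Sitemap file, so a crawler would download and parse those in turn to reach the individual page links.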

Web Scraping with Python by Richard Lawson
