Crawling using the sitemap

A sitemap is a file, defined by the Sitemaps protocol, that lets a webmaster inform search engines about the URLs on a website that are available for crawling. Webmasters publish sitemaps because they want their content to be found, at least through search engines, but you can use the same information to your advantage when scraping.

A sitemap lists the URLs on a site and allows a webmaster to specify additional information about each URL (illustrated in the sketch after this list):

  • When it was last updated
  • How often the content changes
  • How important the URL is in relation to others
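As a quick illustration, here is a minimal sketch of reading a sitemap directly, using the requests library and the standard library's xml.etree.ElementTree. The URL is a placeholder, and the sketch assumes the site publishes its sitemap at the conventional /sitemap.xml location:

    import requests
    from xml.etree import ElementTree

    SITEMAP_URL = "http://example.com/sitemap.xml"  # hypothetical location
    # The XML namespace defined by the Sitemaps protocol, needed to
    # qualify tag lookups.
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    response = requests.get(SITEMAP_URL)
    response.raise_for_status()
    root = ElementTree.fromstring(response.content)

    # Each <url> entry holds the page address plus the optional metadata
    # fields listed above; findtext() returns None for absent fields.
    for url in root.findall("sm:url", NS):
        print(url.findtext("sm:loc", namespaces=NS),
              url.findtext("sm:lastmod", namespaces=NS),
              url.findtext("sm:changefreq", namespaces=NS),
              url.findtext("sm:priority", namespaces=NS))

Note that lastmod, changefreq, and priority are all optional in the protocol, so a crawler should treat missing values as unknown rather than assuming defaults.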

Sitemaps are useful on websites where:

  • Some areas of the website are not available through the browsable interface; ...
