The recipe works as follows:
- The script starts by calling get_sitemap():
map = sitemap.get_sitemap("https://www.nasa.gov/sitemap.xml")
- This is given the URL of the sitemap.xml file (or of any other non-gzipped file). The implementation simply fetches the content at that URL and returns it:
def get_sitemap(url):
    get_url = requests.get(url)
    if get_url.status_code == 200:
        return get_url.text
    else:
        print('Unable to fetch sitemap: %s.' % url)
- The bulk of the work is done by passing that content to parse_sitemap(). In the case of nasa.gov, the content is a sitemap index file, which begins as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="//www.nasa.gov/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...
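To make the idea of a sitemap index more concrete, here is a minimal sketch of how the child sitemap URLs could be pulled out of such a document with the standard library's xml.etree.ElementTree. It is not the recipe's parse_sitemap() implementation: the function name list_child_sitemaps and the example.com sample data are hypothetical, and the sketch assumes the document uses the sitemaps.org 0.9 namespace shown above.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemaps.org 0.9 schema.
SITEMAP_NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
SITEMAP_NS = {"sm": SITEMAP_NS_URI}

# Hypothetical sample data; these URLs are made up for illustration.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""


def list_child_sitemaps(xml_text):
    """Return the <loc> URLs listed in a sitemap index document.

    Hypothetical helper: returns an empty list when the root element
    is not a <sitemapindex> in the sitemaps.org namespace.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}sitemapindex" % SITEMAP_NS_URI:
        return []
    # Each <sitemap> entry carries the URL of a child sitemap in <loc>.
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for child_url in list_child_sitemaps(SAMPLE_INDEX):
        print(child_url)
```

In practice, the text returned by get_sitemap() would take the place of SAMPLE_INDEX, and each URL found could then be fetched in turn.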