How it works

  1. The script begins by importing reppy.robots:

from reppy.robots import Robots

  2. The code then uses Robots to fetch the robots.txt for amazon.com (if you already have the file's content in hand, it can also be parsed directly; see the sketch after this list):

url = "http://www.amazon.com"
robots = Robots.fetch(url + "/robots.txt")

  3. Using the content that was fetched, the script checks several URLs for accessibility:

paths = [
    '/',
    '/gp/dmusic/',
    '/gp/dmusic/promotions/PrimeMusic/',
    '/gp/registry/wishlist/'
]

for path in paths:
    print("{0}: {1}".format(robots.allowed(path, '*'), url + path))
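
As an aside, fetching over HTTP is not the only way to build a Robots object. A minimal sketch, assuming reppy's Robots.parse class method, which takes the file's URL and its content as a string (the rules below are made up for illustration):

from reppy.robots import Robots

# Parse robots.txt content you already have, rather than having
# reppy fetch it over HTTP. These rules are hypothetical.
content = """
User-agent: *
Disallow: /deny
"""

robots = Robots.parse("http://www.amazon.com/robots.txt", content)
print(robots.allowed('/deny', '*'))   # False
print(robots.allowed('/other', '*'))  # True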

Running the three steps above produces the following output:

True: http://www.amazon.com/
False: http://www.amazon.com/gp/dmusic/
True: http://www.amazon.com/gp/dmusic/promotions/PrimeMusic/
False: http://www.amazon.com/gp/registry/wishlist/

The call to robots.allowed is given the path to check and the user agent on whose behalf to check it ('*' here, meaning the rules that apply to any agent); it returns True if that agent may fetch the path and False otherwise.
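
When scraping as a named crawler, you would pass your own user agent string instead of the wildcard. A short sketch, assuming reppy's per-agent API (the agent name my-scraper is made up for illustration):

from reppy.robots import Robots

url = "http://www.amazon.com"
robots = Robots.fetch(url + "/robots.txt")

# Check a path for a specific (hypothetical) user agent rather than '*'.
print(robots.allowed('/gp/dmusic/', 'my-scraper'))

# reppy can also hand back the rules for a single agent, whose
# allowed() method then needs only the path.
agent = robots.agent('my-scraper')
print(agent.allowed('/gp/dmusic/'))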
