Up to now, we have been working with a single URL that we set in the spider's
start_urls property. Since that's a tuple, we can hardcode multiple URLs, for example:
start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)
Not that exciting. We might also use a file as the source of those URLs as follows:
start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]
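A minor refinement on the one-liner above: wrapping the read in a `with` block closes the file handle promptly and skipping blank lines avoids feeding empty URLs to the spider. The filename `todo.urls.txt` is the same example file as above; the helper name is ours:

```python
def load_start_urls(path='todo.urls.txt'):
    """Return one URL per non-blank line of the given file."""
    with open(path) as f:
        # Strip whitespace/newlines and drop empty lines.
        return [line.strip() for line in f if line.strip()]

# A spider could then set: start_urls = load_start_urls()
```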
This is not very exciting either, but it certainly works. More often than not, though, the website of interest will have some index pages and some listing pages. For example, Gumtree has the following index pages: http://www.gumtree.com/flats-houses/london ...