
Learning Scrapy by Dimitrios Kouzis-Loukas


Extracting more URLs

Up to now, we have been working with a single URL that we set in the spider's start_urls property. Since that's a tuple, we can hardcode multiple URLs, for example:

start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)

Not that exciting. We might also use a file as the source of those URLs as follows:

start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]
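The one-liner above leaves the file to be closed whenever the garbage collector gets around to it. A slightly more careful sketch of the same idea wraps the read in a helper that closes the file deterministically and skips blank lines; the file name `todo.urls.txt` is just the example name from above, and the helper is ours, not part of Scrapy:

```python
import os
import tempfile

def load_start_urls(path):
    """Return a list of non-empty, stripped URLs from a text file."""
    with open(path) as f:  # 'with' closes the file even on error
        return [line.strip() for line in f if line.strip()]

# Demo with a temporary file standing in for todo.urls.txt
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'todo.urls.txt')
    with open(path, 'w') as f:
        f.write('http://web:9312/properties/property_000000.html\n'
                '\n'  # blank lines are skipped
                'http://web:9312/properties/property_000001.html\n')
    urls = load_start_urls(path)
```

In a spider you would then set `start_urls = load_start_urls('todo.urls.txt')` at class level.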

This is not very exciting either, but it certainly works. What will happen more often than not is that the website of interest will have some index pages and some listing pages. For example, Gumtree has the following index pages: http://www.gumtree.com/flats-houses/london ...
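When index pages like these are paginated predictably, one common pattern is to generate their URLs programmatically instead of listing them by hand. The `?page=N` query parameter and the page count below are illustrative assumptions, not Gumtree's actual URL scheme:

```python
# Hypothetical: assume the index is paginated with a ?page=N parameter
# and we want the first three pages as our starting points.
base = 'http://www.gumtree.com/flats-houses/london'
start_urls = ['%s?page=%d' % (base, page) for page in range(1, 4)]
```

From such index pages, a spider would then extract the URLs of the individual listing pages, which is what the rest of this section builds toward.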
