O'Reilly logo

Learning Scrapy by Dimitrios Kouzis-Loukas

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Changes to our spider and middleware

In order to build this system, we need to slightly modify our Scrapy spider and develop spider middleware. More specifically we will have to perform the following:

  • Fine tune crawling the index to perform at maximum speed
  • Write a middleware that batches and sends URLs to scrapyd servers
  • Use the same middleware to allow batch URLs to be used at start-up

We will try to implement these changes as unobtrusively as possible. Ideally, the whole operation should be clean, easy to understand, and transparent to the underlying spider code. This is an infrastructure-level requirement and hacking (potentially hundreds) of spiders to enable it is a bad idea.

Sharded-index crawling

Our first step is to optimize index crawling ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required