Data flow in Scrapy

The data flow in Scrapy is controlled by the execution engine and goes like this:

  1. The process starts with the engine locating the chosen spider and opening the first URL from its list of start_urls.
  2. That URL is then scheduled as a request with the scheduler. This part is internal to Scrapy.
  3. The Scrapy engine then asks the scheduler for the next URLs to crawl.
  4. The scheduler sends the next URLs to the engine, which forwards them to the downloader through the downloader middlewares. These middlewares are where we plug in proxy and user-agent settings (see the middleware sketch after this list).
  5. The downloader downloads the response for the page and passes it to the spider, where the parse method selects specific elements from the response (as in the spider sketch below).
  6. Then, the spider sends ...
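To make the flow concrete, here is a minimal spider sketch. The spider name, the quotes.toscrape.com demo URL, and the CSS selectors are illustrative assumptions, not taken from the text; the engine opens the start_urls entries (step 1), schedules them as requests (step 2), and routes each downloaded response into parse (step 5):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # The engine locates the spider by this name and opens its start_urls.
    name = "quotes"  # assumed name for illustration
    start_urls = ["http://quotes.toscrape.com/"]  # assumed demo URL

    def parse(self, response):
        # The downloader's response arrives here; the parse method
        # selects specific elements from it with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```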

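Step 4's downloader middlewares sit between the engine and the downloader, which is why proxy and user-agent settings live there: process_request runs on every request before it is downloaded. A minimal sketch, assuming a hypothetical myproject.middlewares module and placeholder proxy and user-agent values:

```python
class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Runs for every request the engine forwards to the downloader:
        # override the User-Agent and route through a proxy here.
        request.headers["User-Agent"] = "my-crawler/1.0"  # assumed UA string
        request.meta["proxy"] = "http://127.0.0.1:8080"   # placeholder proxy
        return None  # returning None lets the request continue downstream
```

To activate a middleware like this, it would be registered in the project's settings.py under DOWNLOADER_MIDDLEWARES, e.g. {"myproject.middlewares.CustomHeadersMiddleware": 543}, where the number sets its position in the middleware chain.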