Setting the number of concurrent requests per domain

It is generally inefficient to crawl a site one URL at a time, so a crawler normally has several page requests to the target site in flight at any given moment. The remote web server can usually handle multiple simultaneous requests quite effectively, and on your end you are simply waiting for data to come back for each one, so concurrency generally works well for your scraper.

But this is also a pattern that smart websites can identify and flag as suspicious activity. There are also practical limits on both ends: the more concurrent requests you make, the more memory, CPU, network connections, and network bandwidth are required on both your crawler and the website's server. ...
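For reference, a minimal sketch of how these limits can be tuned through Scrapy's concurrency settings is shown below. The spider name, the target URL, and the specific values chosen are illustrative assumptions rather than the recipe's exact code; the `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` settings themselves are standard Scrapy settings.

```python
# A minimal sketch, assuming Scrapy is installed; the spider name and the
# target site (books.toscrape.com, a common practice site) are placeholders.
import scrapy
from scrapy.crawler import CrawlerProcess


class ThrottledSpider(scrapy.Spider):
    name = "throttled"
    start_urls = ["http://books.toscrape.com/"]

    # Per-spider overrides of the concurrency settings; the same keys can
    # also be set project-wide in settings.py.
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,            # total simultaneous requests (Scrapy's default)
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap on simultaneous requests to any one domain
    }

    def parse(self, response):
        # Follow every link on the page; the scheduler enforces the caps above,
        # so no more than four requests to this domain run at once.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ThrottledSpider)
    process.start()
```

Lowering `CONCURRENT_REQUESTS_PER_DOMAIN` below the global `CONCURRENT_REQUESTS` value is the usual way to stay polite to a single site while still crawling multiple domains quickly.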
