Chapter 16. Web Crawling in Parallel

Web crawling is fast. At least, it’s usually much faster than hiring a dozen interns to copy data from the internet by hand! Of course, the progression of technology and the hedonic treadmill demand that at a certain point even this will not be “fast enough.” That’s the point at which people generally start to look toward distributed computing.

Unlike in most other technology fields, web crawling often cannot be improved simply by “throwing more cycles at the problem.” Running one process is fast; running two processes is not necessarily twice as fast. Running three processes might get you banned from the remote server you’re hammering with all your requests!

However, in some situations parallel web crawling, or running parallel threads/processes, can still be of benefit (see the sketch following this list):

  • Collecting data from multiple sources (multiple remote servers) instead of just a single source

  • Performing long/complex operations on the collected data (such as doing image analysis or OCR) that could be done in parallel with fetching the data

  • Collecting data from a large web service where you are paying for each query, or where creating multiple connections to the service is within the bounds of your usage agreement
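To make the first scenario concrete, here is a minimal sketch that fetches pages from several unrelated servers in parallel using only Python’s standard library. The URLs are placeholders, and allocating one worker per server means no single remote host sees more than one connection at a time:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Placeholder list of unrelated sites; each worker targets a different server
    urls = [
        'https://example.com/',
        'https://example.org/',
        'https://example.net/',
    ]

    def fetch(url):
        # Fetch a single page and return its URL and size in bytes
        with urlopen(url) as response:
            return url, len(response.read())

    # One thread per source server, so no server receives parallel requests
    with ThreadPoolExecutor(max_workers=len(urls)) as executor:
        for url, size in executor.map(fetch, urls):
            print(f'{url}: {size:,} bytes')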

Processes versus Threads

Python supports both multiprocessing and multithreading. Both approaches achieve the same ultimate goal: performing two programming tasks at the same time instead of running the program in a more traditional linear fashion.
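As a rough illustration of that equivalence (the count_to function and its argument are arbitrary stand-ins, not taken from any particular crawler), the same callable can be launched either way:

    import threading
    from multiprocessing import Process

    def count_to(n):
        # A stand-in for any long-running task a crawler might perform
        total = 0
        for i in range(n):
            total += i
        print(f'Summed the first {n} integers: {total}')

    if __name__ == '__main__':
        # Run the task in a second thread within this process...
        thread = threading.Thread(target=count_to, args=(1_000_000,))
        thread.start()
        thread.join()

        # ...or in a separate process with its own interpreter and memory space
        process = Process(target=count_to, args=(1_000_000,))
        process.start()
        process.join()

The calling convention is nearly identical, but the trade-off differs: threads share the parent process’s memory while contending for Python’s global interpreter lock, whereas separate processes avoid the GIL at the cost of copying data between them.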
