Breadth-first crawling

A breadth-first crawler gives priority to finding new domains and spreading out as far as possible, rather than working through a single domain in a depth-first manner.

Writing a breadth-first crawler is left as an exercise for the reader, based on the information provided in this chapter. It is not very different from the depth-first crawler in the previous section, except that it should prioritize URLs that point to domains that have not been seen before.
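Without giving away the whole exercise, the prioritization step alone might be sketched as follows. This is only one possible approach, not the crawler from this chapter: the frontier type, its Add and Next methods, and the example.com-style URLs are all hypothetical names made up for illustration. The idea is to keep two queues and always drain the queue of URLs on unseen hosts first.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// frontier is a hypothetical queue of URLs waiting to be crawled.
// URLs on a host that has not been seen before go into fresh;
// URLs on already-seen hosts go into known.
type frontier struct {
	seenHosts map[string]bool
	seenURLs  map[string]bool
	fresh     []string
	known     []string
}

func newFrontier() *frontier {
	return &frontier{
		seenHosts: make(map[string]bool),
		seenURLs:  make(map[string]bool),
	}
}

// Add queues rawURL unless it is malformed or was queued before.
func (f *frontier) Add(rawURL string) {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" || f.seenURLs[rawURL] {
		return
	}
	f.seenURLs[rawURL] = true

	host := strings.ToLower(u.Hostname())
	if !f.seenHosts[host] {
		f.seenHosts[host] = true
		f.fresh = append(f.fresh, rawURL)
		return
	}
	f.known = append(f.known, rawURL)
}

// Next returns the next URL to fetch, preferring URLs on new hosts.
// The boolean is false when nothing is left to crawl.
func (f *frontier) Next() (string, bool) {
	if len(f.fresh) > 0 {
		next := f.fresh[0]
		f.fresh = f.fresh[1:]
		return next, true
	}
	if len(f.known) > 0 {
		next := f.known[0]
		f.known = f.known[1:]
		return next, true
	}
	return "", false
}

func main() {
	f := newFrontier()
	for _, link := range []string{
		"https://a.example/1",
		"https://a.example/2",
		"https://b.example/1",
	} {
		f.Add(link)
	}

	// The second a.example link is held back until b.example is visited.
	for {
		next, ok := f.Next()
		if !ok {
			break
		}
		fmt.Println(next)
	}
}
```

In a real crawler, every link extracted from a fetched page would be passed to Add, and the fetch loop would keep calling Next until the frontier is empty or a limit is reached.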

Keep a couple of things in mind. If you're not careful and you don't set a maximum limit, you could end up crawling petabytes of data! You might choose to ignore subdomains, or you can enter a site that has infinite subdomains ...
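Both precautions can be combined in a small guard that is consulted before each request. The sketch below is an assumption-laden illustration rather than a recommended implementation: budget, newBudget, Allow, and siteKey are hypothetical names, and siteKey naively treats the last two labels of a hostname as the site, which is wrong for suffixes such as co.uk (a public suffix list would be needed to handle those properly).

```go
package main

import (
	"fmt"
	"strings"
)

// siteKey collapses a hostname to its last two labels so that all
// subdomains count against the same site. This is a naive assumption:
// it gives wrong answers for multi-label suffixes such as co.uk.
func siteKey(host string) string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	labels := strings.Split(host, ".")
	if len(labels) <= 2 {
		return host
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// budget is a hypothetical guard that caps the total number of pages
// fetched and the number of pages fetched per site.
type budget struct {
	maxPages   int
	maxPerSite int
	total      int
	perSite    map[string]int
}

func newBudget(maxPages, maxPerSite int) *budget {
	return &budget{
		maxPages:   maxPages,
		maxPerSite: maxPerSite,
		perSite:    make(map[string]int),
	}
}

// Allow reports whether one more page on host may be fetched and,
// if so, counts it against both limits.
func (b *budget) Allow(host string) bool {
	key := siteKey(host)
	if b.total >= b.maxPages || b.perSite[key] >= b.maxPerSite {
		return false
	}
	b.total++
	b.perSite[key]++
	return true
}

func main() {
	// At most 3 pages overall and 2 pages per site.
	b := newBudget(3, 2)
	for _, host := range []string{
		"a.example.com",
		"b.example.com",
		"c.example.com", // third subdomain of the same site
		"other.org",
		"another.net", // arrives after the overall limit is reached
	} {
		fmt.Println(host, b.Allow(host))
	}
}
```

Counting per site instead of per hostname means a site that generates endless subdomains uses up its own allowance and nothing more.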
