So far in the book, we've written small single-use programs, each for one specific task. In this section, we diverge from that approach by walking through the development of a Type Three Requester robot whose internals are modular enough that, with only minor modification, it could serve as any sort of Type Three or Type Four Requester.
The specific task for our program is checking all the links in a given web site. This means spidering the site, i.e., requesting every page in the site. To do that, we request a page in the site (or a few pages), then consider each link on that page. If a link points somewhere offsite, we just check that its URL is retrievable. If it points to a URL within this site, we not only check that the URL is retrievable, but actually retrieve it and see what links it has, and so on, until we have retrieved every page on the site and checked every link.
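Before we get to the actual program, here's a minimal sketch of that retrieve-and-scan loop, assuming LWP::UserAgent and HTML::LinkExtor. The queue, the hash of seen URLs, and the names here are illustrative assumptions, not the design we'll end up with:

    #!/usr/bin/perl
    # A rough sketch of the spidering loop described above -- not the
    # robot we develop in this section, just the bare idea.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my $start = 'http://www.mybalalaika.com/oggs/';  # the example's starting URL
    my $ua    = LWP::UserAgent->new( agent => 'hypothetical-spider/0.1' );

    my @to_visit = ($start);         # same-site URLs we still have to retrieve
    my %seen     = ( $start => 1 );  # every URL we've queued, so we handle each only once

    while ( my $url = shift @to_visit ) {
        my $response = $ua->get($url);
        unless ( $response->is_success ) {
            print "Can't get $url: ", $response->status_line, "\n";
            next;
        }
        next unless $response->content_type eq 'text/html';  # only HTML has links to scan

        # Given a base URL, HTML::LinkExtor hands back absolute URLs.
        my $extor = HTML::LinkExtor->new( undef, $response->base );
        $extor->parse( $response->decoded_content );

        for my $link ( $extor->links ) {
            my ( $tag, %attr ) = @$link;
            next unless $tag eq 'a' and defined $attr{href};  # this sketch follows only <a href> links
            my $abs = "$attr{href}";  # stringify the URI object
            $abs =~ s/#.*//s;         # a fragment doesn't make it a different page
            next if $seen{$abs}++;
            if ( index( $abs, $start ) == 0 ) {    # in this site: retrieve and scan it too
                push @to_visit, $abs;
            }
            else {                                 # offsite: just check that it's there
                my $check = $ua->head($abs);
                print "Broken link to $abs on $url\n" unless $check->is_success;
            }
        }
    }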
So, for example, if I start the spider out at http://www.mybalalaika.com/oggs/, it will request that page, get back HTML, and analyze that HTML for links. Suppose that page contains only three links:
  http://bazouki-consortium.int/
  http://www.mybalalaika.com/oggs/studio_credits.html
  http://www.mybalalaika.com/oggs/plinky.ogg
We can tell that the first URL is not part of this site; in fact, we will define "site" in terms of URLs, so a URL is part of this site if it starts with this site's URL. So because http://bazouki-consortium.int doesn't start with http://www.mybalalaika.com/oggs/, it isn't part of this site, and we need only check that it's retrievable.
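That "starts with" rule is nothing more than a string-prefix test. Here's a tiny sketch of it in isolation (the sub name in_site is hypothetical):

    use strict;
    use warnings;

    my $site = 'http://www.mybalalaika.com/oggs/';  # the site, defined by its base URL

    # A URL is part of the site exactly when it begins with the site's URL.
    sub in_site {
        my ($url) = @_;
        return index( $url, $site ) == 0;
    }

    print in_site('http://bazouki-consortium.int/')
        ? "in site\n" : "offsite\n";    # prints "offsite"
    print in_site('http://www.mybalalaika.com/oggs/plinky.ogg')
        ? "in site\n" : "offsite\n";    # prints "in site"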