O'Reilly logo

Perl & LWP by Sean M. Burke

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Example: A Link-Checking Spider

So far in the book, we've produced little single-use programs that are for specific tasks. In this section, we will diverge from that approach by walking through the development of a Type Three Requester robot whose internals are modular enough that with only minor modification, it could be used as any sort of Type Three or Type Four Requester.

The Basic Spider Logic

The specific task for our program is checking all the links in a given web site. This means spidering the site, i.e., requesting every page in the site. To do that, we request a page in the site (or a few pages), then consider each link on that page. If it's a link to somewhere offsite, we should just check it. If it's a link to a URL that's in this site, we will not just check that the URL is retrievable, but in fact retrieve it and see what links it has, and so on, until we have gotten every page on the site and checked every link.

So, for example, if I start the spider out at http://www.mybalalaika.com/oggs/, it will request that page, get back HTML, and analyze that HTML for links. Suppose that page contains only three links:

http://bazouki-consortium.int/
http://www.mybalalaika.com/oggs/studio_credits.html
http://www.mybalalaika.com/oggs/plinky.ogg

We can tell that the first URL is not part of this site; in fact, we will define "site" in terms of URLs, so a URL is part of this site if starts with this site's URL. So because http://bazouki-consortium.int doesn't start with http://www.mybalalaika.com/oggs/ ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required