Putting It All Together

Let’s take stock of what we’ve done so far. We’ve written a script that will descend recursively through a filesystem, reading in the contents of any HTML files it encounters and extracting all the <A HREF="..."> and <IMG SRC="..."> attributes from those files. We’ve also created a subroutine that will take a directory name and a list of links extracted from a file in that directory, identify which links point to local files, and convert them to full (that is, absolute) filesystem pathnames.
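
Just to keep the overall shape of that first stage in mind, here is a minimal sketch of how such a descend-and-extract pass could be written. This isn't our script's actual code: it assumes the standard File::Find module, uses a deliberately simplified regular expression for pulling out the attribute values, and $start_dir and process_file are placeholder names:

use strict;
use File::Find;

my $start_dir = '/w1/s/socalsail';    # placeholder starting directory

find(\&process_file, $start_dir);

sub process_file {
    # File::Find calls this once per file, with $_ set to the filename
    # (relative to the current directory, since find() chdirs as it goes).
    return unless -f $_ and /\.html?$/i;

    open my $fh, '<', $_ or die "can't open $File::Find::name: $!";
    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    # Grab the quoted values of the A HREF and IMG SRC attributes.
    my @links =
        $content =~ /<(?:a\s[^>]*href|img\s[^>]*src)\s*=\s*"([^"]+)"/gi;

    print "$File::Find::name: @links\n" if @links;
}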

The fast-but-stupid version of our link-checker is almost finished. The main thing left is defining the data structure that will hold the information on the bad links it discovers.

For that, we go back to the top of the script, just below the configuration section, and add the following:

my %bad_links;    # A "hash of arrays" with keys consisting of URLs
                  # under $start_base, and values consisting of lists 
                  # of bad links on those pages.

my %good;         # A hash mapping absolute filesystem paths to
                  # 0 or 1 (for bad or good). Used to cache the results
                  # of previous checks so they needn't be repeated for
                  # subsequent pages.

Here we’ve declared two new hashes that are going to be used in our script: %bad_links and %good. %good is fairly straightforward; we’re going to use it to store the results of testing the links our script processes. The keys of the %good hash are the local filesystem paths for the files we are checking (e.g., /w1/s/socalsail/index.html). A link that turns out to be bad (that is, one whose target file doesn't exist) gets a value of 0 stored under its path, while a good link gets a 1, so the answer can be reused the next time some page links to the same file.
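
To make the division of labor between the two hashes concrete, here is a rough sketch of how a checking routine could consult and fill them. The check_links name, the $page_url argument, and the -e file test are illustrative choices for this sketch, not necessarily what the finished script will use:

sub check_links {
    my ($page_url, @paths) = @_;    # absolute paths of a page's local links

    foreach my $path (@paths) {
        # Consult the cache first, so each file is tested only once
        # no matter how many pages link to it.
        $good{$path} = (-e $path ? 1 : 0) unless exists $good{$path};

        # %bad_links is a "hash of arrays": push each bad link onto
        # the list kept under the URL of the page it appeared on.
        push @{ $bad_links{$page_url} }, $path unless $good{$path};
    }
}

Because %good persists across calls, a file that dozens of pages link to is only tested the first time it is seen; every later page gets the cached answer.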
