Checking Remote Links

Example 11-3 shows link_check2.plx, an enhanced version of the link-checking script that gives us the option of checking offsite links. The parts of this script that differ from the previous version have been emphasized.

Example 11-3. Link-checking script with offsite checking

#!/usr/bin/perl -w

# link_check2.plx

# This is a modified HTML link checker.
# It descends recursively from $start_dir, processing
# all .htm or .html files to extract HREF and SRC
# attributes, then checks all that point to a local
# file to confirm that the file actually exists, and optionally
# uses LWP::Simple to do a HEAD check on remote ones for the
# same purpose. It then reports on the bad links.

use strict;
use File::Find;
use LWP::Simple;

# note: the first four configuration variables should *not*
# have a trailing slash (/)

my $start_dir   = '/w1/s/socalsail/expo'; # where to begin looking
my $hostname    = 'www.socalsail.com';    # this site's hostname
my $web_root    = '/w1/s/socalsail';      # path to www doc root
my $web_path    = '/expo';                # web path to $start_dir
my $webify       = 1;                     # produce web-ready output?
my $check_remote = 1;                     # check offsite links?

my %bad_links; # a "hash of lists" with keys consisting of filenames,
               # values consisting of lists of bad links in those files

my %good;      # A hash mapping absolute filenames (or remote URLs) to
               # 0 or 1 (for good or bad). Used to cache the results of
               # previous checks.

find(\&process, $start_dir); # this loads up the above hashes

if ($webify) ...
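To make the remote check described in the header comments more concrete, here is a minimal sketch of how a cached HEAD test might look. It is not the book's code: remote_ok and its optional $checker argument are hypothetical names introduced for illustration, and the sketch assumes that a value of 1 in %good marks a link that checked out. The default checker relies on the head function from LWP::Simple, which returns a true value in scalar context when the request succeeds.

```perl
#!/usr/bin/perl -w
use strict;

my %good;   # cache: URL => 1 (reachable) or 0 (not); this meaning is assumed

# Hypothetical helper, not taken from the listing. $checker is an
# optional code ref so the caching logic can be exercised offline;
# by default it loads LWP::Simple and calls its head() function.
sub remote_ok {
    my ($url, $checker) = @_;

    return $good{$url} if exists $good{$url};    # reuse an earlier result

    $checker ||= sub {
        require LWP::Simple;
        return scalar LWP::Simple::head($_[0]);  # true if the HEAD succeeds
    };

    $good{$url} = $checker->($url) ? 1 : 0;
    return $good{$url};
}

# Offline demonstration with a stand-in checker (no network needed).
my $calls = 0;
my $stub  = sub { $calls++; return $_[0] !~ /missing/ };

for my $url ('http://www.example.com/',
             'http://www.example.com/',
             'http://www.example.com/missing.html') {
    printf "%-40s %s\n", $url, remote_ok($url, $stub) ? 'ok' : 'BAD';
}
print "checker ran $calls times for 3 lookups\n";
```

Because the second lookup of the repeated URL is answered from %good, the stand-in checker is not called again for it. Calling remote_ok($url) with no second argument would issue a real HEAD request, so a page that links to the same offsite URL many times would cost only one network round trip.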
