Extracting

At this point, we are ready to move on to the next level: having the script extract just the links from those files or, more specifically, the values of all the SRC and HREF attributes.

Warning

As discussed in Chapter 4, trying to parse HTML files with simple pattern matching is an inherently error-prone undertaking. The accompanying example fails on several kinds of markup that are perfectly valid HTML but break the simplistic assumptions in this script. For a “correct” link checker that handles those variations more gracefully, see the example at the end of this chapter.
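To make the warning concrete, the following sketch runs the pattern used later in this section against a few snippets of markup. The sample strings, the labels, and the %found hash are hypothetical, invented only for this illustration, and the cases shown are probably not a complete list of what can go wrong.

```perl
use strict;
use warnings;

# The same pattern the script uses later in this section.
my $pattern = qr/(?:href|src)\s*=\s*"([^"]+)"/i;

# Hypothetical sample markup, made up for this illustration.
my @samples = (
    [ 'double-quoted' => '<a href="page.html">found</a>' ],
    [ 'single-quoted' => "<a href='page.html'>missed</a>" ],
    [ 'unquoted'      => '<img src=logo.gif>' ],
    [ 'commented out' => '<!-- <a href="old.html">gone</a> -->' ],
);

my %found;
foreach my $sample (@samples) {
    my ($label, $html) = @$sample;

    # In list context with /g, the match returns every captured value.
    my @targets = ($html =~ /$pattern/g);
    $found{$label} = [@targets];

    printf "%-14s %s\n", $label,
        @targets ? join(', ', @targets) : '(nothing found)';
}
```

The single-quoted and unquoted attribute values are skipped because the pattern insists on double quotes, while the link inside the HTML comment is reported even though a browser would never follow it.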

We begin by deleting the line from the end of the &process subroutine that prints out the current filename and the entire contents of the $data variable, and replacing it with the following chunk of code:

my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
print "In file $file, found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}

Let’s concentrate on that first line. It looks challenging, but assuming you’ve been doing your regular expressions homework, it’s really not that tough.

The first thing to focus on is the regular expression search pattern itself: /(?:href|src)\s*=\s*"([^"]+)"/gi. Reading from left to right, this pattern says to match a string that begins with either href or src, then has zero or more whitespace characters, then an equals sign (=), then zero or more whitespace characters, then a double quote ("), then ...
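If you would like to experiment with the pattern outside the script, a small standalone sketch along these lines may help. The contents of $data here are a hypothetical stand-in for a real HTML file, and the example.com address is a placeholder.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of one HTML file.
my $data = <<'HTML';
<a HREF = "about.html">About</a>
<img src="images/logo.gif" alt="Logo">
<a href="http://www.example.com/">Example</a>
HTML

# The /i modifier ignores case (so HREF matches), and /g in list
# context returns the captured value from every match in $data.
my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);

print "Found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}
```

Note that the alt attribute is left out of the results, because only attributes named href or src satisfy the start of the pattern.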
