HTML::TokeParser

As we said, if you want a more convenient interface to HTML parsing than HTML::Parser gives you, use one of its subclasses. HTML::TokeParser, by Gisle Aas, is one such module. Strictly speaking, it is a direct subclass of HTML::PullParser, which is in turn built on HTML::Parser. It is useful for tasks such as link extraction and HTML checking.

In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For example, the HTML <a href="http://url">link</a> breaks down as:

token: a
    attrib: href
    content: http://url
content: link
token: /a
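
The breakdown above is conceptual. To see the token stream itself, you can call get_token(  ) in a loop; each token it returns is an array reference whose first element is a type code ("S" for a start tag, "T" for text, "E" for an end tag, among others). The following is a minimal sketch, assuming the HTML-Parser distribution is installed; the @seen array and the output labels are our own illustration, not part of the module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $html   = '<a href="http://url">link</a>';
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser: $!";

my @seen;    # illustrative: collects one description per token
while (my $token = $parser->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: ["S", $tag, \%attr, \@attrseq, $text]
        my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
        push @seen, "start: $tag";
        push @seen, "  attrib: $_ = $attr->{$_}" for @$attrseq;
    }
    elsif ($type eq 'T') {
        # Text: ["T", $text, $is_data]
        push @seen, "text: $token->[1]";
    }
    elsif ($type eq 'E') {
        # End tag: ["E", $tag, $text]
        push @seen, "end: /$token->[1]";
    }
}
print "$_\n" for @seen;
```

Run against the one-link fragment, this should print:

start: a
  attrib: href = http://url
text: link
end: /a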

For example, you can use HTML::TokeParser to extract links from a string that contains HTML:

#!/usr/local/bin/perl -w

use strict;
use HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

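# get_tag(  ) tells TokeParser to skip ahead to the next tag with the given
# name; it returns an array reference whose second element is a hash of
# the tag's attributes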
while (my $token = $parser->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}
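
The same loop can be wrapped in a subroutine that returns the links instead of printing them. The following is a sketch only: extract_links is a hypothetical helper name of our own, and the sample HTML is made up for illustration. It skips <a> tags that have no href attribute (such as named anchors) and relies on get_trimmed_text(  ) to collapse runs of whitespace and to pass over nested tags such as <b>:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns a list of [url, text] pairs, one per
# <a> tag in the HTML string that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";
    my @links;
    while (my $tag = $parser->get_tag("a")) {
        my $url = $tag->[1]{href};
        next unless defined $url;    # e.g., <a name="top">
        my $text = $parser->get_trimmed_text("/a");
        push @links, [$url, $text];
    }
    return @links;
}

# Made-up sample input
my $html = '<p><a href="http://www.perl.org/">  The Perl   home page </a>, '
         . '<a name="top">an anchor</a>, and '
         . '<a href="http://www.cpan.org/"><b>CPAN</b></a></p>';

my @links = extract_links($html);
print "$_->[0] => $_->[1]\n" for @links;
```

Returning array references keeps each URL paired with its link text, so the caller can decide how to report or check the links.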
