Perl in a Nutshell, 2nd Edition

HTML::TokeParser

As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than what HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. While HTML::TokeParser is actually a subclass of HTML::PullParser, it can help you do many useful things, such as link extraction and HTML checking.

In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content, in which the HTML <a href="http://url">link</a> would break down as:

token: a
    attrib: href
content: http://url
content: link
token /a

For example, you can use HTML::TokeParser to extract links from a string that contains HTML:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah"My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

get_tag(  ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}

Get Perl in a Nutshell, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Perl in a Nutshell, 2nd Edition by Nathan Patwardhan, Ellen Siever, Stephen Spainhour

HTML::TokeParser

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly