HTML::TokeParser
As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than the one HTML::Parser gives you. HTML::TokeParser, by Gisle Aas, is one such module. Although HTML::TokeParser is actually a subclass of HTML::PullParser (which is itself built on HTML::Parser), it can help you do many useful things, such as link extraction and HTML checking.
In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For example, the HTML <a href="http://url">link</a> would break down as:
token: a
attrib: href
content: http://url
content: link
token: /a
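The breakdown above can be reproduced with the parser's get_token( ) method, which returns one array reference per token. The following is a minimal sketch, not a definitive implementation: the describe_tokens function and its output labels are hypothetical names chosen for illustration, and it handles only start tags ("S"), end tags ("E"), and text ("T").

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# describe_tokens is a hypothetical helper written for this example.
# It returns one human-readable line per token or attribute found.
sub describe_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @out;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start tag token: ["S", $tag, $attr, $attrseq, $text]
            my ($tag, $attr, $attrseq) = @{$token}[1 .. 3];
            push @out, "start tag: $tag";
            push @out, "  attribute: $_ = $attr->{$_}" for @$attrseq;
        }
        elsif ($type eq 'E') {
            # End tag token: ["E", $tag, $text]
            push @out, "end tag: $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text token: ["T", $text, $is_data]
            push @out, "text: $token->[1]";
        }
        # Comments, declarations, and processing instructions are skipped.
    }
    return @out;
}

print "$_\n" for describe_tokens('<a href="http://url">link</a>');
```

Note that get_token( ) reports the end tag's name as a, whereas get_tag( ) reports end tags with a leading slash, as in /a.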
For example, you can use HTML::TokeParser to extract links from a string that contains HTML:
#!/usr/local/bin/perl -w
require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';

my $parser = HTML::TokeParser->new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    # $token->[1] is a hash reference holding the tag's attributes
    my $url = $token->[1]{href} || "-";
    # Collect the text up to the closing </a>, with whitespace trimmed
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}