Chapter 7. HTML Processing with Tokens

Regular expressions are powerful, but they’re a painfully low-level way of dealing with HTML. You’re forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we’ll use HTML::TokeParser to extract information from HTML files. With these techniques, you can extract information from any HTML file without ever again worrying about the character-level trivia of HTML markup.

HTML as Tokens

Your experience with HTML code probably involves seeing raw text such as this:

<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!

The HTML::TokeParser module divides the HTML into units of parsing called tokens. The above source code is parsed as this series of tokens:

start-tag token

p with no attributes

text token

Dear Diary,\n

start-tag token

br with no attributes

text token

I'm gonna be a superstar, because I'm learning to play\nthe (ending with a space)

start-tag token

a, with attribute href whose value is http://MyBalalaika.com

text token

balalaika

end-tag token

a

text token

&amp; the (with a space on either side), which means & the

start-tag token

a, with attribute href whose value is http://MyBazouki.com

text token

bazouki

end-tag token

a

text token

!!!\n
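If you want to see that series of tokens for yourself, a short script can print it. What follows is only a minimal sketch, assuming the HTML::TokeParser and HTML::Entities modules (both from the HTML-Parser distribution on CPAN) are installed. The describe_tokens routine and its output format are hypothetical names invented for this illustration and are not part of either module.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# The diary fragment from the text, held in a single string.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# describe_tokens() is a hypothetical helper for this sketch.
# It returns one human-readable line per token in the source.
sub describe_tokens {
    my ($source) = @_;

    # Passing a reference to a string parses that string as the document.
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";

    my @lines;
    while (my $token = $stream->get_token) {
        my ($type, @rest) = @$token;
        if ($type eq 'S') {
            # Start-tag: tag name, attribute hash, attribute order.
            my ($tag, $attr, $attrseq) = @rest;
            my $attrs = join ', ', map { "$_=$attr->{$_}" } @$attrseq;
            push @lines, "start-tag: $tag"
                . (length($attrs) ? " ($attrs)" : " (no attributes)");
        }
        elsif ($type eq 'E') {
            push @lines, "end-tag: $rest[0]";
        }
        elsif ($type eq 'T') {
            # Text arrives undecoded, so turn &amp; into & ourselves.
            my $text = decode_entities($rest[0]);
            $text =~ s/\n/\\n/g;    # show newlines as a visible \n
            push @lines, "text: [$text]";
        }
        else {
            # Comments, declarations, and processing instructions.
            push @lines, "other ($type)";
        }
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Run against the diary fragment, this should print twelve lines, one for each token in the series above. The square brackets around each piece of text make its leading and trailing spaces visible, and the two href values come out in the same form even though one was written with double quotes and the other with single quotes.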

This representation is more abstract: it focuses on markup concepts rather than individual characters. So whereas ...
