Chapter 13. HTML::Parser

Ken MacFarlane

Tip

Since the original publication of this article, the HTML::Parser module has continued to evolve (Version 3.25 as of this update), enabling you to develop powerful parsing tools with a minimum of coding. For readers who are using this wonderful tool for the first time, the examples here should provide a feel for basic HTML parsing techniques, which can then be extended to meet your needs. This article may also be useful for those new to object-oriented programming (I once was myself!), as it covers the concept of subclassing.

Perl is often used to manipulate the HTML files constituting web pages. For instance, one common task is removing tags from an HTML file to extract the plain text. Many solutions for such tasks use regular expressions, which often end up complicated, unattractive, and incomplete (or wrong). The alternative, described here, is to use the HTML::Parser module available on CPAN. HTML::Parser is an object-oriented module, so it requires some extra explanation for casual users.

HTML::Parser works by scanning HTML input and breaking it up into segments according to how a browser would interpret the text. For instance, this input:

<A HREF="index.html">This is a link</A>

would be broken up into three segments: a start tag (<A HREF="index.html">), text (This is a link), and an end tag (</A>).

As each segment is detected, the parser passes it to an appropriate subroutine. ...
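To make the segment-by-segment behavior concrete, here is a minimal sketch that feeds the example link to the parser and records each segment as it is reported. It assumes HTML::Parser version 3 is installed from CPAN, and it uses that version's handler-based interface (the start_h, text_h, and end_h options) rather than the subclassing approach this article covers. The subroutine names on_start, on_text, and on_end and the @segments array are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Each handler appends a description of its segment to this list.
my @segments;

# Called for a start tag: receives the lowercased tag name and a
# hash reference of attributes (attribute names are lowercased too).
sub on_start {
    my ($tagname, $attr) = @_;
    my $attrs = join ' ', map { qq{$_="$attr->{$_}"} } sort keys %$attr;
    push @segments, "start: $tagname" . (length $attrs ? " ($attrs)" : '');
}

# Called for text between tags: receives the entity-decoded text.
sub on_text {
    my ($text) = @_;
    push @segments, "text: $text";
}

# Called for an end tag: receives the lowercased tag name.
sub on_end {
    my ($tagname) = @_;
    push @segments, "end: $tagname";
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname, attr' ],
    text_h      => [ \&on_text,  'dtext' ],
    end_h       => [ \&on_end,   'tagname' ],
);

# Ask for each run of text to be delivered as a single chunk.
$parser->unbroken_text(1);

$parser->parse('<A HREF="index.html">This is a link</A>');
$parser->eof;    # signal end of input so pending text is flushed

print "$_\n" for @segments;
```

Run against the example input, this should report three segments in order: a start tag with its href attribute, the link text, and an end tag. The same three events arrive whether you register handlers as above or override methods in a subclass; only the way the subroutines are attached differs.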
