The HTML Modules

The HTML modules provide an interface to parse HTML documents. After you parse the document, you can print or display it according to the markup tags, or you can extract specific information such as hyperlinks.

The HTML::Parser module provides the base class for the usable HTML modules. It provides methods for reading in HTML text from either a string or a file and then separating out the syntactic structures and data. As a base class, Parser does virtually nothing on its own. The other modules call it internally and override its empty methods for their own purposes. However, the HTML::Parser class is useful to you if you want to write your own classes for parsing and formatting HTML.
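As a rough illustration of overriding the parser's empty methods, the sketch below subclasses HTML::Parser and overrides only start, which the parser calls for each opening tag. The class name LinkCollector, the links hash key, and the sample HTML string are invented for this example. It assumes the method-callback behavior you get when the constructor is called with no arguments.

```perl
use strict;
use warnings;

# Hypothetical subclass (name invented for this sketch): it overrides
# the otherwise empty start() method to remember href values of <a> tags.
package LinkCollector;
use parent 'HTML::Parser';

sub start {
    my ($self, $tag, $attr) = @_;    # tag name (lowercased) and attribute hash
    push @{ $self->{links} }, $attr->{href}
        if $tag eq 'a' && defined $attr->{href};
}

package main;

# Calling new() with no arguments selects the method-callback style.
my $parser = LinkCollector->new;
$parser->parse('<p>See <a href="http://www.perl.com/">Perl</a> and <a href="/docs">docs</a>.</p>');
$parser->eof;                        # signal the end of the document

my @links = @{ $parser->{links} || [] };
print "$_\n" for @links;
```

The text and end methods can be overridden in the same way if the subclass also needs the document's plain text or closing tags.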

HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as a container element with beginning and end tags, is stored relative to other elements. This preserves the nested, hierarchical structure of the HTML.

A syntax tree built by the TreeBuilder class is formed of connected nodes that represent each element of the HTML document. These nodes are stored as objects of the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.
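The tree and its element nodes can be explored with a short script. This is a minimal sketch rather than a complete program. The sample HTML string and the variable names are made up for the example, and it uses only a few HTML::Element methods (look_down, as_text, attr, parent, tag, and content_list).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document, invented for this sketch.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Read the <a href="http://www.perl.com/">Perl</a> page.</p></body></html>';

# Parse the string into a tree of HTML::Element nodes.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Find the first <title> element and get the text it contains.
my $title      = $tree->look_down(_tag => 'title');
my $title_text = $title->as_text;

# Find the first <a> element, then read an attribute, its enclosing
# element, and its children (text children are plain strings).
my $anchor     = $tree->look_down(_tag => 'a');
my $href       = $anchor->attr('href');
my $parent_tag = $anchor->parent->tag;
my @children   = $anchor->content_list;

print "title: $title_text\n";
print "link: $href (inside <$parent_tag>)\n";

$tree->delete;    # release the tree once it is no longer needed
```

Each node reached this way is an HTML::Element object, so the same methods apply at every level of the tree.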

The remaining classes of the HTML modules use the syntax tree and its nodes of element objects to output useful information from the HTML documents. The format classes, such as ...
