Chapter 10. Modifying HTML with Trees

In Chapter 9, we saw how to extract information from HTML trees. But that’s not the only thing you can use trees for. HTML::TreeBuilder trees can be altered and can even be written back out as HTML, using the as_HTML( ) method. There are four ways in which a tree can be altered: you can alter a node’s attributes; you can delete a node; you can detach a node and reattach it elsewhere; and you can add a new node. We’ll treat each of these in turn.
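As a rough preview, the four kinds of alteration can be sketched in a few lines. This is only a minimal sketch, not the chapter's own example: the sample markup and the id values (a, b, box) are made up for illustration, and it assumes the standard HTML::Element methods attr, detach, push_content, delete, and new.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical sample document; the ids exist only for this illustration.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><div id="a">One</div><div id="b">Two</div>'
           . '<div id="box"></div></body></html>');
$tree->eof;

my $first  = $tree->look_down('id', 'a');
my $second = $tree->look_down('id', 'b');
my $box    = $tree->look_down('id', 'box');

# 1. Alter a node's attributes.
$first->attr('class', 'intro');

# 2. Delete a node (and everything under it).
$second->delete;

# 3. Detach a node and reattach it elsewhere.
$first->detach;
$box->push_content($first);

# 4. Add a new node.
my $new = HTML::Element->new('em');
$new->push_content('Three');
$box->push_content($new);

# Write the altered tree back out as HTML.
my $html = $tree->as_HTML;
print $html, "\n";
```

Each of these steps is treated in more detail in the sections that follow.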

Changing Attributes

Suppose that in your new role as fixer of large sets of HTML documents, you are given a bunch of documents that have headings like this:

<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>

that need to be changed like this:

<h2 class=scream>Free Monkey</h2>
<h4 class=mutter>Inquire Within</h4>

Before you start phrasing this in terms of HTML::Element methods, you should consider whether this can be done with a search-and-replace operation in an editor. In this case, it cannot, because you’re not just changing every <h3 align=center> to <h2 class=scream> and every <h3 color=red> to <h4 class=mutter> (each of which, on its own, is a simple search-and-replace operation); you also have to change each </h3> to </h2> or to </h4>, depending on what you did to the element that it closes. That sort of context dependency puts this well outside the realm of simple search-and-replace operations. One could try to implement this with HTML::TokeParser, reading every token and printing it back out, after ...
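For comparison, the heading change described above can be sketched with a tree. This is a minimal sketch under stated assumptions, not necessarily the approach the chapter goes on to develop: the input string is a made-up stand-in for a real document, and the code relies on the HTML::Element methods look_down, tag, and attr (where passing undef as the value deletes the attribute).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; a real script would parse each file instead.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h3 align=center>Free Monkey</h3>'
           . '<h3 color=red>Inquire Within</h3>'
           . '<h3>Untouched</h3></body></html>');
$tree->eof;

# look_down in list context returns every matching element.
foreach my $h3 ($tree->look_down('_tag', 'h3')) {
    my $align = $h3->attr('align');
    my $color = $h3->attr('color');

    if (defined $align and lc($align) eq 'center') {
        $h3->tag('h2');               # the end tag follows automatically
        $h3->attr('align', undef);    # undef deletes the attribute
        $h3->attr('class', 'scream');
    }
    elsif (defined $color and lc($color) eq 'red') {
        $h3->tag('h4');
        $h3->attr('color', undef);
        $h3->attr('class', 'mutter');
    }
    # Any other h3 is left as it is.
}

my $html = $tree->as_HTML;
print $html, "\n";
$tree->delete;    # free the tree's memory when done
```

Because the start tag and end tag are both generated from the same element, renaming the element keeps them consistent, which is exactly the context dependency that a plain search-and-replace cannot handle.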
