Perl & LWP by Sean M. Burke

Processing

Once you have parsed some HTML, you need to process it. Exactly what you do will depend on the nature of your problem. Two common models are extracting information and producing a transformed version of the HTML (for example, to remove banner advertisements).

Whether extracting or transforming, you'll probably want to find the bits of the document you're interested in. They might be all headings, all bold italic regions, or all paragraphs with class="blinking". HTML::Element provides several methods for searching the tree.

Methods for Searching the Tree

In scalar context, these methods return the first node that satisfies the criteria. In list context, all such nodes are returned. The methods can be called on the root of the tree or any node in it.
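To make the scalar/list distinction concrete, here is a minimal sketch using find_by_tag_name (described below). The sample HTML and the variable names are invented for illustration, and the sketch assumes the tree was built with HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document invented for this illustration.
my $root = HTML::TreeBuilder->new_from_content(
    '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>'
);

# Scalar context: only the first matching node comes back.
my $first      = $root->find_by_tag_name('h1');
my $first_text = $first->as_text;

# List context: every matching node comes back.
my @all = $root->find_by_tag_name('h1');

# The same call works from any node in the tree, not just the root.
my $body      = $root->find_by_tag_name('body');
my @from_body = $body->find_by_tag_name('h1');

print "first heading: $first_text\n";
print "all headings:  ", scalar(@all), "\n";
print "under body:    ", scalar(@from_body), "\n";
```

Assigning to a scalar variable versus an array is what selects the context, so the same method call serves both "give me one" and "give me all".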

$node->find_by_tag_name( tag [, ...])

Returns the node(s) whose tag name is any of those listed. For example, to find all h1 and h2 nodes:

@headings = $root->find_by_tag_name('h1', 'h2');
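As a fuller sketch of the same call, the following builds a small tree and prints an outline of its h1 and h2 headings. The sample document and the @outline variable are made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document invented for this illustration.
my $root = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<h1>Introduction</h1>
<p>Some text.</p>
<h2>Details</h2>
<h3>Fine print</h3>
</body></html>
HTML

# List context: collect every h1 and h2 (the h3 is not requested).
my @headings = $root->find_by_tag_name('h1', 'h2');

# Build one "tag: text" line per heading found.
my @outline = map { $_->tag . ': ' . $_->as_text } @headings;
print "$_\n" for @outline;
```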

$node->find_by_attribute( attribute, value )

Returns the node(s) with the given attribute set to the given value. For example, to find all nodes with class="blinking":

@blinkers = $root->find_by_attribute("class", "blinking");
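Here is a self-contained sketch of that search, with an invented sample document. Note that the value is compared as a whole string, so in this sketch an element with class="blinking urgent" is not expected to match.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document invented for this illustration.
my $root = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<h2 class="blinking">Sale!</h2>
<p class="blinking">Ends today.</p>
<p>Ordinary paragraph.</p>
<p class="blinking urgent">Different attribute value.</p>
</body></html>
HTML

# List context: every node whose class attribute is exactly "blinking".
my @blinkers = $root->find_by_attribute('class', 'blinking');

# Scalar context: just the first such node.
my $first_blinker = $root->find_by_attribute('class', 'blinking');

my @report = map { $_->tag . ': ' . $_->as_text } @blinkers;
print "$_\n" for @report;
```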

$node->look_down(...)
$node->look_up(...)

These two methods look for nodes that match whatever criteria you specify. look_down searches $node and its descendants (children, children's children, and so on); look_up searches $node and its ancestors (parent, parent's parent, and so on). The parameters are either attribute => value pairs (where ...
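The attribute => value form can be sketched as follows. The sample document is invented, and the use of the '_tag' key to match on the tag name comes from the HTML::Element documentation rather than from the text above, so treat it as an assumption of this sketch.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document invented for this illustration.
my $root = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<div class="menu"><ul><li><a href="/home">Home</a></li></ul></div>
<p class="blinking">Hello</p>
</body></html>
HTML

# look_down: search $root and everything beneath it.
# '_tag' is the key under which HTML::Element stores the tag name.
my $link = $root->look_down('_tag' => 'a', 'href' => '/home');

# look_up: start at $link and walk up through its ancestors.
my $menu = $link->look_up('_tag' => 'div', 'class' => 'menu');

# No ancestor of $link is a table, so nothing is found.
my $table = $link->look_up('_tag' => 'table');

print "link text:  ", $link->as_text, "\n";
print "menu class: ", $menu->attr('class'), "\n";
print "table found? ", (defined $table ? "yes" : "no"), "\n";
```

All the pairs given must match for a node to qualify, which is why the first call finds only an a element whose href is exactly /home.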
