Perl & LWP by Sean M. Burke

Example: BBC News

In Chapter 7, we considered the task of extracting the headline link URLs from the BBC News main page, and we implemented it in terms of HTML::TokeParser. Here, we'll consider the same problem from the perspective of HTML::TreeBuilder.

To review the problem: when you look at the source of http://news.bbc.co.uk, you discover that each headline link is wrapped in one of two kinds of code. There are a lot of headlines expressed with code like this:

<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
  
<B CLASS="h3"><A href="/hi/english/uk_politics/newsid_1576000/1576541.stm">Euro
battle revived by Blair speech</A></B><BR>

and some headlines expressed with code like this:

<A href="/hi/english/business/newsid_1576000/1576636.stm">
  <B class="h2"> Swissair shares wiped out</B><BR>
</A>

<A href="/hi/english/world/middle_east/newsid_1576000/1576113.stm">
  <B class="h1">Mid-East blow to US anti-terror drive</B><BR>
</A>

(Note that in this second case, the B element's class value can be h1 or h2.)

In both cases, we can find what we want by first looking for B elements. We then look for the href attribute either on the A element that's a child of this B element, or on the A element that's this B element's parent. Whether we look for a parent A node or a child A node depends on the class attribute of the B element. To make sure we're on the right track, we can code up something to formalize our idea of what sorts of nodes ...
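The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily the code the chapter goes on to develop: the helper name headline_hrefs is hypothetical, the sample markup is copied from the two headline styles shown earlier, and the sketch assumes that an h3 headline keeps its A element as a direct child of the B element, while an h1 or h2 headline has the A element as the B element's direct parent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup copied from the two headline styles shown above.
my $html = <<'END_HTML';
<html><body>
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
<A href="/hi/english/business/newsid_1576000/1576636.stm">
  <B class="h2"> Swissair shares wiped out</B><BR>
</A>
</body></html>
END_HTML

# Hypothetical helper: return the href values of the headline links
# found under the given tree node.
sub headline_hrefs {
    my ($root) = @_;
    my @hrefs;
    foreach my $bold ($root->look_down(_tag => 'b')) {
        my $class = $bold->attr('class');
        next unless defined $class;
        my $link;
        if ($class eq 'h3') {
            # First kind: the A element is a child of the B element.
            # Text children are plain strings, so skip anything that
            # is not an element object.
            ($link) = grep { ref $_ and $_->tag eq 'a' } $bold->content_list;
        }
        elsif ($class eq 'h1' or $class eq 'h2') {
            # Second kind: the A element is the parent of the B element.
            my $parent = $bold->parent;
            $link = $parent if $parent and $parent->tag eq 'a';
        }
        next unless $link;
        my $href = $link->attr('href');
        push @hrefs, $href if defined $href;
    }
    return @hrefs;
}

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @hrefs = headline_hrefs($tree);
print "$_\n" for @hrefs;
$tree->delete;    # free the tree explicitly when done
```

The values collected here are the relative URLs exactly as they appear in the href attributes. To turn them into absolute URLs, one option is to resolve each against the page's base URL with URI->new_abs from the URI module.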
