Now, we could try excluding every kind of thing we know we don't want. We could exclude the mailto: link by excluding all URLs that start with mailto:; we could exclude the guest bio URLs by excluding URLs that contain guestinfo; we could exclude the "Previous" and "Next" links by ignoring any URLs with dayFA in them; and we could think of a way to exclude the image URLs. However, tomorrow the people at Fresh Air might add this to their general template:
<a href="buynow.html"><img alt="Buy the Terry Gross mug" src="/mug.jpg" width=450 height=90></a>
Because that isn't explicitly excluded, it would make its way through and appear as a segment link in every program listed.
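To make the weakness concrete, here is a minimal sketch of the exclusion approach in Python. The function name looks_excluded and the sample URLs (other than the .ram URL and buynow.html) are hypothetical, invented only for illustration; the three tests are the exclusions described above.

```python
def looks_excluded(url):
    """Return True if the URL matches one of the kinds we know we don't want."""
    if url.lower().startswith("mailto:"):
        return True          # the mailto: link
    if "guestinfo" in url:
        return True          # guest bio pages
    if "dayFA" in url:
        return True          # "Previous" and "Next" links
    return False


# Hypothetical hrefs standing in for what a program page might contain.
links = [
    "mailto:someone@example.org",
    "guestinfo/guest1.html",
    "index.cfm?dayFA=prev",
    "http://www.npr.org/ramfiles/fa/20010702.fa.ram",
    "buynow.html",           # the newly added template link
]

kept = [url for url in links if not looks_excluded(url)]
print(kept)
```

Under these assumptions, buynow.html survives the filter alongside the real segment link, because nothing in the exclusion list anticipated it.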
It is a valid approach to come up with criteria for the kinds of things we don't want to see, but it's usually easier to come up with criteria to capture what we do want to see. So this is what we'll do.
We could characterize the links we're after in several ways:
These links all contain a <font...> ... </font> sequence and a <b> ... </b> sequence.
They all have an <a ...> tag with an href attribute pointing to a URL.
The URL they point to looks like http://www.npr.org/ramfiles/fa/20010702.fa.ram. Notably, the URL's scheme is http, it's on the server www.npr.org, its path includes ramfiles, and it ends in .ram.
The (trimmed) link text up to </a> always begins with "Listen to ".
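As a rough illustration of how the URL and link-text criteria above could be combined into an inclusion test, here is a sketch using only the Python standard library. The names is_segment_url and SegmentLinkParser are hypothetical, the sample HTML is invented, and the .ram suffix check is an assumption based on the example URL; this is not the extractor developed in this chapter.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_segment_url(url):
    """URL criterion: http scheme, host www.npr.org, 'ramfiles' in path, .ram suffix."""
    parts = urlparse(url)
    return (
        parts.scheme == "http"
        and parts.netloc == "www.npr.org"
        and "ramfiles" in parts.path
        and parts.path.endswith(".ram")   # assumed suffix, taken from the example URL
    )


class SegmentLinkParser(HTMLParser):
    """Collect (url, text) pairs for <a href=...> links that meet the criteria."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._href = None    # href of the <a> element we are inside, if any
        self._text = []      # text pieces seen since that <a> opened

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Trim the link text and collapse internal runs of whitespace.
            text = " ".join("".join(self._text).split())
            if is_segment_url(self._href) and text.startswith("Listen to "):
                self.segments.append((self._href, text))
            self._href = None


# Invented sample markup for illustration only.
sample = """
<a href="mailto:someone@example.org">Write to us</a>
<a href="buynow.html"><img alt="Buy the Terry Gross mug" src="/mug.jpg"></a>
<a href="http://www.npr.org/ramfiles/fa/20010702.fa.ram">
  <font size="2"><b>Listen to the first segment</b></font>
</a>
"""

parser = SegmentLinkParser()
parser.feed(sample)
parser.close()
print(parser.segments)
```

Because the test describes what a segment link looks like rather than what it doesn't, the buynow.html link is rejected without anyone having to anticipate it.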
Now, of these, the first criterion is most reminiscent of the sort of things we did earlier with the BBC news extractor. But in this case, ...