Perl & LWP by Sean M. Burke

Regular Expression Techniques

Web pages are designed to be easy for humans to read, not for programs. Humans are very flexible in what they can read, and they can easily adapt to a new look and feel of the web page. But if the underlying HTML changes, a program written to extract information from the page can easily stop working. Your challenge when writing a data-extraction program is to get a feel for the amount of natural variation between the pages you'll want to download.

The following is a set of techniques for you to use when creating regular expressions to extract data from web pages. If you're an experienced Perl programmer, you probably know most or all of them and can skip ahead to Section 6.3.

Anchor Your Match

An important decision is how much surrounding text you put into your regular expression. Put in too much of this context and you run the risk of being too specific: the natural variation from page to page causes your program to fail to extract some information it should have been able to get. Conversely, put in too little context and you run the risk of your regular expression erroneously matching elsewhere on the page.
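
The trade-off can be sketched with three patterns run against two page variants. This is only an illustration: the page fragments, the "Ships in 24 hours" text, and the function names `rank_loose`, `rank_strict`, and `rank_anchored` are hypothetical, loosely modeled on a sales-rank line.

```perl
use strict;
use warnings;

# Two hypothetical variants of the same page; the second lacks the </font> tag.
my @pages = (
    '<p>Ships in 24 hours</p><b>Amazon.com Sales Rank: </b> 4,070 </font><br>',
    '<p>Ships in 24 hours</p><b>Amazon.com Sales Rank: </b> 4,070 <br>',
);

# Too little context: matches the first run of digits anywhere on the page.
sub rank_loose {
    my ($html) = @_;
    return $html =~ m{([\d,]+)} ? $1 : undef;
}

# Too much context: requires the exact tags seen on one sample page.
sub rank_strict {
    my ($html) = @_;
    return $html =~ m{<b>Amazon\.com Sales Rank: </b> ([\d,]+) </font><br>} ? $1 : undef;
}

# Moderate context: anchors only on the label just before the number.
sub rank_anchored {
    my ($html) = @_;
    return $html =~ m{Sales Rank: </b> ([\d,]+)} ? $1 : undef;
}

for my $html (@pages) {
    printf "loose=%s strict=%s anchored=%s\n",
        map { defined $_ ? $_ : 'no match' }
            rank_loose($html), rank_strict($html), rank_anchored($html);
}
```

On these two fragments, the loose pattern picks up the 24 from the shipping note, the strict pattern fails on the second variant, and the anchored pattern returns 4,070 for both. How much context is "moderate" still depends on the pages you actually see.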

Whitespace

Many HTML pages have whitespace added to make the source easier to read or as a side effect of how they were produced. For example, notice the spaces around the number in this line:

<b>Amazon.com Sales Rank: </b> 4,070 </font><br>

Without checking, it's hard to guess whether every page has that space. You could check, or you could simply be flexible ...
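
A minimal sketch of the flexible approach, assuming it means allowing optional whitespace with `\s*` wherever the sample line happened to contain a space. The function name `sales_rank` and the whitespace variants below are made up for illustration.

```perl
use strict;
use warnings;

# Return the sales rank from a page fragment, or nothing if the label is absent.
sub sales_rank {
    my ($html) = @_;
    # \s* accepts zero or more whitespace characters (spaces, tabs, newlines),
    # so the match does not depend on the exact spacing around the number.
    if ($html =~ m{Sales\s+Rank:\s*</b>\s*([\d,]+)}) {
        return $1;
    }
    return;
}

# Hypothetical variants of the same line with different whitespace.
my @variants = (
    '<b>Amazon.com Sales Rank: </b> 4,070 </font><br>',
    '<b>Amazon.com Sales Rank:</b>4,070</font><br>',
    "<b>Amazon.com Sales Rank: </b>\n  4,070\n</font><br>",
);

print sales_rank($_), "\n" for @variants;
```

All three variants yield 4,070. Because the capture group excludes whitespace, no trimming of the result is needed afterward.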
