Web pages are designed to be easy for humans to read, not for programs. Humans are very flexible in what they can read, and they can easily adapt to a new look and feel of the web page. But if the underlying HTML changes, a program written to extract information from the page will no longer work. Your challenge when writing a data-extraction program is to get a feel for the amount of natural variation between pages you'll want to download.
The following is a set of techniques for you to use when creating regular expressions to extract data from web pages. If you're an experienced Perl programmer, you probably know most or all of them and can skip ahead to Section 6.3.
An important decision is how much surrounding text you put into your regular expression. Put in too much of this context and you run the risk of being too specific: the natural variation from page to page causes your program to fail to extract some information it should have been able to get. Conversely, put in too little context and you run the risk of your regular expression erroneously matching elsewhere on the page.
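To make the trade-off concrete, here is a minimal sketch comparing three patterns with differing amounts of context. The page fragment, the "Pages" line, the spacing variant, and the helper name first_match are all invented for illustration; only the general shape of the markup is taken from a real page.

```perl
use strict;
use warnings;

# Hypothetical page fragment. The "Pages" line is invented to show how a
# pattern with too little context can latch onto the wrong number.
my $page = "<b>Pages: </b> 320 </font><br>\n"
         . "<b>Amazon.com Sales Rank: </b> 4,070 </font><br>\n";

# Hypothetical variant of the same line with slightly different spacing.
my $variant = '<b>Amazon.com Sales Rank:</b> 4,070</font><br>';

# Too little context: any number following any </b> will match.
my $too_little = qr{</b>\s*([\d,]+)};

# Too much context: every space and tag must appear exactly as written.
my $too_much = qr{<b>Amazon\.com Sales Rank: </b> ([\d,]+) </font><br>};

# A middle ground: anchor on the label, tolerate variable whitespace.
my $balanced = qr{Sales Rank:\s*</b>\s*([\d,]+)};

# Hypothetical helper: return the first capture, or undef if no match.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $test (
    [ 'too little, page',  $too_little, $page    ],
    [ 'too much, page',    $too_much,   $page    ],
    [ 'too much, variant', $too_much,   $variant ],
    [ 'balanced, page',    $balanced,   $page    ],
    [ 'balanced, variant', $balanced,   $variant ],
) {
    my ($label, $re, $text) = @$test;
    my $got = first_match($re, $text);
    printf "%-18s => %s\n", $label, defined $got ? $got : '(no match)';
}
```

With these inputs, the under-specified pattern picks up the page count instead of the sales rank, and the over-specified pattern fails on the variant. Where the right balance lies for a real site depends on how much its pages actually vary.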
Many HTML pages have whitespace added to make the source easier to read or as a side effect of how they were produced. For example, notice the spaces around the number in this line:
<b>Amazon.com Sales Rank: </b> 4,070 </font><br>
Without checking, it's hard to guess whether every page has those spaces. You could check, or you could simply be flexible ...
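One way to be flexible, sketched here under the assumption that the only thing that varies is whitespace, is to put \s* at each spot where a space may or may not appear. The function name extract_rank and the second sample line (with its spaces removed) are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the sales rank out of a line of HTML,
# tolerating optional whitespace around the number.
sub extract_rank {
    my ($html) = @_;

    # \s* matches zero or more whitespace characters, so the pattern
    # works whether or not the page puts spaces at these positions.
    if ($html =~ m{Sales Rank:\s*</b>\s*([\d,]+)\s*</font>}) {
        my $rank = $1;
        $rank =~ tr/,//d;    # drop the thousands separators
        return $rank;
    }
    return undef;            # no sales rank found on this line
}

my @samples = (
    '<b>Amazon.com Sales Rank: </b> 4,070 </font><br>',  # as on the page
    '<b>Amazon.com Sales Rank:</b>4,070</font><br>',     # invented variant
);

for my $html (@samples) {
    my $rank = extract_rank($html);
    print defined $rank ? "Rank: $rank\n" : "No rank found\n";
}
```

Both sample lines yield the same rank, so the program does not need to know in advance which spacing a given page uses.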