Use regular expressions to scrape data from sources like Metacritic.
What do you do when you want the data from a site, but the site won't let you export that data in a predictable format (like XML [Hack #38] or CSV [Hack #43])? One popular option is to perform what's called a screen scrape on the HTML to extract the data. Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from the string or file.
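The two steps just described (download into a string, then apply a regular expression) can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_page()` and `extract_all()` are hypothetical helper names, the download relies on `file_get_contents()` being allowed to open URLs (the `allow_url_fopen` setting), and the extraction is demonstrated on a made-up inline string rather than on a live site.

```php
<?php
// Step 1: download the page into a string in memory.
// fetch_page() is a hypothetical helper; file_get_contents() can only
// read URLs when the allow_url_fopen setting is enabled.
function fetch_page(string $url): string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not download $url");
    }
    return $html;
}

// Step 2: run a regular expression over the string and return the
// first capture group of every match.
function extract_all(string $pattern, string $html): array
{
    if (preg_match_all($pattern, $html, $matches) === false) {
        throw new RuntimeException('Regular expression failed');
    }
    return $matches[1] ?? [];
}

// Demonstrated on a made-up inline page so the sketch runs offline.
$sample = '<html><head><title>Example page</title></head><body></body></html>';
$titles = extract_all('/<title>(.*?)<\/title>/is', $sample);
echo $titles[0], "\n";
```

For a real page, the inline string would be replaced by the result of `fetch_page()` called with the page's URL; the extraction step stays the same.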
You can scrape almost any web site for data; for the example in this hack, I chose the Metacritic DVD review page (http://www.metacritic.com/video/).
Figure 5-9. The resulting generated PHP
Metacritic is a site where movies, music, and video games are given a review score based on a selection of reviews. Figure 5-10 shows the Metacritic page that I scraped for this hack. On the lefthand side of the window is a list of movies ordered by name, along with their review scores.
I can tell from the size of the page that I want only a small portion of the HTML. I use View Source to see what the code looks like, and indeed there is a section for these scores, well defined by a div tag, that contains what I'm looking for:
<DIV ID="sortbyname1"> <P CLASS="listing"> <SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/800bullets">800 ...
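A pattern for markup of this shape might look like the sketch below. It is not necessarily the expression used later in this hack. The sample HTML is made up to resemble the listing above: the titles, link paths, closing tags, and the second CLASS value are placeholders, since the real listing is truncated here.

```php
<?php
// Made-up sample in the shape of the listing above. The titles, paths,
// closing tags, and the "green" CLASS value are placeholders, not data
// taken from the real page.
$html = <<<HTML
<DIV ID="sortbyname1">
<P CLASS="listing"> <SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/example1">Example Title One</A> </P>
<P CLASS="listing"> <SPAN CLASS="green">88</SPAN> <A HREF="/video/titles/example2">Example Title Two</A> </P>
</DIV>
HTML;

// Capture the score, the link path, and the title of each listing.
// The i flag makes the match case-insensitive, so lowercase tags
// would match as well.
$pattern = '/<SPAN CLASS="[^"]*">(\d+)<\/SPAN>\s*<A HREF="([^"]*)">([^<]*)<\/A>/i';

$movies = [];
// PREG_SET_ORDER groups the captures by match: one array per listing.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $movies[] = [
            'score' => (int) $m[1],
            'path'  => $m[2],
            'title' => trim($m[3]),
        ];
    }
}

// Print one line per movie: score, then title.
foreach ($movies as $movie) {
    printf("%3d  %s\n", $movie['score'], $movie['title']);
}
```

Matching any CLASS value on the SPAN, rather than only "yellow", is a deliberate guess that the page may use other class names for other scores; if it does not, the pattern still works unchanged.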