PHP Hacks by Jack D. Herrington

Hack #44. Scrape Web Pages for Data

Use regular expressions to scrape data from sources like Metacritic.

What do you do when you want the data from a site, but the site won't let you export that data in a predictable format (like XML [Hack #38] or CSV [Hack #43])? One popular option is to perform what's called a screen scrape on the HTML to extract the data. Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from the string or file.
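The two steps just described (download the page into a string, then run a regular expression over it) can be sketched roughly as follows. This is a minimal illustration, not the hack's own code: the function names fetch_page and scrape_titles, the h2 pattern, and the stand-in page are all assumptions made for the example. It also assumes PHP 7.1 or later, and that allow_url_fopen is enabled if fetch_page is called with a real URL.

```php
<?php
// Step 1: download the page into a string in memory.
// Hypothetical helper; returns null if the download fails.
function fetch_page(string $url): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Step 2: run a regular expression over the string and collect the captures.
// The pattern is an assumption for illustration: it captures the text of
// every <h2> element, case-insensitively and across line breaks.
function scrape_titles(string $html): array
{
    preg_match_all('/<h2[^>]*>(.*?)<\/h2>/is', $html, $matches);
    return array_map('trim', $matches[1]);
}

// Stand-in for a downloaded page, so the sketch runs without network access.
$html = '<html><body><h2>First</h2><p>x</p><h2> Second </h2></body></html>';
print_r(scrape_titles($html));
```

Keeping the download and the extraction in separate functions makes it easy to test the regular expression against a saved copy of the page before pointing it at the live site.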

You can scrape almost any web site for data; for the example in this hack, I chose the Metacritic DVD review page (http://www.metacritic.com/video/).

Figure 5-9. The resulting generated PHP

Metacritic is a site where movies, music, and video games are given a review score based on a selection of reviews. Figure 5-10 shows the Metacritic page that I scraped for this hack. On the lefthand side of the window is a list of movies ordered by name, along with their review scores.

I can tell from the size of the page that I want only a small portion of the HTML. I use View Source to see what the code looks like, and indeed the scores sit in a well-defined section, wrapped in a div tag that contains exactly what I'm looking for:

	</TR>
	</TABLE><DIV ID="sortbyname1"> <P CLASS="listing"> <SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/800bullets">800 ...
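One way to pull scores out of markup shaped like the fragment above is to isolate the sortbyname1 div first and then match each score and link inside it. The sketch below is an assumption-laden illustration rather than the hack's actual code: the function name scrape_scores, the sample entries, the "green" class, and the closing A, P, and DIV tags are all invented for the example, since the fragment above is truncated. It also assumes the listing div contains no nested div elements.

```php
<?php
// Hypothetical helper: returns one array per listing with score, url, title.
function scrape_scores(string $html): array
{
    // Isolate the listing block. Assumes the block ends at the next </DIV>.
    if (!preg_match('/<DIV ID="sortbyname1">(.*?)<\/DIV>/is', $html, $block)) {
        return [];
    }

    // Inside the block, each entry is a score SPAN followed by a title link.
    preg_match_all(
        '/<SPAN CLASS="[^"]*">\s*(\d+)\s*<\/SPAN>\s*<A HREF="([^"]+)">\s*([^<]+?)\s*<\/A>/i',
        $block[1],
        $rows,
        PREG_SET_ORDER
    );

    $results = [];
    foreach ($rows as $row) {
        $results[] = ['score' => (int) $row[1], 'url' => $row[2], 'title' => $row[3]];
    }
    return $results;
}

// Made-up sample modeled on the fragment above; the entries are not real data.
$sample = <<<HTML
</TABLE><DIV ID="sortbyname1">
<P CLASS="listing"> <SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/example-one">Example One</A></P>
<P CLASS="listing"> <SPAN CLASS="green">88</SPAN> <A HREF="/video/titles/example-two">Example Two</A></P>
</DIV>
<DIV ID="other"><P CLASS="listing"> <SPAN CLASS="red">10</SPAN> <A HREF="/x">Ignored</A></P></DIV>
HTML;

print_r(scrape_scores($sample));
```

Narrowing the search to the div before matching individual entries keeps the second pattern simple and avoids picking up similar markup elsewhere on the page.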
