#41: Creating a Screen Scraper

A screen scraper is a program that fetches a web page and picks through the HTML for interesting or useful data. Here's a very simple one that extracts all hyperlinks from a page and then categorizes them. This scraper uses a lot of regular expressions, so let's take it one step at a time. First, let's check that the input (in $_REQUEST["page"]) is actually an HTTP or HTTPS URL and not someone trying to monkey around with files on the local system:

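<?php
// Fall back to an empty string if no "page" parameter was sent.
$page = $_REQUEST["page"] ?? "";
// Accept only URLs that start with http:// or https://
if (!preg_match('|^https?://|', $page)) {
    // Escape the input before echoing it back to avoid script injection.
    print "URL " . htmlspecialchars($page) . " invalid or unsupported.";
    exit;
}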
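
To see what this check accepts and rejects, the same test can be wrapped in a small helper and run against a few inputs. This is only a sketch: the function name is_supported_url and the sample URLs are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper wrapping the scheme check from the script above.
// The pattern https? is equivalent to https{0,1}.
function is_supported_url(string $page): bool
{
    // True only when the string begins with http:// or https://
    return preg_match('|^https?://|', $page) === 1;
}

var_dump(is_supported_url('http://example.com/'));   // bool(true)
var_dump(is_supported_url('https://example.com/'));  // bool(true)
var_dump(is_supported_url('/etc/passwd'));           // bool(false)
var_dump(is_supported_url('file:///etc/passwd'));    // bool(false)
var_dump(is_supported_url('ftp://example.com/'));    // bool(false)
```

Because the pattern is anchored with ^, a local path or a file:// URL is rejected before the script ever tries to open it.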

Let's say that this checks out, so now it's time to get the data and extract all of the hyperlinks in the anchor tags (see the example on Matching and Extracting ...
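
As a rough illustration of the extraction step, here is a minimal sketch that pulls href values out of anchor tags with preg_match_all. It is not the book's code: the function name extract_links, the pattern, and the sample markup are all assumptions, and the pattern only handles quoted attribute values.

```php
<?php
// Hypothetical helper: return the href value of every anchor tag in $html.
function extract_links(string $html): array
{
    // Match <a ... href="..."> or href='...', case-insensitively.
    // Group 1 is the opening quote, group 2 is the URL between the quotes.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

// Sample markup standing in for a fetched page. With allow_url_fopen
// enabled, file_get_contents($page) would be one way to get real data.
$html = '<p><a href="http://example.com/">One</a> '
      . "<A class='x' HREF='/about.html'>Two</A> "
      . '<a name="top">No link</a></p>';

print_r(extract_links($html));
```

For this sample the function returns the two href values and skips the anchor that has only a name attribute. A regular expression like this is fine for a quick scraper, but it is not a full HTML parser and can be fooled by unusual markup.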
