Chapter 15. A Web Spider in One Line

Tkil

One day, someone on the IRC #perl channel was asking some confused questions. We finally figured out that he was trying to write a web robot, or “spider,” in Perl. That is a grand idea, except that:

  1. Perfectly good spiders have already been written and are freely available at http://info.webcrawler.com/mak/projects/robots/robots.html.

  2. A Perl-based web spider is probably not an ideal project for novice Perl programmers. They should work their way up to it.

Having said that, I immediately pictured a one-line Perl robot. It wouldn’t do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I’ve split it onto separate lines for easier reading.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
    $ua = LWP::UserAgent->new;
    # @ARGV doubles as the work queue: shift a URL off, push new links on
    while (my $link = shift @ARGV) {
        print STDERR "working on $link";
        HTML::LinkExtor->new(
            sub {
                my ($t, %a) = @_;
                # keep href and src attributes, resolved against the current page
                my @links = map  { url($_, $link)->abs }
                            grep { defined } @a{qw/href src/};
                print STDERR "+ $_" foreach @links;
                push @ARGV, @links;
            }
        )->parse(
            do {
                my $r = $ua->simple_request(
                    HTTP::Request->new("GET", $link));
                # only HTML responses are parsed for links
                $r->content_type eq "text/html" ? $r->content : "";
            }
        )
    }' http://slinky.scrye.com/~tkil/
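
The loop above treats @ARGV as a work queue: each URL is shifted off the front, and any links found on that page are pushed onto the back. The sketch below isolates that pattern without touching the network. The %graph hash and the crawl subroutine are hypothetical stand-ins for real pages and for the link-extraction callback. The %seen check is an addition that the one-liner does not have; without it, two pages that link to each other would be fetched over and over.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real pages: page => links found on it.
my %graph = (
    'a' => [ 'b', 'c' ],
    'b' => [ 'a', 'c' ],    # links back to a, which would loop forever unguarded
    'c' => [],
);

# Walk the graph using an array as the queue, as the one-liner does with @ARGV.
sub crawl {
    my ($graph, @queue) = @_;
    my (%seen, @visited);
    while (defined(my $page = shift @queue)) {
        next if $seen{$page}++;                     # skip pages already processed
        push @visited, $page;
        push @queue, @{ $graph->{$page} || [] };    # enqueue newly found links
    }
    return @visited;
}

print "$_\n" for crawl(\%graph, 'a');
```

Because new links go on the end of the queue, pages are visited breadth-first: everything linked from the starting page is processed before anything found deeper.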

I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn’t that much of a terror. Here’s the one-line version.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '$ua = LWP::UserAgent->new; while (my $link = shift ...
