Collecting Genesis Words in Perl

Our first script, makeGenesisTags.pl, produces a list of the words that appear in the book of Genesis in the Bible. The data is retrieved from the copy of the book of Genesis at the Project Gutenberg web site. To run the script, enter this command:

makeGenesisTags.pl

It will produce a file called genesis.pl. This script uses LWP::Simple to screen-scrape the Project Gutenberg web site. Let's see how it works by examining the script:

#!/usr/bin/perl

use HTTP::Cache::Transparent;
use LWP::Simple;
use Data::Dumper;

These lines load the HTTP::Cache::Transparent, LWP::Simple, and Data::Dumper modules. If any of them is not installed, the script stops with an error message along the lines of "Can't locate Data/Dumper.pm in @INC."
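If you want to confirm that the modules are installed before running the script, a check along the following lines can help. This is a minimal sketch and not part of makeGenesisTags.pl; the helper name module_available is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of makeGenesisTags.pl): returns 1 if the
# named module can be loaded from @INC, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Turn Data::Dumper into Data/Dumper.pm, the file name require looks for.
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(HTTP::Cache::Transparent LWP::Simple Data::Dumper)) {
    printf "%-26s %s\n", $module,
        module_available($module) ? 'ok' : 'missing (install it from CPAN)';
}
```

Any module reported as missing needs to be installed from CPAN before the script will run.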

use strict;
use warnings;

These lines enable the strict and warnings pragmas, which help you catch misspelled variable names and other common problems in your script.
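To see what strict does with a misspelled variable name, consider this small sketch. The variable names $word_count and $word_cuont are hypothetical, and the faulty code is compiled with a string eval only so that the error can be captured and printed rather than halting the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The snippet is compiled with a string eval so that the compile-time
# error raised by strict can be captured instead of stopping this demo.
my $snippet = <<'END_SNIPPET';
use strict;
my $word_count = 1;
$word_cuont = 2;    # misspelled: strict refuses to compile this line
1;
END_SNIPPET

my $ok    = eval $snippet;
my $error = $@;

print $ok ? "compiled cleanly\n" : "strict caught: $error";
```

Without strict, Perl would quietly create a second variable under the misspelled name, and the bug would surface only later as a wrong result.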

$Data::Dumper::Terse = 1;  # omits the "$VAR1 = " prefix from Dumper output

This line prevents Data::Dumper from prefixing its output with the boilerplate text "$VAR1 = ". Without that prefix, we can assign the dumped data to whatever variable name we choose when we save it.
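The effect of the Terse setting can be sketched as follows. The sample hash and the variable name $tags are hypothetical stand-ins, not taken from makeGenesisTags.pl; Sortkeys is set only to make the output order repeatable.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Terse    = 1;    # no "$VAR1 = " prefix
$Data::Dumper::Sortkeys = 1;    # stable key order, so output is repeatable

# Hypothetical sample data standing in for the real word list.
my %counts = ( beginning => 1, earth => 2 );

# Dumper() now returns only the data structure, so we can attach any
# variable name we like when writing it out as Perl source.
my $dump = Dumper( \%counts );
chomp $dump;
my $source = 'my $tags = ' . $dump . ';';

print "$source\n";
```

The resulting text is valid Perl source that declares $tags, which is the kind of output that can be saved to a file such as genesis.pl.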

HTTP::Cache::Transparent::init( {
  BasePath => './cache',   # directory where cached copies are stored
  NoUpdate => 30*60        # 30 minutes, expressed in seconds
} );

The HTTP::Cache::Transparent module provides a simple way to make screen-scraping scripts more efficient. When you read data from a web site, a copy of the data is kept in a cached file. Subsequent reads will use the cached data rather than pulling ...
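The caching idea can be illustrated without any network access. The sketch below is not how HTTP::Cache::Transparent is implemented; it is a hand-rolled stand-in in which the helper cached_fetch, the example.com URL, and the stub downloader are all hypothetical. It also assumes that the NoUpdate value is a time in seconds.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempdir);

my $cache_dir = tempdir( CLEANUP => 1 );    # throwaway directory for the demo
my $max_age   = 30 * 60;                    # seconds; assumed to mirror NoUpdate

# Hypothetical helper: return the cached copy of $url if it is younger than
# $max_age seconds; otherwise call $fetcher->($url) and store the result.
sub cached_fetch {
    my ( $url, $fetcher ) = @_;
    my $file = "$cache_dir/" . md5_hex($url);

    if ( -e $file && time - ( stat $file )[9] < $max_age ) {
        open my $in, '<', $file or die "Can't read $file: $!";
        local $/;                           # slurp the whole file
        return scalar <$in>;
    }

    my $content = $fetcher->($url);
    open my $out, '>', $file or die "Can't write $file: $!";
    print {$out} $content;
    close $out;
    return $content;
}

# A stub stands in for a real download so the sketch runs offline.
my $downloads = 0;
my $stub = sub { $downloads++; return "In the beginning\n" };

cached_fetch( 'http://example.com/genesis.txt', $stub ) for 1 .. 3;
print "downloads performed: $downloads\n";
```

Although the text is requested three times, the stub downloader runs only once; the second and third requests are served from the cached file.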
