Track Additions to Yahoo!

Keep track of the number of sites added to your favorite Yahoo! categories.

Every day, a squad of surfers at Yahoo! adds new sites to the Yahoo! index. These changes are reflected in the Yahoo! What’s New page (http://dir.yahoo.com/new), along with the Picks of the Day.

If you’re a casual surfer, you might not care about the number of new sites added to Yahoo!. But there are several scenarios when you might have an interest:

You regularly glean information about new sites from Yahoo!

Knowing which categories are growing and which categories are stagnant will tell you where to direct your attention.

You want to submit sites to Yahoo!

Are you going to spend your hard-earned money adding a site to a category where new sites are added constantly (meaning your submitted site might quickly get buried)? Or will you be paying to add to a category that sees few additions (meaning your site might have a better chance of standing out)?

You’re interested in trend tracking

Which categories are consistently busy? Which are all but dead? By watching how Yahoo! adds sites to categories, over time you’ll get a sense of the rhythms and trends and detect when unusual activity occurs in a category.

This hack scrapes the recent counts of additions to Yahoo! categories and prints them out, providing an at-a-glance look at additions to various categories. You’ll also get a tab-delimited table of how many sites have been added to each category for each day. A tab-delimited file is excellent for importing into a spreadsheet, where you can turn the count numbers into a chart.

The Code

Save the following code to a file called hoocount.pl:

	#!/usr/bin/perl -w

	use strict;
	use Date::Manip;
	use LWP::Simple;
	use Getopt::Long;

	$ENV{TZ} = "GMT" if $^O eq "MSWin32";

	# the homepage for Yahoo!'s "What's New".
	my $new_url = "http://dir.yahoo.com/new/";
	
	# the major categories at Yahoo!. each name doubles as a key
	# into the hash that holds that category's counts string.
	my @categories = ("Arts & Humanities",		"Business & Economy",
				  "Computers & Internet",	"Education",
				  "Entertainment",				"Government",
				  "Health",				"News & Media",
				  "Recreation & Sports",	"Reference",
				  "Regional",					"Science",
				  "Social Science",				"Society & Culture");
	my %final_counts; # where we save our final readouts.

	# load in our options from the command line.
	my %opts; GetOptions(\%opts, "c|count=i");
	die "Usage: $0 --count <days>\n" unless $opts{c}; # count sites from past $i days.

	# if we've been told to count the number of new sites,
	# then we'll go through each of our main categories
	# for the last $i days and collate a result.

	# begin the header
	# for our import file.
	my $header = "Category";

	# from today, going backwards, get $i days.
	for (my $i=1; $i <= $opts{c}; $i++) {
	
		# create a Date::Manip time that will
		# be used to construct the last $i days	
		my $day; # query for Yahoo! retrieval.
		if ($i == 1) { $day = "yesterday"; }	
		else { $day = "$i days ago"; }
		my $date = UnixDate($day, "%Y%m%d");

		# add this date to
		# our import file.
		$header .= "\t$date";
		
		# and download the day.
		my $url = "$new_url$date.html";
		my $data = get($url) or die "Could not fetch $url\n";
		
		# and loop through each of our categories.
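		foreach my $category (sort @categories) {
			# quote the name so it matches literally, and fall back to 0 when
			# there is no match (a failed match leaves the previous $1 in place).
			my $count = ($data =~ /\Q$category\E.*?(\d+)/) ? $1 : 0;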
			$final_counts{$category} .= "\t$count"; # building our string.
		}
	}

	# with all our counts finished,
	# print out our final file.
	print $header . "\n";
	foreach my $category (@categories) {
		print $category, $final_counts{$category}, "\n";
	}
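
The extraction step can be tried on its own, without fetching anything. The following is a minimal sketch under stated assumptions: the `$data` string is a made-up stand-in for one day's page (the real markup may well differ), and `count_for` is a hypothetical helper name, not part of hoocount.pl. Note that a short name such as "Science" also matches inside "Social Science", so on a real page the first match depends on the order in which the categories appear.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for one day's "What's New" page;
# the real markup is an assumption and may differ.
my $data = join "\n",
    '<a href="/Arts/">Arts & Humanities</a> (32)',
    '<a href="/Health/">Health</a> (11)',
    '<a href="/Science/">Science</a> (6)';

# Return the first run of digits that follows the category name
# on the same line, or 0 if the category is not found.
sub count_for {
    my ($page, $category) = @_;
    return $page =~ /\Q$category\E.*?(\d+)/ ? $1 : 0;
}

# "Education" is absent from the sample, so it reports 0.
for my $category ("Arts & Humanities", "Education", "Health", "Science") {
    printf "%s\t%d\n", $category, count_for($data, $category);
}
```

Returning 0 explicitly on a failed match matters because Perl leaves `$1` untouched when a match fails, so a missing category would otherwise inherit the previous category's count.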

Running the Hack

The only argument you need to provide to the script is the number of days back you’d like it to travel in search of new additions. Since Yahoo! doesn’t archive its “new pages added” indefinitely, a safe upper limit is around two weeks. Here, we’re looking at the past two days:

	% perl hoocount.pl --count 2
	Category              20050711  20050710
	Arts & Humanities     32        9
	Business & Economy    44        2
	Computers & Internet  30        0
	Education             0         0
	Entertainment         77        0
	Government            2         0
	Health                11        0
	News & Media          0         0
	Recreation & Sports   48        1
	Reference             0         0
	Regional              81        3
	Science               6         9
	Social Science        0         0
	Society & Culture     12        0
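
The column headings are the same date stamps the script uses to build each day's URL. As a rough illustration of how those stamps line up with the `--count` value, here is a sketch that produces them with the core POSIX module instead of Date::Manip. The helper name `date_stamps` is hypothetical, and the sketch assumes GMT dates, which may not match the local-time behavior of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: return YYYYMMDD stamps for the $days days
# before $now (epoch seconds, defaulting to the current time), in GMT.
sub date_stamps {
    my ($days, $now) = @_;
    $now = time unless defined $now;
    return map { strftime("%Y%m%d", gmtime($now - $_ * 86_400)) } 1 .. $days;
}

# Print the header row that a two-day run would start with.
print join("\t", "Category", date_stamps(2)), "\n";
```

Passing a fixed epoch value as the second argument makes the output reproducible, which is handy when checking the logic.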

Hacking the Hack

If you’re not only a researcher but also a Yahoo! observer, you might be interested in how the number of sites added changes over time. To that end, you could run this script under cron or the Windows Scheduler and output the results to a file. After three months or so, you’d have a pretty interesting set of counts to manipulate with a spreadsheet program.
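
If a spreadsheet is more than you need, the saved output can also be summarized in Perl. The sketch below assumes each saved file holds the tab-delimited output of one run, in the format shown earlier; the function name `total_by_category` is hypothetical, and the sample rows are taken from the two-day run above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the per-day counts on each row of hoocount.pl output.
# Takes the lines of a saved file; returns a hash ref of category => total.
sub total_by_category {
    my @lines = @_;
    my %total;
    for my $line (@lines) {
        chomp $line;
        my ($category, @counts) = split /\t/, $line;
        # skip blank lines and the header row
        next if !defined $category || $category eq "Category";
        $total{$category} += $_ for grep { /^\d+$/ } @counts;
    }
    return \%total;
}

# Sample rows from the two-day run shown earlier.
my @sample = (
    "Category\t20050711\t20050710\n",
    "Arts & Humanities\t32\t9\n",
    "Regional\t81\t3\n",
);
my $total = total_by_category(@sample);
printf "%s\t%d\n", $_, $total->{$_} for sort keys %$total;
```

To run it against real data, read the lines from a saved file (for example, one produced by redirecting the script's output) and pass them to the function in place of `@sample`.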

Kevin Hemenway and Tara Calishain
