Chapter 9. Programming Google

Hacks 92-100

When search engines first appeared on the scene, they were more open to being spidered, scraped, and aggregated. Sites like Excite and AltaVista didn’t worry too much about the odd surfer using Perl to grab a slice of a page or meta-search engines including their results in aggregated search results. Sure, egregious data suckers might get shut out, but the search engines weren’t worried about sharing their information on a smaller scale.

Google never took that stance. Instead, they have regularly prohibited meta-search engines from using their content without a license, and they try their best to block unidentified web agents like Perl’s LWP::Simple module or even wget on the command line. Google has even been known to block IP address ranges for running automated queries.

Google had every right to do this; after all, it was their search technology, database, and computer power. Unfortunately, however, these policies meant that casual researchers and Google nuts, like you and me, didn’t have the ability to play with their rich dataset in any automated way.

Google changed all that with the release of the Google Web API (http://api.google.com/) in the spring of 2002. The Google Web API doesn’t allow you to do every kind of search possible (for example, it doesn’t support the phonebook: syntax), but it does make available Google’s eight-billion-page web database so that developers can create their own interfaces and use Google search results to their ...
