Search Engines and CGI

Most webmasters will be passionately anxious that their creations are properly indexed by the search engines on the Web, so that the teeming millions may share the delights they offer. At the time of writing, the search engines were coming under a good deal of criticism for being slow, inaccurate, arbitrary, and often plain wrong. One of the more serious criticisms alleged that sites that offered large numbers of separate pages produced by scripts from databases (in other words, most of the serious e-commerce sites) were not being properly indexed. According to one estimate, only 1 page in 500 would actually be found. This invisible material is often called “The Dark Web.”

The Netcraft survey of June 2000 visited about 16 million web sites. At the same time Google claimed to be the most comprehensive search engine with 2 million sites indexed. This meant that, at best, only one site in eight could then be found via the best search engine. Perhaps wisely, Google no longer quotes a number of sites. Instead it claims (as of August, 2001) to index 1,387,529,000 web pages. Since the Netcraft survey for July 2001 showed 31 million sites (http://www.netcraft.com/Survey/Reports/200107/graphs.html), the implication is that the average site has only about 44 pages, which seems too few by a long way and suggests that a lot of sites are not being indexed at all.
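To make the implied arithmetic explicit (a rough back-of-the-envelope check using only the figures quoted above, not anything from the surveys themselves):

$$
\frac{2{,}000{,}000\ \text{sites indexed}}{16{,}000{,}000\ \text{sites surveyed}} = \frac{1}{8},
\qquad
\frac{1{,}387{,}529{,}000\ \text{pages indexed}}{31{,}000{,}000\ \text{sites}} \approx 44.8\ \text{pages per site}
$$

The second ratio only averages 44-odd pages per site if every site is indexed; since many real sites run to hundreds or thousands of pages, the more plausible reading is that a large fraction of sites contribute nothing to the index at all.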

The reason seems to be that the search engines spend most of their time and energy fighting off “spam” — attempts ...
