Mining the Web by Soumen Chakrabarti


CHAPTER 8 RESOURCE DISCOVERY

General-purpose crawlers take a centralized, snapshot view of what is essentially a fully distributed hypermedium in uncontrolled flux. They seek to collect and process the entire contents of the Web at a central location, where it can be indexed in advance so as to answer any possible query. Meanwhile, the Web, already holding some two billion pages, keeps growing and changing, making centralized processing ever more difficult. An estimated 600 GB worth of pages changed per month in 1997 alone [120].

In its early days, most of the Web could be collected by small- to medium-scale crawlers. From 1996 to 1999, coverage became a very stiff challenge: from an estimated coverage of 35% in 1997 [16], crawlers dropped ...
