Search-Based Applications by Laura Wilber, Gregory Grefenstette

CHAPTER 5

Data Collection/Population

At A Glance

Characteristic     Search Engine       Databases
---------------    ----------------    -------------------------------
Primary method     Crawlers            Direct writes, ETL (connectors)
Pre-processing     Not required        Required
Data freshness     Quasi-real-time     24+ hours for data warehouses

5.1    SEARCH ENGINES

5.1.1  COLLECTION

Early Web search engines used a single primary tool to collect data: a software program called a crawler [Heydon and Najork, 1999]. The crawler would connect to a website, capture the text it contained along with basic metadata such as page titles and content headers or sub-headers, send the collected information back to one or more central servers for indexing, and then follow the hyperlinks from one page to the next in an unending circuit across the Web. Aside ...
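The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular engine: the names `PageParser`, `crawl`, `fetch`, and `index` are hypothetical, the `index` dictionary stands in for the central indexing server, and the `example.test` pages are invented sample data held in memory. Unlike the unending circuit of a real crawler, the sketch stops after `max_pages` pages, and it ignores practical concerns such as politeness delays and robots.txt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageParser(HTMLParser):
    """Collects one page's title, headers, body text, and outgoing links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headers = []
        self.text_parts = []
        self.links = []
        self._context = None  # "title" or a header tag while inside one

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADER_TAGS:
            self._context = tag
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._context:
            self._context = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._context == "title":
            self.title = text  # metadata only, not part of the body text
            return
        if self._context in HEADER_TAGS:
            self.headers.append(text)
        self.text_parts.append(text)


def crawl(seed_url, fetch, index, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML or None (assumed interface)."""
    frontier = deque([seed_url])
    seen = {seed_url}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # "Send" the captured text and metadata to the (stand-in) index.
        index[url] = {
            "title": parser.title,
            "headers": parser.headers,
            "text": " ".join(parser.text_parts),
        }
        # Follow hyperlinks, resolving relative URLs against the current page.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


# Invented two-page site kept in memory so the sketch runs without a network.
SITE = {
    "http://example.test/": "<title>Home</title><h1>Welcome</h1>"
                            "<p>Hello</p><a href='/about'>About</a>",
    "http://example.test/about": "<title>About</title><h2>Team</h2>"
                                 "<a href='/'>Home</a>",
}
index = crawl("http://example.test/", SITE.get, {})
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the same loop could be pointed at a real HTTP client. The `seen` set is what keeps the crawler from revisiting pages that link back to each other.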
