Chapter 8. Indexer

At this point we have built a basic web crawler; the next piece we will work on is the index. In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.

For example, if a user submits the search terms “Java” and “programming”, we would look up both search terms and get two sets of pages. Pages with the word “Java” would include pages about the island of Java, the nickname for coffee, and the programming language. Pages with the word “programming” would include pages about different programming languages, as well as other uses of the word. By selecting pages with both terms, we hope to eliminate irrelevant pages and find the ones about Java programming.

Now that we understand what the index is and what operations it performs, we can design a data structure to represent it.

Data Structure Selection

The fundamental operation of the index is a lookup; specifically, we need the ability to look up a term and find all pages that contain it. The simplest implementation would be a collection of pages. Given a search term, we could iterate through the contents of the pages and select the ones that contain the search term. But the runtime would be proportional to the total number of words on all the pages, which would be way too slow. ...

Get Think Data Structures now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.