Chapter 6. Tree Traversal

This chapter introduces the application we will develop during the rest of the book, a web search engine. I describe the elements of a search engine and introduce the first application, a web crawler that downloads and parses pages from Wikipedia. This chapter also presents a recursive implementation of depth-first search and an iterative implementation that uses a Java Deque to implement a “last in, first out” stack.
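
As a preview, here is a minimal sketch of both traversals over a made-up tree class. The names here (TreeNode, label, TraversalPreview) are placeholders of my own, not classes from the book; the chapter develops the real versions against parsed web pages.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // A made-up node class, just for this sketch.
    class TreeNode {
        String label;
        List<TreeNode> children;

        TreeNode(String label, List<TreeNode> children) {
            this.label = label;
            this.children = children;
        }
    }

    public class TraversalPreview {

        // Recursive DFS: visit the node, then recurse on each child.
        static void recursiveDFS(TreeNode node) {
            System.out.println(node.label);
            for (TreeNode child : node.children) {
                recursiveDFS(child);
            }
        }

        // Iterative DFS: a Deque used as a "last in, first out" stack
        // takes over the job the call stack does in the recursive version.
        static void iterativeDFS(TreeNode root) {
            Deque<TreeNode> stack = new ArrayDeque<>();
            stack.push(root);
            while (!stack.isEmpty()) {
                TreeNode node = stack.pop();
                System.out.println(node.label);
                // Push children in reverse order so the first child is
                // popped first, matching the recursive traversal order.
                List<TreeNode> children = node.children;
                for (int i = children.size() - 1; i >= 0; i--) {
                    stack.push(children.get(i));
                }
            }
        }

        public static void main(String[] args) {
            TreeNode b = new TreeNode("b", List.of());
            TreeNode c = new TreeNode("c", List.of());
            TreeNode a = new TreeNode("a", List.of(b, c));
            recursiveDFS(a);  // prints a, b, c
            iterativeDFS(a);  // prints a, b, c
        }
    }

The two versions visit the nodes in the same order; the only difference is that the iterative version makes the stack explicit.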

Search Engines

A web search engine, like Google Search or Bing, takes a set of search terms and returns a list of web pages that are relevant to those terms (I’ll discuss what “relevant” means later). You can read more at http://thinkdast.com/searcheng, but I’ll explain what you need as we go along.

The essential components of a search engine are:

Crawling

We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.

Indexing

We’ll need a data structure that makes it possible to look up a search term and find the pages that contain it.

Retrieval

And we’ll need a way to collect results from the index and identify pages that are most relevant to the search terms.
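
To make indexing and retrieval concrete, here is a toy sketch of an index as a map from each term to the set of URLs that contain it. The class and method names are illustrative assumptions, not the data structure we will build:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // A toy index: maps each search term to the set of page URLs
    // that contain it. Illustrative only.
    public class ToyIndex {
        private final Map<String, Set<String>> index = new HashMap<>();

        // Indexing: record that term appears on the page at url.
        public void add(String term, String url) {
            index.computeIfAbsent(term, k -> new HashSet<>()).add(url);
        }

        // Retrieval: find the pages that contain term.
        public Set<String> lookup(String term) {
            return index.getOrDefault(term, Set.of());
        }

        public static void main(String[] args) {
            ToyIndex index = new ToyIndex();
            index.add("traversal", "https://en.wikipedia.org/wiki/Tree_traversal");
            System.out.println(index.lookup("traversal"));
        }
    }

A real index has to record more than this, such as how often each term appears on each page, so that retrieval can rank pages by relevance; this sketch only shows the shape of the lookup.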

We’ll start with the crawler. The goal of a crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but often crawlers are limited to a smaller domain. In our case, we will only read pages from Wikipedia.
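
As a rough sketch of what downloading and parsing might look like, here is an example that uses jsoup, an HTML parsing library for Java. Both the choice of jsoup and the particular URL are assumptions for illustration; this section does not name a library.

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class FetchSketch {
        public static void main(String[] args) throws IOException {
            // Download and parse one page (the URL is just an example).
            String url = "https://en.wikipedia.org/wiki/Tree_traversal";
            Document doc = Jsoup.connect(url).get();

            // Extract the text of the page.
            String text = doc.body().text();
            System.out.println(text.substring(0, Math.min(80, text.length())));

            // Extract the links; "abs:href" asks jsoup to resolve
            // relative links against the page's URL, which matters
            // because most links on Wikipedia are relative.
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }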

As a first step, we’ll build a crawler that reads a Wikipedia ...
