Chapter 15. Crawling Wikipedia

In this chapter, I present a solution to the previous exercise and analyze the performance of web indexing algorithms. Then we build a simple web crawler.

The Redis-Backed Indexer

In my solution, we store two kinds of structures in Redis:

  • For each search term, we have a URLSet, which is a Redis set of URLs that contain the search term.

  • For each URL, we have a TermCounter, which is a Redis hash that maps each search term to the number of times it appears.

We discussed these data types in the previous chapter. You can also read about Redis structures at http://thinkdast.com/redistypes.
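To make these structures concrete, here is a minimal sketch of the raw Redis operations they correspond to, using the Jedis client. It assumes a Redis server running on localhost, and the term and URL are made up for illustration; the key prefixes match the helper methods defined below.

Jedis jedis = new Jedis("localhost", 6379);
String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";

// URLSet: a Redis set of URLs that contain the term "java"
jedis.sadd("URLSet:java", url);

// TermCounter: a Redis hash that maps each term to its count on this page
jedis.hincrBy("TermCounter:" + url, "java", 1);

// Reading the structures back
Set<String> urls = jedis.smembers("URLSet:java");
Map<String, String> counts = jedis.hgetAll("TermCounter:" + url);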

In JedisIndex, I provide a method that takes a search term and returns the Redis key of its URLSet:

private String urlSetKey(String term) {
    return "URLSet:" + term;
}

And a method that takes a URL and returns the Redis key of its TermCounter:

private String termCounterKey(String url) {
    return "TermCounter:" + url;
}
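For example, assuming we are inside JedisIndex, the keys these helpers produce for a term and an example URL look like this (the URL is only for illustration):

String setKey = urlSetKey("java");
// setKey is "URLSet:java"

String counterKey = termCounterKey("https://en.wikipedia.org/wiki/Java_(programming_language)");
// counterKey is "TermCounter:https://en.wikipedia.org/wiki/Java_(programming_language)"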

Here’s the implementation of indexPage, which takes a URL and a jsoup Elements object that contains the DOM tree of the paragraphs we want to index:

public void indexPage(String url, Elements paragraphs) {
    System.out.println("Indexing " + url);

    // make a TermCounter and count the terms in the paragraphs
    TermCounter tc = new TermCounter(url);
    tc.processElements(paragraphs);

    // push the contents of the TermCounter to Redis
    pushTermCounterToRedis(tc);
}
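The Redis work is delegated to pushTermCounterToRedis, which is not shown here. Here is a sketch of what that method might look like; it assumes JedisIndex holds a connected Jedis instance in a field named jedis, and that TermCounter provides getLabel, keySet, and get, as in the earlier exercise. It uses a Redis Transaction so that all of the updates are sent to the server in a single batch:

public List<Object> pushTermCounterToRedis(TermCounter tc) {
    Transaction t = jedis.multi();

    String url = tc.getLabel();
    String hashname = termCounterKey(url);

    // if the page has been indexed before, delete the old hash
    t.del(hashname);

    // for each term, update the TermCounter hash and add the URL
    // to the term's URLSet
    for (String term : tc.keySet()) {
        Integer count = tc.get(term);
        t.hset(hashname, term, count.toString());
        t.sadd(urlSetKey(term), url);
    }
    return t.exec();
}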

To index a page, we

  1. Make a Java TermCounter for the contents of the page, using code from a previous exercise.

  2. Push the contents of the TermCounter to Redis.
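As a usage sketch, here is one way the indexer might be driven from jsoup. It assumes a local Redis server and a JedisIndex constructor that takes a Jedis connection; the Wikipedia URL and the CSS selector for the article body are illustrative assumptions, not part of the code above:

Jedis jedis = new Jedis("localhost", 6379);
JedisIndex index = new JedisIndex(jedis);

String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
Document doc = Jsoup.connect(url).get();

// select the paragraphs in the article body (selector is an assumption)
Elements paragraphs = doc.select("div#mw-content-text p");
index.indexPage(url, paragraphs);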
