
Building the Index

The next step is to set up the database for the full-text index. As I mentioned earlier, the index is a list of all the different words, along with the documents in which they appear and their locations in the documents. In this example, you'll be looking at the actual text on the page and ignoring nontext elements. You'll also be indexing individual words with all the punctuation characters removed. The method for separating words is not perfect, but it will suffice for building a basic search engine.

Because covering different database software or setting up a database server is outside the scope of this book, this chapter will show you how to store the index using SQLite. SQLite is an embedded database that is very easy to set up and stores a whole database in one file. SQLite uses SQL for queries, so it shouldn't be too difficult to convert the sample code to use a different database. The Python implementation is called pysqlite, and you can download it from http://initd.org/tracker/pysqlite.

There is a Windows installer as well as instructions for installing it on other operating systems. Appendix A contains more information on getting and installing pysqlite.

Once you have SQLite installed, add this line to the start of searchengine.py:

from pysqlite2 import dbapi2 as sqlite
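If you have Python 2.5 or later, the sqlite3 module in the standard library exposes the same DB-API interface, so you can use it instead of installing pysqlite. Swapping the import is all that should be required, though the rest of this chapter assumes pysqlite:

from sqlite3 import dbapi2 as sqlite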

You'll also need to change the __init__, __del__, and dbcommit methods to open and close the database:

  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def dbcommit(self):
    self.con.commit()
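If you're building searchengine.py from scratch alongside this section rather than carrying it forward from the crawler code earlier in the chapter, the top of the file looks roughly like this. The import list and the ignorewords set come from that earlier section, so treat this as a reminder of context rather than a new listing, and adjust it to match what you already have; the crawler class with its methods follows below these lines:

import urllib2
import re
from urlparse import urljoin
from BeautifulSoup import *
from pysqlite2 import dbapi2 as sqlite

# Common words that carry little meaning and will not be indexed
ignorewords=set(['the','of','to','and','a','in','is','it'])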

Setting Up the Schema

Don't run the code just yet; you still need to prepare the database. The schema for the basic index is five tables. The first table (urllist) is the list of URLs that have been indexed. The second table (wordlist) is the list of words, and the third table (wordlocation) is a list of the locations of words in the documents. The remaining two tables specify links between documents. The link table stores two URL IDs, indicating a link from one page to another, and linkwords uses the wordid and linkid columns to record which words are actually used in that link. The schema is shown in Figure 4-1.

Figure 4-1. Schema for the search engine

All tables in SQLite have a field called rowid by default, so there's no need to explicitly specify an ID for these tables. To create a function for adding all the tables, add this code to the end of searchengine.py so that it's part of the crawler class:

  def createindextables(self):
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()

This function will create the schema for all the tables that you will be using, along with some indices to speed up searching. These indices are important, since the dataset can potentially get very large. Enter these commands in your Python session to create a database called searchindex.db:

>> reload(searchengine)
>> crawler=searchengine.crawler('searchindex.db')
>> crawler.createindextables()
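If you want to confirm that everything was created, SQLite records table and index definitions in its built-in sqlite_master catalog, so a quick query from the same session should list the five tables (this is only a sanity check; nothing else in the chapter depends on it):

>> [r for r in crawler.con.execute(
.. "select name from sqlite_master where type='table'")]
[(u'urllist',), (u'wordlist',), (u'wordlocation',), (u'link',), (u'linkwords',)]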

Later you'll be adding an additional table to the schema for a scoring metric based on counting inbound links.

Finding the Words on a Page

The files that you're downloading from the Web are HTML and thus contain a lot of tags, properties, and other information that doesn't belong in the index. The first step is to extract all the parts of the page that are text. You can do this by searching the soup for text nodes and collecting all their content. Add this code to your gettextonly function:

  def gettextonly(self,soup):
    v=soup.string
    if v==None:
      c=soup.contents
      resulttext=''
      for t in c:
        subtext=self.gettextonly(t)
        resulttext+=subtext+'\n'
      return resulttext
    else:
      return v.strip()

The function returns a long string containing all the text on the page. It does this by recursively traversing down the HTML document object model, looking for text nodes. Text that was in separate sections is separated into different paragraphs. It's important to preserve the order of the sections for some of the metrics you'll be calculating later.
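If you want to see what gettextonly produces, you can feed it a small hand-made page from an interactive session (after reloading searchengine and re-creating the crawler instance). The HTML here is made up purely for illustration, and the output should look something like this:

>> from BeautifulSoup import BeautifulSoup
>> page=BeautifulSoup('<html><body><h1>Languages</h1><p>Python is <b>fun</b></p></body></html>')
>> print crawler.gettextonly(page)
Languages
Python is
fun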

Next is the separatewords function, which splits a string into a list of separate words so that they can be added to the index. It's not as easy as you might think to do this perfectly, and there has been a lot of research into improving the technique. However, for these examples it will suffice to consider anything that isn't a letter or a number to be a separator. You can do this using a regular expression. Replace the definition of separatewords with the following:

  def separatewords(self,text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']

Because this function considers anything nonalphanumeric to be a separator, it will have no problem extracting English words, but it won't properly handle terms like "C++" (no trouble searching for "python," though). You can experiment with the regular expression to make it work better for different kinds of searches.
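You can see the effect of this choice directly from an interactive session (again assuming a reloaded crawler instance); notice how "C++" collapses to just "c" while ordinary words and numbers come through cleanly:

>> crawler.separatewords('Languages like C++, Python 2.5 and Lisp')
['languages', 'like', 'c', 'python', '2', '5', 'and', 'lisp']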

Tip

Another possibility is to remove suffixes from the words using a stemming algorithm. These algorithms attempt to convert the words to their stems. For example, the word "indexing" becomes "index" so that people searching for the word "index" are also shown documents containing the word "indexing." To do this, stem the words while crawling documents and also stem the words in the search query. A full discussion of stemming is outside the scope of this chapter, but you can find a Python implementation of the well-known Porter Stemmer at http://www.tartarus.org/~martin/PorterStemmer/index.html.
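If you'd like to experiment before pulling in a real stemmer, a deliberately naive suffix-stripper conveys the idea. This is a toy sketch with a made-up suffix list, not the Porter algorithm, and it will mangle plenty of words that a real stemmer handles correctly:

  def crudestem(word):
    # Strip a few common English suffixes; 'indexing' and 'indexed'
    # both become 'index'. A real stemmer covers far more cases.
    for suffix in ['ing','ed','ly','es','s']:
      if word.endswith(suffix) and len(word)-len(suffix)>2:
        return word[:-len(suffix)]
    return word

To use something like this, you would stem each word in separatewords before it goes into the index and stem the words of every incoming query the same way, so that both sides agree on the stored form.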

Adding to the Index

You're ready to fill in the code for the addtoindex method. This method will call the two functions that were defined in the previous section to get a list of words on the page. Then it will add the page and all the words to the index, and will create links between them with their locations in the document. For this example, the location will be the index within the list of words.

Here is the code for addtoindex:

  def addtoindex(self,url,soup):
    if self.isindexed(url): return
    print 'Indexing '+url

    # Get the individual words
    text=self.gettextonly(soup)
    words=self.separatewords(text)

    # Get the URL id
    urlid=self.getentryid('urllist','url',url)

    # Link each word to this url
    for i in range(len(words)):
      word=words[i]
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into wordlocation(urlid,wordid,location) \
        values (%d,%d,%d)" % (urlid,wordid,i))

You'll also need to update the helper function getentryid. All this function does is return the ID of an entry. If the entry doesn't exist, it is created and the new ID is returned:

  def getentryid(self,table,field,value,createnew=True):
    cur=self.con.execute(
      "select rowid from %s where %s='%s'" % (table,field,value))
    res=cur.fetchone()
    if res==None:
      cur=self.con.execute(
        "insert into %s (%s) values ('%s')" % (table,field,value))
      return cur.lastrowid
    else:
      return res[0]
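To see getentryid behave as described, call it twice with the same value from an interactive session; the second call finds the row created by the first. The ID shown here is purely illustrative and depends on what is already in your database:

>> crawler.getentryid('wordlist','word','programming')
1
>> crawler.getentryid('wordlist','word','programming')
1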

As you crawl, you'll also want to remember which pages link to one another. This will become important later when link-based scoring methods are introduced. Add the addlinkref method to your crawler class:

  def addlinkref(self,urlFrom,urlTo,linkText):
    words=self.separatewords(linkText)
    fromid=self.getentryid('urllist','url',urlFrom)
    toid=self.getentryid('urllist','url',urlTo)
    if fromid==toid: return
    cur=self.con.execute("insert into link(fromid,toid) values (%d,%d)" %
      (fromid,toid))
    linkid=cur.lastrowid
    for word in words:
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into linkwords(linkid,wordid) values (%d,%d)" %
        (linkid,wordid))
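For context, addlinkref isn't called directly; the crawl method from earlier in the chapter invokes it for each link it follows. The lines below are only a sketch of that call site, and the variable names are assumptions based on that earlier listing:

      # Inside crawler.crawl, for each link found on the page being crawled:
      linkText=self.gettextonly(link)
      self.addlinkref(page,url,linkText)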

Finally, you'll need to fill in the code for isindexed, which determines whether the page is already in the database, and if so, whether there are any words associated with it:

  def isindexed(self,url):
    u=self.con.execute \
      ("select rowid from urllist where url='%s'" % url).fetchone()
    if u!=None:
      # Check if it has actually been crawled
      v=self.con.execute(
        'select * from wordlocation where urlid=%d' % u[0]).fetchone()
      if v!=None: return True
    return False
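A quick interactive check makes the logic concrete: on a freshly created searchindex.db nothing has been indexed yet, so the call below should return False (and True once the crawl in the next step has visited the page):

>> crawler.isindexed('http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html')
False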

Now you can rerun the crawler and have it actually index the pages as it goes. You can do this in your interactive session:

>> reload(searchengine)
>> crawler=searchengine.crawler('searchindex.db')
>> pages= \
.. ['http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html']
>> crawler.crawl(pages)

The crawler will probably take a long time to run. Instead of waiting for it to finish, I recommend that you download a preloaded copy of searchindex.db from http://kiwitobes.com/db/searchindex.db and save it in the directory with your Python code.

If you'd like to make sure that the crawl worked properly, you can try checking the entries for a word by querying the database:

>> [row for row in crawler.con.execute(
.. 'select rowid from wordlocation where wordid=1')]
[(1,), (46,), (330,), (232,), (406,), (271,), (192,),...

The list that is returned contains the IDs of all the URLs in which the word with ID 1 appears, which means that you've successfully run a full-text search. This is a great start, but it will only work with one word at a time, and it just returns the documents in the order in which they were loaded. The next section will show you how to expand this functionality by handling searches with multiple words in the query.
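If you'd rather see the pages themselves than their numeric IDs, you can join wordlocation against urllist in the same spirit; exactly which URLs come back depends on your particular crawl, so no output is shown here:

>> [row for row in crawler.con.execute(
.. 'select url from urllist,wordlocation '
.. 'where urllist.rowid=wordlocation.urlid and wordid=1')]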
