So far you've managed to retrieve pages that match the queries, but the order in which they are returned is simply the order in which they were crawled. In a large set of pages, you would be stuck sifting through a lot of irrelevant content for any mention of each of the query terms in order to find the pages that are really related to your search. To address this issue, you need ways to give pages a score for a given query, as well as the ability to return them with the highest scoring results first.
This section will look at several ways to calculate a score based only on the query and the content of the page. These scoring metrics include:

Word frequency: The number of times the words in the query appear in the document can help determine how relevant the document is.

Document location: The main subject of a document will probably appear near the beginning of the document.

Word distance: If there are multiple words in the query, they should appear close together in the document.
The earliest search engines often worked with only these types of metrics and were able to give usable results. Later sections will cover ways to improve results with information external to the page, such as the number and quality of incoming links.
First, you'll need a new method that will take a query, get the rows, put them in a dictionary, and display them in a formatted list. Add these functions to your searcher class:
def getscoredlist(self,rows,wordids):
  totalscores=dict([(row[0],0) for row in rows])

  # This is where you'll later put the scoring functions
  weights=[]

  for (weight,scores) in weights:
    for url in totalscores:
      totalscores[url]+=weight*scores[url]

  return totalscores

def geturlname(self,id):
  return self.con.execute(
    "select url from urllist where rowid=%d" % id).fetchone()[0]

def query(self,q):
  rows,wordids=self.getmatchrows(q)
  scores=self.getscoredlist(rows,wordids)
  rankedscores=sorted([(score,url) for (url,score) in scores.items()],reverse=1)
  for (score,urlid) in rankedscores[0:10]:
    print '%f\t%s' % (score,self.geturlname(urlid))
Right now the query method doesn't apply any scoring to the results, but it does display the URLs along with a placeholder for their scores:
>> e.query('functional programming')
0.000000	http://kiwitobes.com/wiki/XSLT.html
0.000000	http://kiwitobes.com/wiki/XQuery.html
0.000000	http://kiwitobes.com/wiki/Unified_Modeling_Language.html
...
The important function here is getscoredlist, which you'll be filling in throughout this section. As you add scoring functions, you can add calls to them in the weights list (the weights=[] line) and start to get some real scores.
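To see how the weighted combination in getscoredlist behaves, here is a standalone sketch of the same loop; the two score dictionaries below are invented for illustration, not real output:

```python
# Standalone sketch of the weighted-sum loop from getscoredlist.
# The score dictionaries here are hypothetical examples.
def combine(urlids, weights):
    totalscores = dict([(u, 0) for u in urlids])
    for (weight, scores) in weights:
        for url in totalscores:
            totalscores[url] += weight * scores[url]
    return totalscores

freq = {1: 1.0, 2: 0.5}   # pretend normalized frequency scores
loc = {1: 0.2, 2: 1.0}    # pretend normalized location scores
combined = combine([1, 2], [(1.0, freq), (1.5, loc)])
# URL 2 wins: 1.0*0.5 + 1.5*1.0 = 2.0, versus 1.0*1.0 + 1.5*0.2 = 1.3 for URL 1
```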
All the scoring methods introduced here return dictionaries of the URL IDs and a numerical score. To complicate things, sometimes a larger score is better and sometimes a smaller score is better. In order to compare the results from different methods, you need a way to normalize them; that is, to get them all within the same range and direction.
The normalization function will take a dictionary of IDs and scores and return a new dictionary with the same IDs, but with scores between 0 and 1. Each score is scaled according to how close it is to the best result, which will always have a score of 1. All you have to do is pass the function a list of scores and indicate whether a lower or higher score is better:
def normalizescores(self,scores,smallIsBetter=0):
  vsmall=0.00001 # Avoid division by zero errors
  if smallIsBetter:
    minscore=min(scores.values())
    return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
  else:
    maxscore=max(scores.values())
    if maxscore==0: maxscore=vsmall
    return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])
Each of the scoring functions calls this function to normalize its results and return a value between 0 and 1.
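For instance, a standalone copy of the normalization logic (rewritten here without the class, purely for experimentation) behaves like this on a made-up pair of scores:

```python
# Standalone version of the normalizescores logic, for illustration only.
def normalizescores(scores, smallIsBetter=0):
    vsmall = 0.00001  # Avoid division by zero errors
    if smallIsBetter:
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, l))
                     for (u, l) in scores.items()])
    else:
        maxscore = max(scores.values())
        if maxscore == 0: maxscore = vsmall
        return dict([(u, float(c) / maxscore) for (u, c) in scores.items()])

big = normalizescores({1: 5, 2: 10})                     # higher is better
small = normalizescores({1: 5, 2: 10}, smallIsBetter=1)  # lower is better
# big puts URL 2 first with 1.0; small puts URL 1 first with 1.0
```

Either way, the best result ends up at 1 and the rest are scaled relative to it, so scores from different metrics can be added together.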
The word frequency metric scores a page based on how many times the words in the query appear on that page. If I search for "python," I'd rather get a page about Python (or pythons) with many mentions of the word, and not a page about a musician who happens to mention near the end that he has a pet python.
The word frequency function looks like this. You can add it to your searcher class:
def frequencyscore(self,rows):
  counts=dict([(row[0],0) for row in rows])
  for row in rows: counts[row[0]]+=1
  return self.normalizescores(counts)
This function creates a dictionary with an entry for every unique URL ID in rows, and counts how many times each item appears. It then normalizes the scores (bigger is better, in this case) and returns the result.
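As a quick check, here is the counting step run on a few hypothetical rows (the tuples are invented; each is a URL ID followed by one location per query word, in the format returned by getmatchrows):

```python
# Hypothetical rows: (urlid, loc_word1, loc_word2)
rows = [(1, 0, 5), (1, 10, 12), (1, 30, 2), (2, 4, 8)]

counts = dict([(row[0], 0) for row in rows])
for row in rows:
    counts[row[0]] += 1
# URL 1 appears in three rows and URL 2 in one, so URL 1 scores highest
```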
To activate frequency scoring in your results, change the weights line in getscoredlist to read:

weights=[(1.0,self.frequencyscore(rows))]
Now you can try another search and see how well this works as a scoring metric:
>> e.query('functional programming')
1.000000	http://kiwitobes.com/wiki/Functional_programming.html
0.262476	http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html
0.062310	http://kiwitobes.com/wiki/Programming_language.html
0.043976	http://kiwitobes.com/wiki/Lisp_programming_language.html
0.036394	http://kiwitobes.com/wiki/Programming_paradigm.html
...
This returns the page on "Functional programming" in first place, followed by several other relevant pages. Notice that "Functional programming" scored four times better than the result directly below it. Most search engines don't report scores to end users, but these scores can be very useful for some applications. For instance, you might want to take the user directly to the top result if it exceeds a certain threshold, or display results in a font size proportional to the relevance of the result.
Another simple metric for determining a page's relevance to a query is the search term's location in the page. Usually, if a page is relevant to the search term, it will appear closer to the top of the page, perhaps even in the title. To take advantage of this, the search engine can score results higher if the query term appears early in the document. Fortunately for us, when the pages were indexed earlier, the locations of the words were recorded, and the title of the page is first in the list.
Add this method to your searcher class:
def locationscore(self,rows):
  locations=dict([(row[0],1000000) for row in rows])
  for row in rows:
    loc=sum(row[1:])
    if loc<locations[row[0]]: locations[row[0]]=loc
  return self.normalizescores(locations,smallIsBetter=1)
Remember that the first item in each row element is the URL ID, followed by the locations of all the different search terms. Each ID can appear multiple times, once for every combination of locations. For each row, the method sums the locations of all the words and determines how this result compares to the best result for that URL so far. It then passes the final results to the normalizescores function. Note that smallIsBetter=1 means that the URL with the lowest location sum gets a score of 1.0.
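A small hand-run example makes this concrete (the rows are invented):

```python
# Hypothetical rows: (urlid, loc_word1, loc_word2)
rows = [(1, 0, 5), (1, 40, 60), (2, 100, 3)]

locations = dict([(row[0], 1000000) for row in rows])
for row in rows:
    loc = sum(row[1:])            # total of the word locations in this row
    if loc < locations[row[0]]:
        locations[row[0]] = loc   # keep the best (lowest) sum per URL
# URL 1 keeps 0+5=5 (better than its other row's 40+60=100); URL 2 gets 100+3=103
```

After normalization with smallIsBetter=1, URL 1 would score 1.0 and URL 2 far lower.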
To see what the results look like using only the location score, change the weights line to:

weights=[(1.0,self.locationscore(rows))]
Now try the query again in your interpreter:
>> e.query('functional programming')
You'll notice that "Functional programming" is still the winner, but the other top results are now examples of functional programming languages. The previous search returned results in which the words were mentioned several times, but these tended to be discussions about programming languages in general. With this search, however, the presence of the words in the opening sentence (e.g., "Haskell is a standardized pure functional programming language") gave them a much higher score.
It's important to realize that neither one of the metrics shown so far is better in every case. Both of these lists are valid depending on the searcher's intent, and different combinations of weights are required to give the best results for a particular set of documents and applications. You can try experimenting with different weights for the two metrics by changing your weights line to something like this:

weights=[(1.0,self.frequencyscore(rows)),(1.5,self.locationscore(rows))]
Experiment with different weights and queries and see how your results are affected.
Location is a more difficult metric to cheat than word frequency, since page authors can only put one word first in a document and repeating it doesn't make any difference to the results.
When a query contains multiple words, it is often useful to seek results in which the words in the query are close to each other in the page. Most of the time, when people make multiple-word queries, they are interested in a page that conceptually relates the different words. This is a little looser than the quoted-phrase searches supported by most search engines where the words must appear in the correct order with no additional words—in this case, the metric will tolerate a different order and additional words between the query words.
The distancescore function looks pretty similar to locationscore:
def distancescore(self,rows):
  # If there's only one word, everyone wins!
  if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])

  # Initialize the dictionary with large values
  mindistance=dict([(row[0],1000000) for row in rows])

  for row in rows:
    dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
    if dist<mindistance[row[0]]: mindistance[row[0]]=dist
  return self.normalizescores(mindistance,smallIsBetter=1)
The main difference here is that when the function loops through the locations (the line that computes dist), it takes the difference between each location and the previous location. Since every combination of locations is returned by the query, it is guaranteed to find the smallest total distance.
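Again, a hand-run example on invented rows shows the distance calculation:

```python
# Hypothetical rows: (urlid, loc_word1, loc_word2)
rows = [(1, 10, 14), (1, 200, 2), (2, 5, 6)]

mindistance = dict([(row[0], 1000000) for row in rows])
for row in rows:
    # Sum of gaps between consecutive word locations in this row
    dist = sum([abs(row[i] - row[i - 1]) for i in range(2, len(row))])
    if dist < mindistance[row[0]]:
        mindistance[row[0]] = dist
# URL 1's best row has the two words 4 apart; URL 2's words are adjacent (distance 1)
```

With smallIsBetter=1, URL 2 would come out on top after normalization.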
You can try the word distance metric by itself if you like, but
it really works better when combined with other metrics. Try adding
distancescore to the
weights list and changing the numbers to see
how it affects the results of different queries.