The scoring metrics discussed so far have all been based on the content of the page. Although many search engines still work this way, the results can often be improved by considering information that others have provided about the page, specifically, who has linked to the page and what they have said about it. This is particularly useful when indexing pages of dubious value or pages that might have been created by spammers, as these are less likely to be linked than pages with real content.
The crawler that you built at the beginning of the chapter already captures all the important information about the links, so there's no need to change it. The link table has the URL IDs for the source and target of every link that it has encountered, and the linkwords table connects the words with the links.
The easiest thing to do with inbound links is to count them on each page and use the total number of links as a metric for the page. Academic papers are often rated in this way, with their importance tied to the number of other papers that reference them. The scoring function below creates a dictionary of counts by querying the link table for every unique URL ID in rows, and then it returns the normalized scores:
    def inboundlinkscore(self,rows):
        uniqueurls=set([row[0] for row in rows])
        inboundcount=dict([(u,self.con.execute(
            'select count(*) from link where toid=%d' % u).fetchone()[0])
            for u in uniqueurls])
        return self.normalizescores(inboundcount)
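This relies on the normalizescores method defined earlier in the chapter, which scales every score dictionary into the range 0 to 1 so that different metrics can be combined. For reference, a minimal version might look like the following sketch (the exact implementation may differ; the smallIsBetter flag is for metrics such as word location, where a lower value is better):

```python
def normalizescores(self, scores, smallIsBetter=0):
    vsmall = 0.00001  # avoid division by zero

    if smallIsBetter:
        # Lower raw scores map to higher normalized scores
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, l))
                     for (u, l) in scores.items()])
    else:
        # Higher raw scores map to higher normalized scores
        maxscore = max(scores.values())
        if maxscore == 0: maxscore = vsmall
        return dict([(u, float(c) / maxscore)
                     for (u, c) in scores.items()])
```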
Obviously, using this metric by itself will simply return all the pages containing the search terms, ranked solely on how many inbound links they have. In the dataset, "Programming language" has many more inbound links than "Python," but you'd rather see "Python" first in the results if that's what you searched for. To combine relevance with ranking, you need to use the inbound-links metric in combination with one of the metrics shown earlier.
This algorithm also weights every inbound link equally, which, while nice and egalitarian, is open to manipulation because someone can easily set up several sites pointing to a page whose score they want to increase. It's also possible that people are more interested in results that have attracted the attention of very popular sites. Next, you'll see how to make links from popular pages worth more in calculating rankings.
The PageRank algorithm was invented by the founders of Google, and variations on the idea are now used by all the large search engines. This algorithm assigns every page a score that indicates how important that page is. The importance of the page is calculated from the importance of all the other pages that link to it and from the number of links each of the other pages has.
In theory, PageRank (named after one of its inventors, Larry Page) calculates the probability that someone randomly clicking on links will arrive at a certain page. The more inbound links the page has from other popular pages, the more likely it is that someone will end up there purely by chance. Of course, if the user keeps clicking forever, they'll eventually reach every page, but most people stop surfing after a while. To capture this, PageRank also uses a damping factor of 0.85, indicating that there is an 85 percent chance that a user will continue clicking on links at each page.
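The random-surfer model can be illustrated directly. The sketch below (a hypothetical helper, not part of the search engine code) simulates a surfer who follows a random outbound link with probability 0.85 and otherwise jumps to a random page; the fraction of visits each page receives approximates its relative PageRank:

```python
import random

def random_surfer(links, steps=100000, damping=0.85, seed=1):
    # links maps each page to a list of the pages it links to
    rng = random.Random(seed)
    pages = list(links)
    visits = dict((p, 0) for p in pages)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        outlinks = links[page]
        # With probability `damping`, follow a random outbound link;
        # otherwise (or at a dead end), jump to a random page
        if outlinks and rng.random() < damping:
            page = rng.choice(outlinks)
        else:
            page = rng.choice(pages)
    # Fraction of all steps spent on each page
    return dict((p, v / float(steps)) for p, v in visits.items())

# A tiny made-up graph: B, C, and D all link to A; A links back to B
freqs = random_surfer({'A': ['B'], 'B': ['A'], 'C': ['A'], 'D': ['A']})
```

As you would expect, A is visited far more often than C or D, which have no inbound links at all and are reached only by random jumps.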
Figure 4-3 shows an example set of pages and links.
Pages B, C, and D all link to A, and they already have their
PageRanks calculated. B also links to three other pages and C links to
four other pages. D only links to A. To get A's PageRank, take the
PageRank (PR) of each of the pages that links to A divided by the
total number of links on that page, then multiply this by a damping
factor of 0.85, and add a minimum value of 0.15. The calculation for
PR(A) is:

    PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
          = 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
          = 0.15 + 0.85 * ( 0.125 + 0.14 + 0.2 )
          = 0.15 + 0.85 * 0.465
          = 0.54525
You'll notice that D actually contributes more to A's PageRank than either B or C does, even though it has a lower PageRank of its own, because it links exclusively to A and is able to contribute its entire score.
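You can verify the arithmetic for this example directly:

```python
# PageRanks and outbound-link counts for the pages in the example
pr = {'B': 0.5, 'C': 0.7, 'D': 0.2}
links = {'B': 4, 'C': 5, 'D': 1}

# PR(A) = 0.15 + 0.85 * sum of PR(p)/links(p) over the linking pages
pr_a = 0.15 + 0.85 * sum(pr[p] / links[p] for p in pr)
print(pr_a)  # 0.54525, within floating-point rounding
```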
Pretty easy, right? Well, there's a small catch—in this example, all the pages linking to A already had PageRanks. You can't calculate a page's score until you know the scores of all the pages that link there, and you can't calculate their scores without doing the same for all the pages that link to them. How is it possible to calculate PageRanks for a whole set of pages that don't already have PageRanks?
The solution is to set all the PageRanks to an initial arbitrary value (the code will use 1.0, but the actual value doesn't make any difference), and repeat the calculation over several iterations. After each iteration, the PageRank for each page gets closer to its true PageRank value. The number of iterations needed varies with the number of pages, but in the small set you're working with, 20 should be sufficient.
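The iterative process can be sketched in plain Python, independent of the database. The pagerank function and the example graph below are illustrative only, not part of the crawler code:

```python
def pagerank(links, iterations=20, damping=0.85):
    # links maps each page to a list of the pages it links to
    ranks = dict((page, 1.0) for page in links)  # arbitrary starting value
    for _ in range(iterations):
        newranks = {}
        for page in links:
            pr = 0.15
            # Add a contribution from every page that links to this one
            for other, outlinks in links.items():
                if page in outlinks:
                    pr += damping * ranks[other] / len(outlinks)
            newranks[page] = pr
        ranks = newranks
    return ranks

# A made-up graph: C gets links from A, B, and D; A gets C's full score
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
ranks = pagerank(graph)
```

After 20 iterations the scores have essentially converged: C, with three inbound links, scores highest; D, with none, stays at the minimum of 0.15.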
Because the PageRank is time-consuming to calculate and stays the same no matter what the query is, you'll be creating a function that precomputes the PageRank for every URL and stores it in a table. This function will recalculate all the PageRanks every time it is run. Add this function to the crawler class:
    def calculatepagerank(self,iterations=20):
        # clear out the current PageRank tables
        self.con.execute('drop table if exists pagerank')
        self.con.execute('create table pagerank(urlid primary key,score)')

        # initialize every url with a PageRank of 1
        self.con.execute('insert into pagerank select rowid,1.0 from urllist')
        self.dbcommit()

        for i in range(iterations):
            print "Iteration %d" % (i)
            for (urlid,) in self.con.execute('select rowid from urllist'):
                pr=0.15

                # Loop through all the pages that link to this one
                for (linker,) in self.con.execute(
                    'select distinct fromid from link where toid=%d' % urlid):
                    # Get the PageRank of the linker
                    linkingpr=self.con.execute(
                        'select score from pagerank where urlid=%d' % linker).fetchone()[0]

                    # Get the total number of links from the linker
                    linkingcount=self.con.execute(
                        'select count(*) from link where fromid=%d' % linker).fetchone()[0]
                    pr+=0.85*(linkingpr/linkingcount)
                self.con.execute(
                    'update pagerank set score=%f where urlid=%d' % (pr,urlid))
            self.dbcommit()
This function initially sets the PageRank of every page to 1.0. It then loops over every URL and gets the PageRank and the total number of links for every inbound link. The line pr+=0.85*(linkingpr/linkingcount) applies the formula for each of the inbound links.
Running this function will take a few minutes, but you only need to do it when you update the index.
    >> crawler.calculatepagerank()
    Iteration 0
    Iteration 1
    ...
If you're curious about which pages from the example dataset have the highest PageRanks, you can query the database directly:
    >> cur=crawler.con.execute('select * from pagerank order by score desc')
    >> for i in range(3): print cur.next()
    (438, 2.5285160000000002)
    (2, 1.1614640000000001)
    (543, 1.064252)
    >> e.geturlname(438)
    u'http://kiwitobes.com/wiki/Main_Page.html'
"Main Page" has the highest PageRank, which is not surprising since every other page in Wikipedia links to it. Now that you have a table of PageRank scores, using them is just a matter of creating a function to retrieve them from the database and to normalize the scores. Add this method to the searcher class:
    def pagerankscore(self,rows):
        pageranks=dict([(row[0],self.con.execute(
            'select score from pagerank where urlid=%d' % row[0]).fetchone()[0])
            for row in rows])
        maxrank=max(pageranks.values())
        normalizedscores=dict([(u,float(l)/maxrank)
            for (u,l) in pageranks.items()])
        return normalizedscores
Once again, you should modify the weights list to include PageRank. For example:

    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.pagerankscore(rows))]
The results for your searches will take into account content and ranking scores. The results for "Functional programming" now look even better:
    2.318146    http://kiwitobes.com/wiki/Functional_programming.html
    1.074506    http://kiwitobes.com/wiki/Programming_language.html
    0.517633    http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html
    0.439568    http://kiwitobes.com/wiki/Programming_paradigm.html
    0.426817    http://kiwitobes.com/wiki/Lisp_programming_language.html
The value of the PageRank score is a little harder to see with this closed, tightly controlled set of documents, which likely contains fewer useless pages or pages intended solely to get attention than you'd find on the Web. However, even in this case, it's clear that PageRank is a useful metric for returning higher-level, more general pages.
Another very powerful way to rank searches is to use the text of the links to a page to decide how relevant the page is. Many times you will get better information from what the links to a page say about it than from the linking page itself, as site developers tend to include a short description of whatever it is they are linking to.
The method for scoring the pages by their link text takes an additional argument, which is the list of word IDs produced when you perform a query. You can add this method to the searcher class:
    def linktextscore(self,rows,wordids):
        linkscores=dict([(row[0],0) for row in rows])
        for wordid in wordids:
            cur=self.con.execute(
                'select link.fromid,link.toid from linkwords,link '
                'where wordid=%d and linkwords.linkid=link.rowid' % wordid)
            for (fromid,toid) in cur:
                if toid in linkscores:
                    pr=self.con.execute(
                        'select score from pagerank where urlid=%d' % fromid).fetchone()[0]
                    linkscores[toid]+=pr
        maxscore=max(linkscores.values())
        if maxscore==0: maxscore=0.00001  # avoid dividing by zero when no links match
        normalizedscores=dict([(u,float(l)/maxscore)
            for (u,l) in linkscores.items()])
        return normalizedscores
This code loops through all the words in
wordids looking for links containing these
words. If the target of the link matches one of the search results,
then the PageRank of the source of the link is added to the
destination page's final score. A page with a lot of links from
important pages that contain the query terms will get a very high
score. Many of the pages in the results will have no links with the
correct text, and will get a score of 0.
To enable link-text ranking, just add the following anywhere in your weights list:

    (1.0,self.linktextscore(rows,wordids))
There is no standard set of weightings for these metrics that will work in all cases. Even the major search sites frequently change their methods of ranking results. The metrics you'll use and the weights you'll give them are highly dependent on the application you're trying to build.
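Whatever weights you choose, combining the metrics always reduces to a weighted sum of the normalized score dictionaries. The standalone sketch below illustrates the calculation (combine_scores is a hypothetical helper name, not part of the searcher class):

```python
def combine_scores(weights):
    # weights is a list of (weight, scores) pairs, where each scores value
    # is a dict mapping urlid -> normalized score in the range 0 to 1
    totals = {}
    for weight, scores in weights:
        for urlid, score in scores.items():
            totals[urlid] = totals.get(urlid, 0.0) + weight * score
    return totals

combined = combine_scores([
    (1.0, {1: 1.0, 2: 0.5}),   # e.g., a content-based metric
    (2.0, {1: 0.2, 2: 1.0}),   # e.g., PageRank, weighted more heavily
])
```

Doubling the weight on one metric doubles its influence on the final ordering, which is why tuning these values is so application-dependent.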