Using Inbound Links

The scoring metrics discussed so far have all been based on the content of the page. Although many search engines still work this way, the results can often be improved by considering information that others have provided about the page, specifically, who has linked to the page and what they have said about it. This is particularly useful when indexing pages of dubious value or pages that might have been created by spammers, as these are less likely to be linked to than pages with real content.

The crawler that you built at the beginning of the chapter already captures all the important information about the links, so there's no need to change it. The links table has the URL IDs for the source and target of every link that it has encountered, and the linkwords table connects the words with the links.
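
As a reminder, the crawler's createindextables method created these two tables along roughly these lines (a sketch of the relevant statements from earlier in the chapter; your version may differ slightly):

    # link stores one row per hyperlink, keyed by source and target URL IDs
    self.con.execute('create table link(fromid integer,toid integer)')
    # linkwords records which words appear in the text of which links
    self.con.execute('create table linkwords(wordid,linkid)')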

Simple Count

The easiest thing to do with inbound links is to count them on each page and use the total number of links as a metric for the page. Academic papers are often rated in this way, with their importance tied to the number of other papers that reference them. The scoring function below creates a dictionary of counts by querying the link table for every unique URL ID in rows, and then it returns the normalized scores:

  def inboundlinkscore(self,rows):
    uniqueurls=set([row[0] for row in rows])
    inboundcount=dict([(u,self.con.execute( \
      'select count(*) from link where toid=%d' % u).fetchone(  )[0]) \
        for u in uniqueurls])
    return self.normalizescores(inboundcount)

Obviously, using this metric by itself will simply return all the pages containing the search terms, ranked solely on how many inbound links they have. In the dataset, "Programming language" has many more inbound links than "Python," but you'd rather see "Python" first in the results if that's what you searched for. To combine relevance with ranking, you need to use the inbound-links metric in combination with one of the metrics shown earlier.
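
For example, inside the searcher's getscoredlist method you could weight the simple count together with a content-based metric, something like this (the weights here are only illustrative):

    weights=[(1.0,self.frequencyscore(rows)),
             (1.0,self.inboundlinkscore(rows))]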

This algorithm also weights every inbound link equally, which, while nice and egalitarian, is open to manipulation because someone can easily set up several sites pointing to a page whose score they want to increase. It's also possible that people are more interested in results that have attracted the attention of very popular sites. Next, you'll see how to make links from popular pages worth more in calculating rankings.

The PageRank Algorithm

The PageRank algorithm was invented by the founders of Google, and variations on the idea are now used by all the large search engines. This algorithm assigns every page a score that indicates how important that page is. The importance of the page is calculated from the importance of all the other pages that link to it and from the number of links each of the other pages has.

Tip

In theory, PageRank (named after one of its inventors, Larry Page) calculates the probability that someone randomly clicking on links will arrive at a certain page. The more inbound links the page has from other popular pages, the more likely it is that someone will end up there purely by chance. Of course, if the user keeps clicking forever, they'll eventually reach every page, but most people stop surfing after a while. To capture this, PageRank also uses a damping factor of 0.85, indicating that there is an 85 percent chance that a user will continue clicking on links at each page.
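
Written out generally, the random-surfer idea gives a page A with inbound links from pages T1 through Tn the score:

PR(A) = (1 - d) + d * ( PR(T1)/links(T1) + ... + PR(Tn)/links(Tn) )

where d is the damping factor of 0.85 and links(T) is the number of links on page T. The worked example below simply plugs real numbers into this formula.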

Figure 4-3 shows an example set of pages and links.

Figure 4-3. Calculating the PageRank of A

Pages B, C, and D all link to A, and they already have their PageRanks calculated. B also links to three other pages, and C links to four other pages; D links only to A. To get A's PageRank, take the PageRank (PR) of each page that links to A divided by the total number of links on that page, sum these values, multiply the sum by the damping factor of 0.85, and add the minimum value of 0.15. The calculation for PR(A) is:

PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.14 + 0.2)
      = 0.15 + 0.85 * 0.465
      = 0.54525

You'll notice that D actually contributes more to A's PageRank than either B or C does, even though it has a lower PageRank of its own, because it links exclusively to A and is able to contribute its entire score.
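
To see this concretely, compare each page's damped contribution to the sum:

0.85 * (0.2/1) = 0.17     (D)
0.85 * (0.5/4) = 0.10625  (B)
0.85 * (0.7/5) = 0.119    (C)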

Pretty easy, right? Well, there's a small catch—in this example, all the pages linking to A already had PageRanks. You can't calculate a page's score until you know the scores of all the pages that link there, and you can't calculate their scores without doing the same for all the pages that link to them. How is it possible to calculate PageRanks for a whole set of pages that don't already have PageRanks?

The solution is to set all the PageRanks to an initial arbitrary value (the code will use 1.0, but the actual value doesn't make any difference), and repeat the calculation over several iterations. After each iteration, the PageRank for each page gets closer to its true PageRank value. The number of iterations needed varies with the number of pages, but in the small set you're working with, 20 should be sufficient.

Because the PageRank is time-consuming to calculate and stays the same no matter what the query is, you'll be creating a function that precomputes the PageRank for every URL and stores it in a table. This function will recalculate all the PageRanks every time it is run. Add this function to the crawler class:

  def calculatepagerank(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute('drop table if exists pagerank')
    self.con.execute('create table pagerank(urlid primary key,score)')

    # initialize every url with a PageRank of 1
    self.con.execute('insert into pagerank select rowid, 1.0 from urllist')
    self.dbcommit(  )

    for i in range(iterations):
      print "Iteration %d" % (i)
      for (urlid,) in self.con.execute('select rowid from urllist'):
        pr=0.15

        # Loop through all the pages that link to this one
        for (linker,) in self.con.execute(
        'select distinct fromid from link where toid=%d' % urlid):
          # Get the PageRank of the linker
          linkingpr=self.con.execute(
          'select score from pagerank where urlid=%d' % linker).fetchone(  )[0]

          # Get the total number of links from the linker
          linkingcount=self.con.execute(
          'select count(*) from link where fromid=%d' % linker).fetchone(  )[0]

          pr+=0.85*(linkingpr/linkingcount)
        self.con.execute(
        'update pagerank set score=%f where urlid=%d' % (pr,urlid))
      self.dbcommit(  )

This function initially sets the PageRank of every page to 1.0. It then loops over every URL and, for each inbound link, gets the linking page's PageRank and its total number of outbound links. The line pr+=0.85*(linkingpr/linkingcount) applies the PageRank formula for each of the inbound links.

Running this function will take a few minutes, but you only need to do it when you update the index.

>> reload(searchengine)
>> crawler=searchengine.crawler('searchindex.db')
>> crawler.calculatepagerank(  )
Iteration 0
Iteration 1
...

If you're curious about which pages from the example dataset have the highest PageRanks, you can query the database directly (e below is the searcher instance created earlier in the chapter):

>> cur=crawler.con.execute('select * from pagerank order by score desc')
>> for i in range(3): print cur.next(  )
(438, 2.5285160000000002)
(2, 1.1614640000000001)
(543, 1.064252)
>> e.geturlname(438)
u'http://kiwitobes.com/wiki/Main_Page.html'

"Main Page" has the highest PageRank, which is not surprising since every other page in Wikipedia links to it. Now that you have a table of PageRank scores, using them is just a matter of creating a function to retrieve them from the database and to normalize the scores. Add this method to the searcher class:

  def pagerankscore(self,rows):
    pageranks=dict([(row[0],self.con.execute(
      'select score from pagerank where urlid=%d' % row[0]).fetchone(  )[0])
      for row in rows])
    maxrank=max(pageranks.values(  ))
    normalizedscores=dict([(u,float(l)/maxrank) for (u,l) in pageranks.items(  )])
    return normalizedscores

Once again, you should modify the weights list to include PageRank. For example, try:

    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.pagerankscore(rows))]

The results for your searches will take into account content and ranking scores. The results for "Functional programming" now look even better:

2.318146 http://kiwitobes.com/wiki/Functional_programming.html
1.074506 http://kiwitobes.com/wiki/Programming_language.html
0.517633 http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html
0.439568 http://kiwitobes.com/wiki/Programming_paradigm.html
0.426817 http://kiwitobes.com/wiki/Lisp_programming_language.html

The value of the PageRank score is a little harder to see with this closed, tightly controlled set of documents, which likely contains fewer useless pages or pages intended solely to get attention than you'd find on the Web. However, even in this case, it's clear that PageRank is a useful metric for returning higher-level, more general pages.

Using the Link Text

Another very powerful way to rank searches is to use the text of the links to a page to decide how relevant the page is. Many times you will get better information from what the links to a page say about it than from the linking page itself, as site developers tend to include a short description of whatever it is they are linking to.

The method for scoring the pages by their link text takes an additional argument, which is the list of word IDs produced when you perform a query. You can add this method to searcher:

  def linktextscore(self,rows,wordids):
    linkscores=dict([(row[0],0) for row in rows])
    for wordid in wordids:
      cur=self.con.execute(
        'select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
      for (fromid,toid) in cur:
        if toid in linkscores:
          pr=self.con.execute(
            'select score from pagerank where urlid=%d' % fromid).fetchone(  )[0]
          linkscores[toid]+=pr
    maxscore=max(linkscores.values(  ))
    normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items(  )])
    return normalizedscores

This code loops through all the words in wordids looking for links containing these words. If the target of the link matches one of the search results, then the PageRank of the source of the link is added to the destination page's final score. A page with a lot of links from important pages that contain the query terms will get a very high score. Many of the pages in the results will have no links with the correct text, and will get a score of 0.
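
One caveat with the listing above: if none of the links for any of the query words point at pages in the results, every value in linkscores stays at 0, so maxscore is 0 and the normalization step divides by zero. A small guard (not shown in the listing above) avoids this while still giving those pages a score of 0:

    # guard against dividing by zero when no link text matched (not in the original listing)
    maxscore=max(linkscores.values(  )) or 0.00001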

To enable link-text ranking, just add the following anywhere in your weights list:

(1.0,self.linktextscore(rows,wordids))

There is no standard set of weightings for these metrics that will work in all cases. Even the major search sites frequently change their methods of ranking results. The metrics you'll use and the weights you'll give them are highly dependent on the application you're trying to build.
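
As a concrete starting point for experimenting, you could give every metric from this chapter equal weight and adjust from there (these weights are purely illustrative):

    weights=[(1.0,self.frequencyscore(rows)),
             (1.0,self.locationscore(rows)),
             (1.0,self.inboundlinkscore(rows)),
             (1.0,self.pagerankscore(rows)),
             (1.0,self.linktextscore(rows,wordids))]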
