Learning from Clicks

One of the major advantages of online applications is that they receive constant feedback in the form of user behavior. In the case of a search engine, each user will immediately provide information about how much he likes the results for a given search by clicking on one result and choosing not to click on the others. This section will look at a way to record when a user clicks on a result after a query, and how that record can be used to improve the rankings of the results.

To do this, you're going to build an artificial neural network that you'll train by giving it the words in the query, the search results presented to the user, and what the user decided to click. Once the network has been trained with many different queries, you can use it to change the ordering of the search results to better reflect what users actually clicked on in the past.

Design of a Click-Tracking Network

While there are many different kinds of neural networks, they all consist of a set of nodes (the neurons) and connections between them. The network you'll learn how to build is called a multilayer perceptron (MLP) network. This type of network consists of multiple layers of neurons, the first of which takes the input—in this case, the words entered by the user. The last layer gives the output, which in this example is a list of weightings for the different URLs that were returned.

There can be multiple middle layers, but the network in this example will just use a single one. This is called the hidden layer because the outside world never interacts with it directly, and it responds to combinations of inputs. In this case, a combination of inputs is a set of words, so you could also think of this as the query layer. Figure 4-4 shows the structure of the network. All the nodes in the input layer are connected to all the nodes in the hidden layer, and all the nodes in the hidden layer are connected to all the nodes in the output layer.

Figure 4-4. Design of a click-tracking neural network

To ask the neural network to get the best results for a query, the input nodes for the words in that query have their values set to 1. The outputs of those nodes are turned on and they attempt to activate the hidden layer. In turn, the nodes in the hidden layer that get a strong enough input will turn on their outputs and try to activate nodes in the output layer.

The nodes in the output layer then become active to various degrees, and their activity level can be used to determine how strongly a URL is associated with the words in the original query. Figure 4-5 shows a query for "world bank." The solid lines indicate strong connections, and the bold text indicates that a node has become very active.

Figure 4-5. Neural network response to "world bank"

This, of course, depends on the connection strengths being correct. This is achieved by training the network every time someone performs a search and chooses one of the links out of the results. In the network pictured in Figure 4-5, a number of people had previously clicked the World Bank result after a search for "world bank," and this strengthened the associations between the words and the URL. This section will show you how the network is trained with an algorithm called backpropagation.

You might be wondering why you would need a sophisticated technique like a neural network instead of just remembering the query and counting how many times each result was clicked. The power of the neural network you're going to build is that it can make reasonable guesses about results for queries it has never seen before, based on their similarity to other queries. Also, neural networks are useful for a wide variety of applications and will be a great addition to your collective intelligence toolbox.

Setting Up the Database

Since the neural network will have to be trained over time as users perform queries, you'll need to store a representation of the network in the database. The database already has a table of words and URLs, so all that's needed is a table for the hidden layer (which will be called hiddennode) and two tables of connections (one from the word layer to the hidden layer, and one that links the hidden layer to the output layer).

Create a new file called, and create a new class in it called searchnet:

from math import tanh
from pysqlite2 import dbapi2 as sqlite

class searchnet:
    def __init__(self,dbname):
      self.con=sqlite.connect(dbname)

    def __del__(self):
      self.con.close()

    def maketables(self):
      self.con.execute('create table hiddennode(create_key)')
      self.con.execute('create table wordhidden(fromid,toid,strength)')
      self.con.execute('create table hiddenurl(fromid,toid,strength)')
      self.con.commit()

The tables currently have no indices, but you can add them later if speed is an issue.
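If you do, statements along these lines would work (just a sketch; the index names are arbitrary):

      self.con.execute('create index wordhiddenidx on wordhidden(fromid)')
      self.con.execute('create index hiddenurlidx on hiddenurl(toid)')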

You'll need to create a couple of methods to access the database. The first method, called getstrength, determines the current strength of a connection. Because new connections are only created when necessary, this method has to return a default value if there are no connections. For links from words to the hidden layer, the default value will be −0.2 so that, by default, extra words will have a slightly negative effect on the activation level of a hidden node. For links from the hidden layer to URLs, the method will return a default value of 0.

    def getstrength(self,fromid,toid,layer):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select strength from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
          if layer==0: return -0.2
          if layer==1: return 0
      return res[0]

You'll also need a setstrength method to determine if a connection already exists, and to update or create the connection with the new strength. This will be used by the code that trains the network:

    def setstrength(self,fromid,toid,layer,strength):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select rowid from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
        self.con.execute('insert into %s (fromid,toid,strength) values (%d,%d,%f)' % (table,fromid,toid,strength))
      else:
        rowid=res[0]
        self.con.execute('update %s set strength=%f where rowid=%d' % (table,strength,rowid))

Most of the time, when building a neural network, all the nodes in the network are created in advance. You could create a huge network up front with thousands of nodes in the hidden layer and all the connections already in place, but in this case, it will be faster and simpler to create new hidden nodes as they are needed.

This function will create a new node in the hidden layer every time it is passed a combination of words that it has never seen together before. The function then creates default-weighted links between the words and the hidden node, and between the query node and the URL results returned by this query.

    def generatehiddennode(self,wordids,urls):
      if len(wordids)>3: return None
      # Check if we already created a node for this set of words
      createkey='_'.join(sorted([str(wi) for wi in wordids]))
      res=self.con.execute(
      "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()

      # If not, create it
      if res==None:
        cur=self.con.execute(
        "insert into hiddennode (create_key) values ('%s')" % createkey)
        hiddenid=cur.lastrowid
        # Put in some default weights
        for wordid in wordids:
          self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
        for urlid in urls:
          self.setstrength(hiddenid,urlid,1,0.1)
        self.con.commit()

In the Python interpreter, try creating a database and generating a hidden node with some example word and URL IDs:

>> import nn
>> mynet=nn.searchnet('nn.db')
>> mynet.maketables()
>> wWorld,wRiver,wBank=101,102,103
>> uWorldBank,uRiver,uEarth=201,202,203
>> mynet.generatehiddennode([wWorld,wBank],[uWorldBank,uRiver,uEarth])
>> for c in mynet.con.execute('select * from wordhidden'): print c
(101, 1, 0.5)
(103, 1, 0.5)
>> for c in mynet.con.execute('select * from hiddenurl'): print c
(1, 201, 0.1)
(1, 202, 0.1)
(1, 203, 0.1)

A new node has been created in the hidden layer, and links to the new node have been created with default values. The network will initially respond whenever "world" and "bank" are entered together, but these connections may weaken over time.
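If you're curious, you can also inspect the hiddennode table to see the key under which this combination of words was stored (the exact string formatting shown depends on your pysqlite version):

>> for c in mynet.con.execute('select rowid,create_key from hiddennode'): print c
(1, u'101_103')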

Feeding Forward

You're now ready to make functions that will take the words as inputs, activate the links in the network, and give a set of outputs for the URLs.

First, choose a function that indicates how much each node should respond to its input. The neural network described here will use the hyperbolic tangent (tanh) function, shown in Figure 4-6.

Figure 4-6. The tanh function

The x-axis is the total input to the node. As the input approaches 0, the output starts to climb quickly. With an input of 2, the output is almost at 1 and doesn't get much higher. This is a type of sigmoid function, all of which share this S shape. Neural networks almost always use sigmoid functions to calculate the outputs of the neurons.
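You can see the saturation for yourself by evaluating tanh at a few points in the interpreter:

>> from math import tanh
>> [round(tanh(x),3) for x in [0, 0.5, 1, 2, 3]]
[0.0, 0.462, 0.762, 0.964, 0.995]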

Before running the feedforward algorithm, the class will have to query the nodes and connections in the database, and build, in memory, the portion of the network that is relevant to a specific query. The first step is to create a function that finds all the nodes from the hidden layer that are relevant to a specific query—in this case, they must be connected to one of the words in the query or to one of the URLs in the results. Since the other nodes will not be used either to determine an outcome or to train the network, it's not necessary to include them:

    def getallhiddenids(self,wordids,urlids):
      l1={}
      for wordid in wordids:
        cur=self.con.execute(
        'select toid from wordhidden where fromid=%d' % wordid)
        for row in cur: l1[row[0]]=1
      for urlid in urlids:
        cur=self.con.execute(
        'select fromid from hiddenurl where toid=%d' % urlid)
        for row in cur: l1[row[0]]=1
      return l1.keys()

You will also need a method for constructing the relevant network with all the current weights from the database. This function sets a lot of instance variables for this class—the list of words, query nodes and URLs, the output level of every node, and the weights of every link between nodes. The weights are taken from the database using the functions that were defined earlier.

    def setupnetwork(self,wordids,urlids):
        # value lists
        self.wordids=wordids
        self.hiddenids=self.getallhiddenids(wordids,urlids)
        self.urlids=urlids

        # node outputs = [1.0]*len(self.wordids)
        self.ah = [1.0]*len(self.hiddenids) = [1.0]*len(self.urlids)

        # create weights matrix
        self.wi = [[self.getstrength(wordid,hiddenid,0)
                    for hiddenid in self.hiddenids]
                   for wordid in self.wordids]
        self.wo = [[self.getstrength(hiddenid,urlid,1)
                    for urlid in self.urlids]
                   for hiddenid in self.hiddenids]

You're finally ready to create the feedforward algorithm. This takes a list of inputs, pushes them through the network, and returns the output of all the nodes in the output layer. In this case, since you've only constructed a network with words in the query, the output from all the input nodes will always be 1:

    def feedforward(self):
        # the only inputs are the query words
        for i in range(len(self.wordids)):
  [i] = 1.0

        # hidden activations
        for j in range(len(self.hiddenids)):
            sum = 0.0
            for i in range(len(self.wordids)):
                sum = sum +[i] * self.wi[i][j]
            self.ah[j] = tanh(sum)

        # output activations
        for k in range(len(self.urlids)):
            sum = 0.0
            for j in range(len(self.hiddenids)):
                sum = sum + self.ah[j] * self.wo[j][k]
  [k] = tanh(sum)

        return[:]


The feedforward algorithm works by looping over all the nodes in the hidden layer and adding together all the outputs from the input layer multiplied by the strengths of the links. The output of each node is the tanh function of the sum of all the inputs, which is passed on to the output layer. The output layer does the same thing, multiplying the outputs of the previous layer by their strengths, and applies the tanh function to produce the final output. It is easy to extend the network to have more layers by continually using the output of one layer as the input to the next layer.
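As a minimal sketch of that multilayer extension (this standalone function is not part of the nn module, and layers is assumed to be a list of weight matrices in the same [from][to] orientation as wi and wo):

from math import tanh

def feedforward_layers(inputs,layers):
    # Push the values through each layer of weights in turn, applying
    # tanh at every node; the final list is the output layer
    values=inputs
    for weights in layers:
        values=[tanh(sum([values[i]*weights[i][j] for i in range(len(values))]))
                for j in range(len(weights[0]))]
    return values

Calling feedforward_layers([1.0]*len(self.wordids),[self.wi,self.wo]) on a network that has been set up would reproduce what feedforward computes.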

Now you can write a short function that will set up the network and use feedforward to get the outputs for a set of words and URLs:

    def getresult(self,wordids,urlids):
      self.setupnetwork(wordids,urlids)
      return self.feedforward()

You can now try this in your Python session:

>> mynet=nn.searchnet('nn.db')
>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])
[0.076, 0.076, 0.076]

The numbers in the returned list correspond to the relevance of the input URLs. Not surprisingly, because it hasn't yet had any training, the neural network gives the same answer for every URL.

Training with Backpropagation

Here's where things get interesting. The network will take inputs and give outputs, but because it hasn't been taught what a good result looks like, the results are pretty useless. You're now going to train the network by showing it some actual examples of what people searched for, which results were returned, and what the users decided to click on.

For this to work, you need an algorithm that alters the weights of the links between the nodes to better reflect what the network is being told is the right answer. The weights have to be adjusted slowly because you can't assume that each user will click on an answer that's appropriate for everyone. The algorithm you'll use is called backpropagation because it moves backward through the network, adjusting the weights as it goes.

When training a network, you always know the desired output of each node in the output layer. In this case, it should be pushed toward 1 if the user clicked on that result, and pushed toward 0 if he did not. The only way to change the output of a node is to change the total input to that node.
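As a small illustration (urlids and clickedurl here are placeholders for whatever your application passes in), the target values for three results where the user clicked the second one would be built like this; the trainquery method later in this section constructs its targets the same way:

targets=[0.0]*len(urlids)
targets[urlids.index(clickedurl)]=1.0
# targets is now [0.0, 1.0, 0.0]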

To determine how much the total input should be changed, the training algorithm has to know the slope of the tanh function at its current level of output. In the middle of the function, when the output is 0.0, the slope is very steep, so changing the input by only a small amount gives a big change. As the outputs get closer to -1 or 1, changing the input has a smaller effect on the output. The slope of the function for any output value is given by this function, which you can add to the start of

def dtanh(y):
    return 1.0-y*y
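Note that dtanh takes a node's current output y rather than its total input; this works because the slope of tanh(x) is exactly 1-tanh(x)*tanh(x). Here is a quick standalone sanity check against a numerical estimate of the slope (this snippet is just a test, not part of the nn module):

from math import tanh

def dtanh(y):
    return 1.0-y*y

h=1e-6
for x in [0.0,0.5,2.0]:
    numeric=(tanh(x+h)-tanh(x-h))/(2*h)   # two-sided numerical slope
    print x,round(dtanh(tanh(x)),6),round(numeric,6)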

Before running the backpropagation method, it's necessary to run feedforward so that the current output of every node will be stored in the instance variables. The backpropagation algorithm then performs the following steps.

For each node in the output layer:

  1. Calculate the difference between the node's current output and what it should be.

  2. Use the dtanh function to determine how much the node's total input has to change.

  3. Change the strength of every incoming link in proportion to the link's current strength and the learning rate.

For each node in the hidden layer:

  1. Change the output of the node by the sum of the strength of each output link multiplied by how much its target node has to change.

  2. Use the dtanh function to determine how much the node's total input has to change.

  3. Change the strength of every input link in proportion to the link's current strength and the learning rate.

The implementation of this algorithm actually calculates all the errors in advance and then adjusts the weights, because all the calculations rely on knowing the current weights rather than the updated weights. Here's the code for the algorithm, which you can add to the searchnet class:

    def backPropagate(self, targets, N=0.5):
        # calculate errors for output
        output_deltas = [0.0] * len(self.urlids)
        for k in range(len(self.urlids)):
            error = targets[k][k]
            output_deltas[k] = dtanh([k]) * error

        # calculate errors for hidden layer
        hidden_deltas = [0.0] * len(self.hiddenids)
        for j in range(len(self.hiddenids)):
            error = 0.0
            for k in range(len(self.urlids)):
                error = error + output_deltas[k]*self.wo[j][k]
            hidden_deltas[j] = dtanh(self.ah[j]) * error
        # update output weights
        for j in range(len(self.hiddenids)):
            for k in range(len(self.urlids)):
                change = output_deltas[k]*self.ah[j]
                self.wo[j][k] = self.wo[j][k] + N*change

        # update input weights
        for i in range(len(self.wordids)):
            for j in range(len(self.hiddenids)):
                change = hidden_deltas[j][i]
                self.wi[i][j] = self.wi[i][j] + N*change

Now all you need is a simple method that will set up the network, run feedforward, and run the backpropagation. This method takes the list of wordids, urlids, and a selected URL:

    def trainquery(self,wordids,urlids,selectedurl):
      # generate a hidden node if necessary
      self.generatehiddennode(wordids,urlids)

      self.setupnetwork(wordids,urlids)
      self.feedforward()
      targets=[0.0]*len(urlids)
      targets[urlids.index(selectedurl)]=1.0
      self.backPropagate(targets)
      self.updatedatabase()

To save the results, you'll also need a method to update the database with the new weights, which are stored in the wi and wo instance variables:

    def updatedatabase(self):
      # set them to database values
      for i in range(len(self.wordids)):
          for j in range(len(self.hiddenids)):
              self.setstrength(self.wordids[i],self.hiddenids[j],0,self.wi[i][j])
      for j in range(len(self.hiddenids)):
          for k in range(len(self.urlids)):
              self.setstrength(self.hiddenids[j],self.urlids[k],1,self.wo[j][k])
      self.con.commit()

Now you can do a simple test with the query you tried earlier to see how the network responds to training:

>> mynet=nn.searchnet('nn.db')
>> mynet.trainquery([wWorld,wBank],[uWorldBank,uRiver,uEarth],uWorldBank)
>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])
[0.335, 0.055, 0.055]

The output for the World Bank URL increased and the output for the other URLs decreased after the network learned that a particular user made that selection. The more users make this selection, the bigger the difference will get.

Training Test

So far you've seen that training with one sample result increases the output for that result. Although that's useful, it doesn't really demonstrate what neural networks are capable of—that is, reasoning about inputs they've never seen before. Try this code in your interactive Python session:

>> allurls=[uWorldBank,uRiver,uEarth]
>> for i in range(30):
...     mynet.trainquery([wWorld,wBank],allurls,uWorldBank)
...     mynet.trainquery([wRiver,wBank],allurls,uRiver)
...     mynet.trainquery([wWorld],allurls,uEarth)
>> mynet.getresult([wWorld,wBank],allurls)
[0.861, 0.011, 0.016]
>> mynet.getresult([wRiver,wBank],allurls)
[-0.030, 0.883, 0.006]
>> mynet.getresult([wBank],allurls)
[0.865, 0.001, -0.85]

Even though the network has never seen a query for "bank" by itself before, it gives a reasonable guess. Not only that, it gives the World Bank URL a much better score than the River URL, even though in the training sample queries "bank" was associated just as often with "river" as it was with World Bank. The network has not only learned which URLs are related to which queries, it has also learned what the important words are in a particular query—something that could not have been achieved with a simple query-URL correlation.

Connecting to the Search Engine

The query method of the searcher class gets a list of URL IDs and word IDs in the course of creating and printing the results. You can have the method return these results by adding the following line to the end of the query method in

    return wordids,[r[1] for r in rankedscores[0:10]]

These can be passed directly to the trainquery method of searchnet.

The method for capturing which of the results the user liked best is specific to the design of your application. A web page could include an intermediate page that captures the click and calls trainquery before redirecting to the actual search result, or you could even let users vote on the relevance of search results to help improve your algorithm.
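One possible shape for such an intermediate page, as a minimal CGI sketch; the parameter names used here (q, urlids, clicked, dest) and the comma-separated ID encoding are illustrative assumptions, not part of this chapter's code:

#!/usr/bin/env python
# Hypothetical click-capture page: trains the network, then redirects.
# The q, urlids, clicked, and dest parameters are made up for this sketch.
import cgi
import nn

form=cgi.FieldStorage()
wordids=[int(w) for w in form.getfirst('q','').split(',') if w]
urlids=[int(u) for u in form.getfirst('urlids','').split(',') if u]
clicked=int(form.getfirst('clicked','0'))

mynet=nn.searchnet('nn.db')
mynet.trainquery(wordids,urlids,clicked)

# Send the browser on to the result the user actually clicked
print 'Status: 302 Found'
print 'Location: %s' % form.getfirst('dest','/')
print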

The final step in building the artificial neural network is creating a new method in the searcher class to allow you to weight the results. This function looks pretty similar to the other weighting functions. The first thing you'll need to do is import the neural network class in and create a searchnet instance:

import nn
mynet=nn.searchnet('nn.db')

And add this method to the searcher class:

  def nnscore(self,rows,wordids):
    # Get unique URL IDs as an ordered list
    urlids=[urlid for urlid in set([row[0] for row in rows])]
    nnres=mynet.getresult(wordids,urlids)
    scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
    return self.normalizescores(scores)

Again, you can experiment by including this in your weights list with various weights. In practice, it's better to hold off on including it as part of your scoring until the network has been trained on a large number of different examples.
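For instance, assuming the weights list and the scoring functions defined earlier in the chapter (frequencyscore, locationscore, and so on), a trained network might be given a modest weight like this; the relative weights shown are only a starting point to experiment with:

    weights=[(1.0,self.frequencyscore(rows)),
             (1.0,self.locationscore(rows)),
             (1.0,self.pagerankscore(rows)),
             (1.0,self.linktextscore(rows)),
             (2.0,self.nnscore(rows,wordids))]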

This chapter has covered a wide range of possibilities for developing a search engine, but it's still very limited compared to what's possible. The exercises will explore some further ideas. The chapter has not focused on performance, which would require more work to index millions of pages, but what you've built will perform adequately on a set of up to 100,000 pages, enough for a news site or corporate intranet.
