
[Part two of Relevance Beyond Words covers the technical details of incorporating popularity into our search ranking.]

The three main steps we take to implement popularity-boosting are:

  1. Gather the popularity data
  2. Insert or update popularity in the search index
  3. Boost queries using popularity

Gathering data

We won’t go into the actual process for harvesting popularity data, since it’s likely to be quite different for you than it was for us, but do keep in mind that this can be a significant part of the implementation. Suffice it to say that we collect usage data from log files, store it in aggregate form, with timestamps, in a SQL database, and run a daily import job that indexes new popularity data.
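Your pipeline will differ, but as a rough sketch (file layout, table, and column names here are all hypothetical), a daily aggregation job might look something like this:

```python
import sqlite3
from collections import Counter

# A minimal sketch: count one day's click events per document from a log file
# and store them, timestamped, in a SQL table for the daily import job to read.
def aggregate_daily_usage(log_path, db_path, day):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            if not line.strip():
                continue
            doc_id = line.split()[0]  # assume the document id leads each event line
            counts[doc_id] += 1

    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS usage"
               " (doc_id TEXT, day TEXT, clicks INTEGER)")
    db.executemany("INSERT INTO usage VALUES (?, ?, ?)",
                   [(doc_id, day, n) for doc_id, n in counts.items()])
    db.commit()
    db.close()
```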

Scaling and aging

Raw usage data arrives as counts of clicks or other events, and these absolute numbers are not meaningful in themselves: we care only about relative usage, since popular documents are simply those with more usage than others. If you want to incorporate popularity as part of the overall relevance score, it’s important to be able to control its effect relative to the other components of the score, and that requires knowing the range of values it can assume. Ideally, to keep the math of your relevance formula meaningful, develop a popularity metric that ranges from 0 to 1 in a controlled fashion as usage increases.

There are any number of approaches to scaling (or normalizing) usage statistics, but we think the best is to divide each document’s usage by the maximum usage for any document over the time period covered by the data being imported. If the import frequency is not constant (i.e. your job could run at any time), you would also want to account for the length of the time period, but for simplicity’s sake let’s ignore that and assume we import usage data once a day. With this approach, popularity=0 means the document saw no usage, and popularity=1 means all requests on a given day were for that document.

For most sites, recent usage matters more than past usage. To account for this, we recommend incorporating a decay factor as part of the update step. The simplest approach is a one-pole IIR (infinite impulse response) filter, as shown in this recursive equation:

U'(t) = (1 - Q) U(t) + Q U'(t-1)

Here, U and U' are the input and output usage as functions of time t, and Q is a scaling parameter that controls the speed of the decay. The figure below shows the behavior of this filter in response to a step-wise usage function that remains constant (at 100) for 62 days and then drops to zero.

[Figure: IIR filter response over time]

You can have more precise control over the shape of the response if you’re willing to store more historical samples (U(t-1), U'(t-2), etc.), but for this purpose that isn’t really necessary, and it complicates the whole process.

To settle on a reasonable value for Q, consider the trailing part of the filter’s response, which shows how long the output signal persists after the input drops to zero: it’s a decaying exponential. Choose a time period after which it makes sense for a popularity boost to fade. In our case, we decided that after 30 days of no usage, any historical usage for a document was no longer relevant. As its name says, the IIR response is infinite, so it never drops all the way to zero, and we also have to choose an acceptable cut-off threshold; we decided 10% was low enough. Once the input goes to zero, the output is multiplied by Q at each step, so to decay to 10% after 30 samples, choose Q such that Q^30 = 0.1:

Q = 0.1^{1/30} ≈ 0.926
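Putting the scaling and aging steps together, a sketch of the daily update might look like this (the function and variable names are ours, purely for illustration; the math is just the equation above with Q ≈ 0.926):

```python
# Q chosen so that popularity decays to 10% of its value after 30 idle days.
Q = 0.1 ** (1.0 / 30)   # ~0.926

def update_popularity(todays_usage, popularity):
    """todays_usage: doc_id -> raw event count for the day.
    popularity: doc_id -> current popularity score in [0, 1].
    Returns a dict of updated popularity scores."""
    max_usage = max(todays_usage.values(), default=0)
    updated = {}
    for doc_id in set(todays_usage) | set(popularity):
        # Scale raw usage by the day's maximum, so U(t) lies in [0, 1].
        u = todays_usage.get(doc_id, 0) / max_usage if max_usage else 0.0
        # One-pole IIR filter: U'(t) = (1 - Q) U(t) + Q U'(t-1)
        updated[doc_id] = (1 - Q) * u + Q * popularity.get(doc_id, 0.0)
    return updated
```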

Now that we have our usage data scaled and properly aged, let’s consider how to store it in Solr.

Indexing boosts in Solr search

If you hunt around on the web for HOW-TO guides for popularity boosting with Solr, you’ll probably find yourself reading about the ExternalFileField on the Solr Wiki, over at this other blog post, or on Safari. ExternalFileField is a mechanism that lets you manage some data in a text file outside of Lucene’s index, and the classic use for it seems to be maintaining an external boost like popularity. You might wonder why people have bothered with an external file when they could store popularity data in the index itself. The reason is that popularity tends to change frequently (a typical setup processes usage logs daily), and updating a single field in a Lucene document requires re-indexing all of its fields. Updating a popularity boost daily in this way would require re-indexing all content every day if it were stored in the main index.

ExternalFileField gets around this, at the cost of some restrictions: chiefly, you can’t search the contents of the EFF the way you can indexed fields. For popularity this isn’t a problem, since we want to use it in scoring but not as part of search. If you wanted to, say, limit your search to items with a popularity between 100 and 200 (whatever that means), you couldn’t do it easily, but that isn’t a requirement here, so the restriction is acceptable. There are other operational problems with EFF, though. The mechanism for updating it is a little messy, since you can’t use the Solr client API to do it; rather, you have to come up with some way to upload the file to your Solr server and then notify Solr to reload it.

It’s a fine solution that has worked for many people, but there is a newer feature, updatable numeric DocValues, that also offers a means of scoring on a field that can be updated independently of the main index. Our system uses that feature, so that’s what we’ll describe here. There is an open ticket for providing DocValues updates via Solr, but if you can’t wait for that to be released, you could use our solution, which is an UpdateRequestProcessor. I won’t go into the details of that implementation, except to say that it is pretty new and hasn’t been tested in a cloud setup (it should work, but it must run on the shards since it accesses the IndexWriter directly), so use it with care if you choose to adopt it.

One thing to be aware of with DocValues fields is that, unlike other Lucene fields, they exist either for every document or for none, and, also unlike other fields, there is a distinction between creating the field and updating it. Declaring a DocValues field in the Solr schema is not enough to create it: you have to add a DocValue to some document in order to create the field. Once the field has been created (for some document), you can update its value for any document. The upshot is that when you add a new DocValues field, you have to insert (i.e. reindex) at least one document before attempting to update your new popularity field.
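To illustrate that caveat, here is a minimal sketch (the core name, URL, and field name are hypothetical, and the popularity field must already be declared as a numeric DocValues field in the schema) that seeds the field by indexing one document through Solr’s standard JSON update handler. Subsequent per-document updates would then go through whatever update path you use, such as the UpdateRequestProcessor mentioned above:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"  # hypothetical core

# Index one document carrying the new field, so that the DocValues field is
# actually created in the index; until then, updates to it have nothing to target.
seed_doc = {"id": "seed-doc", "popularity": 0.0}
resp = requests.post(SOLR_UPDATE, json=[seed_doc], params={"commit": "true"})
resp.raise_for_status()
```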

Search query boosting

Now that we have our data all lined up, we are finally in a position to affect users’ query results. First, it’s easy to sort by popularity: you simply sort (in descending order, naturally) by the popularity field, and for this we don’t even need to worry about proper normalization. A more subtle approach is to blend popularity with the core relevance score. To do this, we use the Solr boost parameter, whose value is multiplied with the core (TF-IDF or other word-based) relevance score. Since our popularity has been normalized to the range [0,1], we can apply a boost using a simple formula that includes a free parameter, P, so we can experiment to determine the proper strength of boost to apply (or even allow the user to control it by selecting a corresponding sort option). A good approach here is to use Solr’s edismax query parser and set its boost parameter to:

sum(1,product($P,popularity))

And then experiment with different values of P. P=0 would disable popularity boosting, while higher values apply more boosting. This syntax allows P to be supplied as a query parameter, so you could easily tinker with it on a web form, or have different user controls supply different values of P.
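As a concrete illustration (hypothetical core and field names again), here is how the boosted query and the plain popularity sort might look from Python:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycore/select"  # hypothetical core

# edismax query whose relevance score is multiplied by (1 + P * popularity);
# $P in the boost function is dereferenced from the ordinary request parameter P.
params = {
    "defType": "edismax",
    "q": "machine learning",
    "qf": "title text",
    "boost": "sum(1,product($P,popularity))",
    "P": 3,        # P=0 turns the popularity boost off
    "wt": "json",
}
boosted = requests.get(SOLR_SELECT, params=params).json()

# Plain popularity ordering, ignoring word-based relevance entirely:
by_popularity = requests.get(SOLR_SELECT, params={
    "q": "*:*", "sort": "popularity desc", "wt": "json"}).json()
```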

Experiment

Choose a metric and optimize it. It’s best to run off-line experiments to predict the success of any new search algorithm before unleashing it on your customers. In the end though, there is no substitute for experience. Try it out and see how it works!

