Boosting quality search results
How to use popularity to deliver better search results
Since the 1960s, the holy grail of information retrieval research has been finding the documents most relevant to a user’s query. More recently, evaluations sponsored by NIST have shepherded academic research in this area, and progress has been made. These evaluations relied on expert relevance judgments as the yardstick for their standardized searches and canned document sets, and work tended to focus on improving textual analysis and scoring algorithms.
The advent of the web, and search as big business, changed all that. We soon began to realize that there were just too many web pages that, by textual measures alone, were deemed relevant, and a lot of them seemed like junk. Search engines sought additional measures of web page quality beyond mere text matching. By the time of Google’s IPO in 2004, we had heard about the best known of these, the PageRank algorithm, which essentially counted inbound links as a means of judging the intrinsic worth of a web page.
Since then, we’ve learned to game the system by larding pages with false metadata, and by creating link farms. In one especially notorious example, J.C. Penney cornered the market on evening wear (and lots of other items) for the 2010 holiday season. Google issued corrections and altered its search algorithm to combat this, including the use of paid experts’ quality judgments. Plus ça change…
Too Much Information
But what’s a little company to do? In the era of Big Data it seems we all have more information than we know what to do with, and we’re faced with the problem of sifting for quality. One happy development has been the availability of high-quality, free, open source search software. As readers of this blog may know (see Adventures in Search (Solr) and Search suggestions with Solr using multiple analyzers), Safari uses Lucene and Solr to index its books and retrieve relevant results for our users’ search queries.
This search system gives us, as search architects, “sockets” to plug in practically any signal we can dream up (and quantify) as a “boost.” A boost is simply a factor of some kind that becomes part of the relevance formula that ranks results. It’s up to us to decide which signals to incorporate, and how much weight to give them.
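As a concrete (if simplified) sketch of how a boost enters the ranking: the factor multiplies the text-relevance score, so an extrinsic signal can reorder otherwise equally relevant results. The log-damped formula, field names, and weight below are illustrative assumptions, not Safari’s actual relevance code.

```python
import math

# Sketch of a multiplicative boost: an extrinsic signal (here, "popularity")
# scales the text-relevance score. Names and formula are illustrative only.

def boosted_score(text_score, popularity, weight=0.5):
    """Log damping keeps a runaway popularity count from drowning out
    textual relevance; `weight` sets how much the signal matters."""
    return text_score * (1.0 + weight * math.log1p(popularity))

# Two books with identical text relevance; the more popular one wins.
results = [
    {"title": "Book A", "text_score": 2.0, "popularity": 0},
    {"title": "Book B", "text_score": 2.0, "popularity": 100},
]
ranked = sorted(results,
                key=lambda d: boosted_score(d["text_score"], d["popularity"]),
                reverse=True)
print([d["title"] for d in ranked])  # → ['Book B', 'Book A']
```

In Solr, for instance, a factor like this can be plugged in at query time as a function query over a stored popularity field, which is what makes the “socket” metaphor apt: the signal is swappable without rewriting the scoring code.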
The Search for Delicious
We want not just relevant results but good results, yet we haven’t said what “good” means. Robert Pirsig once wrote that “… Quality cannot be defined. If we do define it we are defining something less than Quality itself.” So how do we accomplish the impossible and not only define quality, but reduce it to a numeric quantity?
I believe the distinction to draw is between intrinsic and extrinsic signals. We begin with a host of intrinsic data to draw on: data that comes from the documents that are the target of the search. On Safari, we have the words from the texts, dates of publication, authors, and other book metadata. The extrinsic signals arise out of people’s experience of the books: sales figures, usage statistics, reviews, recommendations, bookmarking, annotation, wish lists, back links, the rankings of paid experts, and so on.
In a sense we can’t really define these extrinsic signals, but we can measure them. We can’t say precisely (numerically) why people seem to like Zen and the Art of Motorcycle Maintenance, but we can measure how many people do (and qualify their preference by age, political party, and so on).
But before we do that, let’s just consider whether this is actually a good idea or not.
Popping the Filter Bubble
We can promote things people seem to like. If we all like them, they must be better, right? We’ve been taught by James Surowiecki that crowds are wise. But there are problems with mass judgments.
- Crowds can be deceived. The theory is that crowds, collectively, have access to more information than any single individual, and make better judgments in the aggregate, but this may not be the case when access to information is very tightly controlled, or if the crowd is an isolated sub-group that only listens to itself.
- Crowds can be tyrannical. In many areas, especially in matters of taste, a diversity of individual judgments is often preferable to a single mass one.
One thing we know is that popularity is self-reinforcing. The more popular something is, the more it gets promoted, and the more popular it becomes. Unless you’ve been living under a rock, or in the woods (as I have), you’re probably aware of Eli Pariser’s cautionary screed. Pariser coined the term “filter bubble” to describe a similar problem, even writing a book about it. His basic point is that if we’re concerned about creativity, and our culture at large, we ought to be very careful about recommending only things that will be liked. (You can also watch a more polished version of his presentation as a TED talk, but I like the rough-hewn version above from the Pop Tech! conference).
We worry a lot about perpetuating filter bubbles. Is there enough of an external signal available to us, or will we simply create popular items as an emergent property of internal feedback loops, without any relation to the quality of the items themselves? We believe the answer is: it’s complicated, and it depends. There are a lot of different “popularity” signals we could use. Our intuition is that the closer those signals are to the rankings they influence (the tighter the feedback loop), the more risk there is of a bubble forming. Conversely, the further from search results we go, the more independent the signal is and the more valuable it is as a true indicator of quality, as long as there is a real association between the signal and the item of interest.
More concretely, let’s contrast search result click statistics with reading statistics. (At Safari we care about reading; an e-commerce site might use purchasing as an “engagement” metric). By search result click statistics, we just mean the number of times users click on a given item in search results. Reading statistics can be gathered in a few ways: number of unique visitors to a book, number of pages read, amount of scrolling. It seems likely that reading statistics will depend on search statistics in some way (how will users manage to read a book if they can’t find it in our search results?). In spite of that, we think the choice of a particular statistic as a popularity metric may in fact matter very deeply.
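The reading statistics above could be rolled into a single engagement number along these lines. This is only a sketch: the weights and field names are invented, and a real system would tune them against observed behavior.

```python
# Sketch: rolling several reading signals into one engagement score.
# Weights and field names are invented, not Safari's actual formula.

def engagement_score(stats, w_visitors=1.0, w_pages=0.1, w_scroll=5.0):
    """stats holds per-book counters gathered from reading behavior;
    avg_scroll_depth is a fraction in [0, 1]."""
    return (w_visitors * stats["unique_visitors"]
            + w_pages * stats["pages_read"]
            + w_scroll * stats["avg_scroll_depth"])

book = {"unique_visitors": 120, "pages_read": 900, "avg_scroll_depth": 0.7}
print(engagement_score(book))  # ~213.5
```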
Result clicks are very directly influenced by relevance ranking. We know that users click on the first search result much more frequently than they do any deeper one, and this bias can be measured as distinct from the quality of the results themselves. Using search result clicks as input to the quality signal is almost certainly too circular and likely to lead to a self-reinforcing bubble. But this bias weakens as users get further away from the search results, engage more deeply, and have an opportunity to form judgments based on experience.
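One common way to partially decouple click statistics from the rankings that produced them is to weight each click by the inverse of the click rate expected at its position. The position priors below are made-up illustrative numbers, not measured data.

```python
# Sketch: correcting raw result-click counts for position bias.
# The position priors are invented for illustration, not measured.

# Rough probability that a user clicks a result at each rank,
# regardless of what is actually shown there.
POSITION_PRIOR = {1: 0.40, 2: 0.15, 3: 0.08, 4: 0.05, 5: 0.03}

def debiased_clicks(clicks_by_position):
    """Weight each click by the inverse of its rank's prior, so a click
    at rank 5 says more about quality than a click at rank 1."""
    return sum(n / POSITION_PRIOR[pos]
               for pos, n in clicks_by_position.items())

always_first = {1: 40}       # 40 clicks, all from the top slot (~100 debiased)
found_anyway = {4: 3, 5: 3}  # 6 clicks from deep in the results (~160 debiased)
print(debiased_clicks(found_anyway) > debiased_clicks(always_first))  # True
```

Under this correction, a book that users seek out despite a poor ranking outscores one that merely occupies the top slot, which is exactly the kind of signal that resists the feedback loop.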
The question that remains is: does the search-rank bias weaken enough? We really don’t know. Safari will be introducing popularity as a sorting criterion in our search, and soon it will become part of our relevance formula as well, and then we’ll see. We believe this will help people find what they’re looking for, in the main, and if it isn’t working, we’ll change it. Of course this raises the question: how will we know it isn’t working if we’re caught in a bubble of our own making? I’m not worried about that: our customers will surely tell us.
Listening to many voices
We want our search results to reflect the diversity of interests among our readers and of the content we provide. If we push popular results to the top, won’t we be promoting the tyranny of the majority?
Incorporating usage statistics in search result ranking gives voice to users, and in a broad way, creates a cultural phenomenon of sorts, but it’s just a first step towards a truly rich and diverse culture. We want to foster countercultures and subcultures, too.
To do that, we will need a more nuanced interpretation of our statistics. We’ll want to tread carefully to avoid generating filter bubbles, and be transparent about what we’re doing so users can make informed judgments of their own. In general the directions we can see are:
- Clustering searches topically, and building popularity metrics that are topic-specific.
- Conditioning search result ranking on users’ reading history. In this way the site can be many things to many people.
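The first of these directions could be sketched as follows. The topics, titles, and counts are invented, but the idea is that popularity is tallied within a topic, so a niche title can be “popular” among its own audience even if it is invisible site-wide.

```python
from collections import defaultdict

# Sketch: topic-specific popularity. Topics and read counts are invented.
reads = [
    ("Zen and the Art of Motorcycle Maintenance", "philosophy"),
    ("Learning Python", "programming"),
    ("Learning Python", "programming"),
    ("Erlang in Anger", "programming"),
    ("Erlang in Anger", "programming"),
    ("Erlang in Anger", "programming"),
]

# Tally reads per title within each topic, not globally.
by_topic = defaultdict(lambda: defaultdict(int))
for title, topic in reads:
    by_topic[topic][title] += 1

def topic_popularity(title, topic):
    """Share of a topic's reads that went to this title."""
    total = sum(by_topic[topic].values())
    return by_topic[topic][title] / total if total else 0.0

# A philosophy classic scores perfectly within its own topic even though
# it has far fewer reads than the programming titles overall.
print(topic_popularity("Zen and the Art of Motorcycle Maintenance", "philosophy"))  # 1.0
print(topic_popularity("Erlang in Anger", "programming"))  # 0.6
```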
For implementation details, continue reading part 2, Implementing Popularity Boosting with Solr.