
There is a specific problem we run into again and again whenever we deal with searching a sufficiently diverse set of content. We call it the dictionary problem because it’s particularly endemic to dictionaries, but it’s a general problem in which relevance ranking tends to prioritize minimal entries that mention a subject over larger entries that are about a subject. To explain why this happens, I’m going to start by saying a little more about relevance ranking.

Mike Sokolov wrote a lot about relevance ranking a couple of weeks ago, in the context of using popularity to improve search on Safari itself. But for our Pubfactory offerings, we need to use intrinsic signals, because that’s usually all we have, and the audience for those sites often wants very specific results, not results based on others’ use of the site. So we’re ranking the individual items that result from a search based on the characteristics of each item. For a simple search for a single term (the most common search case), the information we have to work with is:

  • How often the search term appears in each part of the result document
  • How much text is in each part of the document

And that’s the extent of the data that will vary between results.

So the main way we control the relevance ranking is by weighting the different parts of the result document differently, such as making the title more important than the text. But in circumstances where we’re ranking items that contain the search term in the same parts of the document, we have two options:

  1. Order by how many times the term appears. This has the problem that a large item that mentions the term a few times will show up before a small item about the term.
  2. Order by how often the term appears, i.e. how many times it appears divided by the size of the text. This works better than option 1, but it has a particular issue, which we call the “dictionary problem” because it makes dictionary items outrank other items and tends to rank them incorrectly.
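The two options, combined with per-field weighting, can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine computes scores: the function names, the `FIELD_WEIGHTS` values, and the two sample items are all hypothetical, and real engines add stemming and other normalization.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def count_score(term, text):
    """Option 1: how many times the term appears."""
    return tokenize(text).count(term)

def density_score(term, text):
    """Option 2: occurrences divided by the size of the text."""
    tokens = tokenize(text)
    return tokens.count(term) / len(tokens) if tokens else 0.0

# Hypothetical weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 5.0, "body": 1.0}

def weighted_score(term, doc, scorer, weights=FIELD_WEIGHTS):
    """Sum the per-field scores, each scaled by its field weight."""
    return sum(w * scorer(term, doc.get(field, "")) for field, w in weights.items())

# Two made-up items that match only in the body, so field weights cannot
# separate them.
long_item = {"title": "Household pets", "body": "cat " * 3 + "filler " * 297}
short_item = {"title": "Felines", "body": "cat " * 2 + "filler " * 8}
items = [long_item, short_item]

by_count = sorted(items, key=lambda d: weighted_score("cat", d, count_score), reverse=True)
by_density = sorted(items, key=lambda d: weighted_score("cat", d, density_score), reverse=True)

print(by_count[0]["title"])    # the long item wins on raw count
print(by_density[0]["title"])  # the short item wins on density
```

With raw counts the 300-word item (3 matches) outranks the 10-word item (2 matches); with density the order flips, which is the behavior option 2 is meant to provide.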

When relevance is a function of how much of the document matches the search, a minimal document that includes the term will always be ranked very highly. As an example, consider four documents. The first is an encyclopedia article about cats. It mentions cats a number of times as it explains various facts about cats in biology, history, and other fields. The second is an encyclopedia article about catnip. It contains biological and historical information, and mentions cats’ interest in the plant only in passing. The third is a dictionary entry for cat. It notes that a cat is an endothermic quadruped and includes other meanings of the word cat as well. The fourth is a dictionary entry for catnip. It is ten words long, and “cat” is one of those words.

When we rank these items for a search for “cat”, the encyclopedia articles will be ranked sensibly relative to each other, because the catnip article matches with less of its content than the cat article. But “cat” makes up 10% of the catnip definition, so that entry will be ranked more highly than anything else. Users will be confused when the site treats “catnip” as more relevant than “cat” itself.
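To make the arithmetic concrete, here is a small sketch of density ranking over the four documents. Only the catnip definition’s figures (1 match in 10 words) come from the description above; the other counts and lengths are invented purely for illustration.

```python
# (occurrences of "cat", total words) for each document.
# All figures except the catnip definition's are assumed for illustration.
DOCS = {
    "Encyclopedia: Cat": (40, 2000),
    "Encyclopedia: Catnip": (5, 1500),
    "Dictionary: cat": (3, 60),
    "Dictionary: catnip": (1, 10),
}

def density(matches, total_words):
    """Fraction of the document's words that match the search term."""
    return matches / total_words

# Highest density first.
ranked = sorted(DOCS, key=lambda name: density(*DOCS[name]), reverse=True)

for name in ranked:
    print(f"{density(*DOCS[name]):6.2%}  {name}")
```

Under these assumed numbers both dictionary entries land above both encyclopedia articles, and the ten-word catnip definition comes first.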

So how do we fix this? There are a few options, and which one makes sense will vary by situation. The simplest is to make sure titles are weighted much more heavily than other text. This works for most cases but, depending on the technology, may not sufficiently distinguish between items such as “Cat (mammal)” and “Cat-nap”. A similar option is to run a separate title-only search and return those results first. This has some of the same weaknesses as the weighting solution and adds to the request time for the search, but it’s usually helpful. In some cases, it may be necessary to have the system reorder the results after retrieving them from the search engine. In one of our systems, we have code that post-processes quicksearch results to prioritize exact matches, then matches beginning with the search term, then matches that merely contain it. Each group is alphabetized to remove any vagaries of the relevance system. This is an extreme case, but not an impossible one.
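A minimal sketch of that kind of post-processing, assuming the results are plain title strings and matching is case-insensitive. The function names and the catch-all fourth tier (for results whose title does not contain the term at all) are illustrative assumptions, not the code from our system.

```python
def match_tier(title, term):
    """Return 0 for an exact match, 1 for a prefix match, 2 for a
    substring match, and 3 for anything else (assumed catch-all)."""
    t, q = title.casefold(), term.casefold()
    if t == q:
        return 0
    if t.startswith(q):
        return 1
    if q in t:
        return 2
    return 3

def reorder_results(titles, term):
    """Group results by match tier, alphabetizing within each tier."""
    return sorted(titles, key=lambda title: (match_tier(title, term), title.casefold()))

results = ["Wildcat", "Catnip", "Cat", "Cat-nap", "Bobcat", "Cat (mammal)"]
print(reorder_results(results, "cat"))
# ['Cat', 'Cat (mammal)', 'Cat-nap', 'Catnip', 'Bobcat', 'Wildcat']
```

Sorting on a `(tier, title)` tuple keeps the grouping and the alphabetization in a single pass, and the engine’s own relevance score plays no part in the final order.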

Hopefully you’ll never have this problem. But if you do, I hope I’ve provided you some assistance in understanding it and dealing with it.

Tags: let's find cats, ranking, search, solr

One Response to “The “dictionary problem” in search ranking”

  1. Bobby

    Very nice write up Marc. How do other technologies such as MarkLogic handle prioritization of exact matches vs stemming?