While Haystack is a great way for Django developers to get search up and running quickly and with relatively little pain, its strategy of supporting many popular search engines tends to encourage catering to the lowest common denominator. Fortunately, Haystack, like Django, gets out of the way when you need it to and allows you to leverage the power of the underlying engine, such as Solr, with little difficulty.
In building a search application for a collection of full-length books, several issues came up that were not well addressed by Haystack’s or Solr’s default configuration. While I made numerous configuration changes, the most generally interesting changes come from the analyzer settings in the schema file.
Searches including contractions and compound words turned out to be particularly difficult, so I spent many hours experimenting with analyzer settings in an attempt to obtain the results I expected. I have collected some of my most notable discoveries here.
An analyzer is a tokenizer followed by a chain of filters. The tokenizer breaks text down into tokens (usually words), and each filter can transform those tokens in any imaginable way. An analyzer can be set to run on the actual text of the book being indexed, or on the search string; the former is known as an index analyzer and the latter as a query analyzer.
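To make that distinction concrete, here is a minimal, hypothetical fieldType with a separate analyzer for each side (my real schema appears at the end of the post):

```xml
<!-- Minimal sketch only: one tokenizer plus one filter per analyzer -->
<fieldType name="text_example" class="solr.TextField">
  <!-- runs over document text at index time -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- runs over the user's search string at query time -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The two chains are often identical, but as you will see below, there are good reasons to make them diverge.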
Contractions

The default configuration of Solr and/or Haystack did not handle contractions well: searching for aren’t did not return results with are not in them, and vice versa.
I evaluated several approaches to resolving this and ended up with brute force. There are a finite number of contractions in the English language, so why not just use a dictionary with a synonym filter? I started with a file that looked like the following:
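An excerpt in Solr's synonym-file format, assuming equivalence classes (the real file lists every contraction):

```text
# synonyms.txt (excerpt): each comma-separated line is an equivalence
# class; with expand="true", Solr indexes every form in the class
aren't,are not
can't,cannot
isn't,is not
won't,will not
```
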
Solr’s synonym filter includes all versions of a word in its indexes so that either the contraction or the full English representation will cause a hit. I used this synonym filter only in the index analyzer. Because all forms of the word are in the index, any form entered in the query will match.
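The corresponding filter line sits in the index analyzer only; the query analyzer has no synonym filter at all:

```xml
<!-- index analyzer only: expand="true" puts every form in the index -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        expand="true" ignoreCase="true"/>
```
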
Compound Words and Adjacency
The compound word issue first reared its ugly head with the word layoff. The sample book I was searching for had lay off in the title rather than layoff. I investigated Solr word de-compounding filters but was unsuccessful in getting any of them to work correctly (and welcome reproof and instruction in this matter). So I took things into my own hands and employed a trick of questionable validity.
I have been fascinated with n-grams for some time. An n-gram is a sequence of n items strung together. It can refer to letters, syllables, whole words, or any other unit. Solr supports letter-based n-grams, but my approach was to use word n-grams. Solr’s ShingleFilterFactory is a tool for creating word-based n-grams. I wanted to keep things simple and my database small so I chose to use n-grams of 2 words, otherwise known as bi-grams.
Using n-grams can help to give weight to word adjacency. What Solr’s shingle filter does is create a number of n-grams composed of adjacent words. Using a shingle filter configured to produce bi-grams, a sentence such as “I like traffic lights” becomes the bi-grams “I like,” “like traffic,” and “traffic lights.” Leaving the original words in the index is also an option, so with that option in place, Solr stores something like “I,” “I like,” “like,” “like traffic,” “traffic,” and so on.
At first I configured my shingle filter to create simple bi-grams delimited by a space. But then I realized that a compound word is simply a word bi-gram with no delimiter. When using a shingle filter with no delimiter (and keeping the original words), a sentence like “I dislike lay offs” becomes “I,” “Idislike,” “dislike,” “dislikelay,” “lay,” “layoffs,” and “offs.” Now all of a sudden we have a hit for the search term layoffs.
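The shingle filter line from my schema that does this; the empty tokenSeparator is the whole trick:

```xml
<!-- bi-grams glued together with no separator, original words kept -->
<filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="true" tokenSeparator=""/>
```
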
This seems very clever at first, but you must also realize that you may get hits on strings that span word boundaries. For example, searching for “weight car ton,” we might now match the sentence “I bought a carton of milk,” which is completely unintended. Search is voodoo to begin with, so we can always hope that this particular hit will have such a low weight (due to the low number of matching words) that the searcher will be unlikely to see it.
Note that I applied this filter to both the index and query analyzers. The jury (actually the QA team) is still out on this method.
Protecting Contractions and Strange Words
Now I discovered that my contractions were no longer matching. I had switched to a different tokenizer and begun using the WordDelimiterFilterFactory, which automatically strips the endings off contractions (in addition to its other nice features). This is counterproductive when your goal is to expand contractions. Fortunately, the word delimiter filter has a protected-words feature. I simply created yet another dictionary containing all the relevant contractions, and the filter factory left my words alone.
In addition to contractions, there are some special words in the computing industry that can sometimes be difficult to search on. My current list is f# (I welcome recommendations on additions to this list). Normally these words get stripped of their special characters and no longer resemble the original.
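A protected-words file is just one token per line; mine looks roughly like this (contractions trimmed for space):

```text
# protwords.txt (excerpt): WordDelimiterFilterFactory leaves
# these tokens completely untouched
aren't
can't
won't
f#
```
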
But I was not yet done fixing my broken contractions. Now that I was combining words into bi-grams, the synonym dictionary that had solved my contraction problem no longer matched. No problem; I just mangled the dictionary a bit so that it looked like:
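Roughly, each equivalence class also needs the run-together form that the separator-less shingle filter puts in the index. An illustrative excerpt, not the exact file:

```text
# synonyms.txt after mangling (illustrative): the glued form matches
# the no-separator bi-grams that the shingle filter produces
aren't,are not,arenot
isn't,is not,isnot
won't,will not,willnot
```
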
Stemming

Stemming is the process of removing inflected word endings to arrive at a stem; for instance, the stem of the word jumped is jump. Using a stemmer allows multiple forms of a word to match. I chose to use the Hunspell stemmer with a dictionary from the OpenOffice project, which seems a little more accurate and precise than the default stemmer.
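The relevant filter line from the schema below; the dictionary and affix files come from the OpenOffice en_US spelling dictionary:

```xml
<filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic"
        affix="en_US.aff" ignoreCase="true"/>
```
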
Search is always a work in progress. Even search giants like Google regularly tweak their search algorithm. This is a snapshot of my approach to search at this moment in time. Your mileage may vary and I welcome recommendations and corrections.
Disclaimer: I am not a search expert. I don’t know all the correct terminology and have probably made a number of blunders in my foray into the art of search. Suggestions in the comments are welcome!
The Analyzers Section of my Schema.xml
The ordering of filters in these analyzers is critical. This order has been established through many hours of trial and error.
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- We are indexing mostly HTML so we need to ignore the tags -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
    <!-- This deals with contractions -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" enablePositionIncrements="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" enablePositionIncrements="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>