
There are some great new search suggestion (aka autocomplete) features showing up in Lucene and Solr. The Solr documentation hasn’t quite caught up, so I thought sharing my experience might be helpful to others working with these features. The Solr features build on capabilities added to Lucene, which are described well in a series of blog posts by Lucene guru Mike McCandless starting with this one, but it still takes a bit of finagling to decide exactly how to use them. I’ll describe our use case, and present some interesting extensions I needed in order to get things working just how we wanted.

On our site we offer users a single search box that searches across a number of different fields: full text, title, author, ISBN, and publisher. We realized that to give users the most convenient, intuitive search experience, we needed to offer search suggestions: the site should guess what the user might be trying to type, and offer these guesses in an unobtrusive menu. This feature is helpful in several ways: by exposing the terms that are in the index, it saves typing by anticipating the user’s query, improves accuracy by correcting spelling mistakes up front, and can even propose serendipitous alternatives.

In our case, we want to complete terms across several fields at once: we don’t know whether the user is typing an author name, or a title, or just wants to do a full-text search. I started by using Solr’s copyField directive to concatenate all the fields into a single field to use as the source of suggestions. This does allow suggestions to be drawn from the terms in all the fields, but it exposed an annoying problem. We want to suggest words from the full-text index, but we would much rather suggest complete titles and author names: if the user starts typing “alex”, it’s much more helpful to know that we have “Alexandre Rafalovitch” as an author (of Instant Apache Solr for Indexing Data How-To) than to know we have the word “alexandre” in the index. In addition, the AnalyzingInfixSuggester offers the ability to match in the middle of a phrase while performing analysis transformations (lower-casing, etc.), so if we start typing “raf” we should still see the author’s full name.

The key point is that we want to perform different kinds of analysis on the different fields. Using copyField we can’t do that: analysis is defined by the destination field of the copyField, so it is the same for all sources. Following Alex R.’s suggestion, I looked into implementing a custom UpdateRequestProcessor Solr plugin that would act like copyField during indexing, but apply analysis “up front” and deliver pre-analyzed TokenStreams to Lucene. Essentially this is a way to bypass Solr and Lucene’s mapping from field name -> Analyzer, substituting our own.

The FieldMergingProcessor we wrote to do this is really pretty simple. It reads a map of field name -> field type name, and when it sees input matching one of those field names, applies the analyzer associated with the given field type. In outline, the core of the code looks something like this (a simplified sketch with illustrative names; the full implementation is in the repo linked below):
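
    // Simplified sketch (Solr 4.x APIs); see the repo for the real class.
    // fieldAnalyzers maps each source field name to the Analyzer resolved
    // from its configured field type; destinationField is the merged
    // suggestion field. Both names are illustrative.
    import java.io.IOException;
    import java.util.Collection;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.TextField;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class FieldMergingProcessor extends UpdateRequestProcessor {

        private final String destinationField;
        private final Map<String, Analyzer> fieldAnalyzers;

        public FieldMergingProcessor(String destinationField,
                                     Map<String, Analyzer> fieldAnalyzers,
                                     UpdateRequestProcessor next) {
            super(next);
            this.destinationField = destinationField;
            this.fieldAnalyzers = fieldAnalyzers;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            for (Map.Entry<String, Analyzer> entry : fieldAnalyzers.entrySet()) {
                String sourceField = entry.getKey();
                Collection<Object> values = doc.getFieldValues(sourceField);
                if (values == null) {
                    continue;
                }
                for (Object value : values) {
                    // Analyze with the *source* field's analyzer, producing a
                    // pre-analyzed TokenStream that bypasses the destination
                    // field's own analysis chain.
                    TokenStream tokens = entry.getValue().tokenStream(sourceField, value.toString());
                    // Wrap the stream in a Field; the plugin arranges for this
                    // pre-analyzed value to reach Lucene as-is rather than being
                    // re-analyzed (an assumption in this sketch -- the actual
                    // mechanics are in the repo).
                    doc.addField(destinationField, new TextField(destinationField, tokens));
                }
            }
            super.processAdd(cmd);
        }
    }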

Notice that we call the tokenStream() method on a specific analyzer configured for each source field. This lets us do things like analyze the full text by tokenizing it into words, while analyzing author names using a KeywordAnalyzer, which doesn’t do any tokenization.
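
For example, the analyzer map might pair the full-text field with a word-tokenizing analyzer and the title and author fields with keyword analyzers (the field names here are hypothetical; in the real plugin the analyzers come from field types declared in the configuration):

    Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
    fieldAnalyzers.put("text",   new StandardAnalyzer(Version.LUCENE_47)); // split into words
    fieldAnalyzers.put("title",  new KeywordAnalyzer());  // whole title as a single token
    fieldAnalyzers.put("author", new KeywordAnalyzer());  // whole name as a single token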

I got this code running in a simple test, and it seemed to work fine, but when I tried it out with real data I ran into a problem. When a field has multiple values, the Analyzer hands back the same TokenStream for each of them, and that can’t work when we need several streams alive at once. Lucene’s analysis chain is heavily optimized and returns the same TokenStream object every time you call tokenStream(); the TokenStreams are generally cached globally, or per-field, and this reuse policy is controlled by the ReuseStrategy attached to the analyzer. With my naïve implementation, I got a “TokenStream contract violation” exception; Lucene helpfully detects miscues of this sort, which is a good thing, since the analysis code path can be a bit confusing.
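
For reference, this is the workflow Lucene expects every TokenStream consumer to follow (a generic illustration, not code from the plugin). Requesting a second stream for the same analyzer and field before the first has been closed hands back the same object mid-use, which is what trips Lucene’s checks:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
    TokenStream tokens = analyzer.tokenStream("field", "some text to analyze");
    CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
    tokens.reset();                      // must be called before consuming
    while (tokens.incrementToken()) {
        System.out.println(term.toString());
    }
    tokens.end();                        // records final offset state
    tokens.close();                      // releases the stream for reuse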

My solution to this problem was to create a PoolingAnalyzerWrapper: an Analyzer that wraps another and provides a pool of TokenStreamComponents that can be reserved, then released and made available for reuse. It is tightly coupled with a PooledReuseStrategy that mediates the association of the pool with the analyzer. Using this, we can create a distinct token stream for each source field value in a document. Then, when Lucene is done indexing the document, we release the streams into the pool so they can be re-used for the next document.

The code for this is a little complex due to the interlocking design of Analyzers, TokenStreamComponents, and ReuseStrategy, but in the end all the pieces fit together without any real problem. ReuseStrategy has set/getStoredValue(Analyzer) methods which are typically used to store a single TokenStreamComponents instance; we use them to store a TokenStreamComponentsPool, which can in turn hold multiple TokenStreamComponents.
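
Condensed to its essentials, the strategy and the pool fit together something like this (a simplified sketch; the real classes handle more bookkeeping):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;

    // Simplified sketch of PooledReuseStrategy. getStoredValue() and
    // setStoredValue() are the hooks ReuseStrategy provides for keeping
    // per-analyzer state -- here, a pool instead of a single
    // TokenStreamComponents.
    public class PooledReuseStrategy extends Analyzer.ReuseStrategy {

        @Override
        public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
            TokenStreamComponentsPool pool = (TokenStreamComponentsPool) getStoredValue(analyzer);
            // Hand back a pooled stream if one is free; returning null makes
            // the Analyzer create fresh components instead.
            return pool == null ? null : pool.acquire();
        }

        @Override
        public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
            TokenStreamComponentsPool pool = (TokenStreamComponentsPool) getStoredValue(analyzer);
            if (pool == null) {
                pool = new TokenStreamComponentsPool();
                setStoredValue(analyzer, pool);
            }
            pool.reserve(components); // freshly created stream, in use for this document
        }
    }

    // Minimal pool: streams in use for the current document, plus a free list.
    class TokenStreamComponentsPool {
        private final ArrayDeque<TokenStreamComponents> free = new ArrayDeque<TokenStreamComponents>();
        private final ArrayList<TokenStreamComponents> inUse = new ArrayList<TokenStreamComponents>();

        public synchronized TokenStreamComponents acquire() {
            TokenStreamComponents components = free.poll();
            if (components != null) {
                inUse.add(components);
            }
            return components;
        }

        public synchronized void reserve(TokenStreamComponents components) {
            inUse.add(components);
        }

        public synchronized void release() {
            // Called after the document is indexed: every stream handed out
            // becomes available again for the next document.
            free.addAll(inUse);
            inUse.clear();
        }
    }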

We’ve released this code as open source, so feel free to use it, and send patches if you see anything to fix or improve.

The gory details are up on our GitHub repo; see especially FieldMergingProcessor and PoolingAnalyzerWrapper.

Note: there are some new capabilities in the latest Solr releases that address this in a different way, allowing for multiple suggesters to be run in a single request. This may be a nice feature to explore in the future.

