The European Lucene/Solr conference was held in Dublin from 4-7 November 2013, at the Aviva Stadium. This conference is the best place to meet other folks who are actively engaged in working with these technologies, including many active committers to the Apache projects. Most of the actual collaborative work on Solr and Lucene seems to get done via the mailing list, IRC chat or other channels, but the conference is a time for everyone to get together and disseminate their work a bit more broadly, in a public forum. It’s sponsored by LucidWorks, and they did a very nice job organizing a smoothly-run, friendly and informative conference.
One thing I really like about this conference is its strong technical focus with a good spread across different levels of complexity. The talks range from user experience (“how I built my system using Solr”) to deep dives into the implementation details of the lowest levels of Lucene. For an idea of how technical this conference is, consider the keynote. Conference keynotes are usually made up of equal parts bloviation, prognostication and inspiration. They tend to be feel-good affairs designed to get you thinking about a broad landscape of ideas, but they tend to be short on specifics. This one was a bit different. Michael Busch (of Twitter) gave a keynote (see player below) whose main thrust (after covering some Twitter technical history) was a detailed explanation of the changes they had made in Lucene’s index format, specifically in its postings format (where terms are associated with documents), down to the level of individual bit allocation. Well, I found it inspirational, and as far as I could tell, the rest of the audience sat rapt with attention too. They certainly asked pertinent and probing questions. This wasn’t a surprise. This conference draws a bunch of super smart, highly focused people. And it’s probably worth mentioning, given all the recent media attention to the gender imbalance in our industry, that there were more women present than at the previous two Lucene Revolutions I’ve attended, certainly a positive trend.
I’ll give a quick rundown of my favorite presentations, but these only represent the talks I did attend. It wasn’t possible to see every talk, since there were usually four going on simultaneously, but there should be video and slides available soon for all. In particular, I really wanted to see Renaud Delbru’s talk (High Performance JSON Search and Relational Faceted Browsing with Lucene), since it seemed to resonate with my own work, but I was somehow mesmerized by Shai Erera into seeing his second talk (on faceted search), which was certainly interesting too. There were simply too many excellent presentations.
Michael Busch on Twitter’s Lucene-based search architecture
Busch’s talk was a highlight of the conference. (There’s now video posted, and you can also see an earlier related presentation.) It covered the evolution of Twitter’s search architecture from MySQL to regular (“vanilla,” as Busch called it) Lucene to the custom Lucene, “Early Bird,” that they use today. Early Bird is optimized for tiny documents (of course), frequent updates and superfast (“real time”) indexing. Twitter needs tweets to be searchable as soon as they are posted, so Early Bird holds its entire index in memory (on many hosts), never commits and opens a new reader for every search. In vanilla Lucene, new readers can only see new documents once a commit call is issued to the index writer, which has some cost, even when using the “near real time” reader feature (exposed in Solr as soft commits). In Early Bird, newly indexed tweets are available immediately, with no need to commit. A few other random facts about Twitter’s search:
- Indexes 500 million tweets per day, on average. The record was 30 thousand tweets per second.
- Serves 30 thousand queries per second
- Twitter uses the distributed Snowflake service to generate identifiers
- Twitter “archive” search is a separate service using vanilla Lucene
- Twitter projects are named after birds
- Tweets are stored on HDFS
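To make the “no commit” idea concrete, here is a toy sketch (purely illustrative Python, nothing like Twitter’s actual Java code) of a single-writer, append-only in-memory index in which a document becomes searchable the moment its postings are published, with no commit step at all:

```python
# Toy sketch of the Early Bird idea: a single writer appends postings
# in memory and advances a "published" document pointer; searchers see
# every document up to that pointer immediately, with no commit.
from collections import defaultdict

class RealtimeIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> append-only list of doc ids
        self.docs = []                     # doc id -> original text
        self.max_searchable = -1           # last fully indexed doc id

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        # "publish": the doc becomes searchable only after all postings exist
        self.max_searchable = doc_id

    def search(self, term):
        # readers never block the writer; they simply ignore postings
        # beyond the published pointer
        return [d for d in self.postings[term.lower()]
                if d <= self.max_searchable]

idx = RealtimeIndex()
idx.add("lucene revolution dublin")
idx.add("realtime search at twitter")
print(idx.search("search"))  # -> [1], visible immediately, no commit
```

The single-writer discipline is what makes this lock-free: readers only ever see fully published state, so no synchronization is needed on the posting lists themselves.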
Shai Erera (IBM) and Adrien Grand (Elasticsearch): Recent additions to Lucene
This joint presentation was detailed and informative. The speakers described two very new tools available only in the latest version of Lucene. Erera presented his work on the Replicator, which takes efficient snapshots of a live index for backup or mirroring, and Grand presented the sorted index. This is an implementation in Lucene of a concept that will be familiar to old-timers from the SQL world as a clustered index: it offers substantial performance speedups when results are sorted by the same field used to order the stored index records.

Erera and Grand each gave an additional talk as well, which was welcome, since they are engaging speakers who know their subjects well. Erera’s second talk was on his work with faceted search in Lucene: this is a different implementation from the more commonly used one available in Solr. The Lucene facets offer much better performance for very large (possibly hierarchical) facet taxonomies, at the cost of some additional up-front setup to create a separate facet index. Grand gave a survey of Lucene internals as a conference closer. These were not new developments, but I’m not embarrassed to admit I still learned a lot at this talk. OK, maybe I felt a little stupid when I realized that I had never really understood the major distinction between Lucene’s index (which is an array, at heart) and traditional relational indexes (which are B-trees and hash tables). Lucene’s array-based index is optimized for search performance, but updating it is much more complex (the index is segmented; segments are write-once and therefore lock-free, but have to be merged). Another very useful takeaway from Grand’s talk: for best performance, keep your index size (the total size of the index files, excluding the stored-field *.fdt files) less than the available free memory on your system, so the index can be held in the file system cache.
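To illustrate why a sorted index helps (a hypothetical sketch, not Lucene’s implementation): when documents are stored in the desired sort order, a top-k query can terminate as soon as it has found k matches, instead of scoring and then sorting every match:

```python
# Hypothetical illustration of the sorted/"clustered" index payoff:
# documents are stored in timestamp-descending order, so a
# "newest k matching docs" query is a sequential scan with early exit.
def top_k_newest(docs_sorted_desc, predicate, k):
    out = []
    for doc in docs_sorted_desc:   # index order == desired sort order
        if predicate(doc):
            out.append(doc)
            if len(out) == k:      # early termination: no global sort needed
                break
    return out

# illustrative data: already sorted by ts descending
docs = [{"id": i, "ts": 100 - i, "lang": "en" if i % 2 else "fr"}
        for i in range(10)]
# fetch the 3 newest English documents
print([d["id"] for d in top_k_newest(docs, lambda d: d["lang"] == "en", 3)])
```

Without the pre-sorted storage, the same query would have to visit every matching document and sort the results afterwards.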
Charlie Hull (Flax), Alan Woodward (Flax) on Turning Search Upside Down
Flax provides an updated version of the news-clipping service: companies hire them to be told of all mentions in the press, or wherever else their searches extend. Their problem has an interesting technical dimension, since it requires applying large numbers of complex searches against (relatively) small numbers of documents: the reverse of the usual case (a single search against huge numbers of documents). Their approach is to create an index of the queries, and to search that index with each incoming document to find queries that might match. After this reverse search, they perform a regular forward search for all the candidate queries. The details of the construction of the reverse index were fascinating, including the idea that AND queries can be represented in the index by a single term: since all of a query’s terms would have to occur in a matching document, one need only index one of them in the query index, ideally the least frequently occurring one.
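As a hedged illustration of the reverse-search idea (the names and data below are hypothetical, not Flax’s code): store each AND query in the query index under just its rarest term, then, for each incoming document, collect candidate queries via the document’s terms and verify each candidate in full:

```python
# Sketch of the reverse query index described above.
from collections import defaultdict

def rarest_term(terms, doc_freq):
    # pick the term least likely to appear in documents
    return min(terms, key=lambda t: doc_freq.get(t, 0))

def build_query_index(queries, doc_freq):
    # each AND query is indexed under a single representative term
    index = defaultdict(list)
    for qid, terms in queries.items():
        index[rarest_term(terms, doc_freq)].append(qid)
    return index

def match_document(doc_terms, query_index, queries):
    # reverse search: look up the document's terms to find candidates
    candidates = set()
    for term in doc_terms:
        candidates.update(query_index.get(term, []))
    # forward verification: an AND query matches only if ALL its terms occur
    return sorted(q for q in candidates if queries[q] <= doc_terms)

# AND queries expressed as term sets (illustrative data)
queries = {"q1": {"lucene", "twitter"}, "q2": {"solr", "facets"}}
doc_freq = {"lucene": 100, "twitter": 5000, "solr": 80, "facets": 20}
qindex = build_query_index(queries, doc_freq)

doc = {"twitter", "built", "search", "on", "lucene"}
print(match_document(doc, qindex, queries))  # -> ['q1']
```

Indexing only the rarest term keeps the candidate set small: a document that contains a query’s rarest term is the only kind of document that could possibly match that AND query.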
Otis Gospodnetić (Sematext) on Analytics and Metrics with a Solr backend
Analytics is an increasingly popular application of search as companies manage ever-larger amounts of data. Sematext offers several different products in this area, including a Solr analytics platform I plan to try out just as soon as I get a free moment! Gospodnetić’s talk combined a survey of trends in analytics with some Sematext implementation details. According to Otis, Elasticsearch is more focused on analytics (basically, doing math with lots of numbers) than Solr is. One highlight of this talk for me was the recycling of Dan Ariely’s laugh line about Big Data, which I hereby perpetuate (“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”). Maybe I’m a little behind the curve, but hey, I hadn’t heard it, so thanks, Otis. :)
Rajini Maski (Happiest Minds) on Access Control using Solr
The first talk I saw was Maski’s; I went because she was addressing a difficult problem that we’ve struggled with at Safari. I was hoping for a solution, of course, and although this work is not really complete (it remains a proof-of-concept implementation at best, and nothing open source), there was some interesting discussion of the challenges to be solved and possible approaches. The basic problem is limiting the visibility of documents using per-user constraints: only show search results for documents the user is permitted to access. Maski described the classic approach, which is to index content with group identifiers and assign users to groups. This works well when there is a small number of groups and the assignment of documents to groups doesn’t change frequently. For more dynamic filtering (per-user permissions, or frequently changing permissions), the heavy reindexing this approach requires is not practical, and Maski described three different forms of per-user filtering: filtering in the client, post-filtering in Solr, and pre-filtering using per-user BitSets held in memory for all logged-in users. This last approach looks especially promising for scenarios with a relatively small number of logged-in users.
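A minimal sketch of the pre-filtering idea, assuming the set of permitted documents is known per user (in real Solr this would be done with Lucene BitSets inside a filter; the names and Python here are purely illustrative):

```python
# Illustrative per-user bitset pre-filtering: keep one bit per document
# for each logged-in user, and AND the query's matches against the
# user's permission bits.
user_bits = {}  # user -> int used as a bitset over doc ids

def login(user, permitted_doc_ids):
    # build the user's permission bitset once, at login time
    bits = 0
    for doc_id in permitted_doc_ids:
        bits |= 1 << doc_id
    user_bits[user] = bits

def filter_results(user, matching_doc_ids):
    # pre-filter: a document survives only if the user's bit is set
    allowed = user_bits.get(user, 0)
    return [d for d in matching_doc_ids if allowed >> d & 1]

login("alice", [0, 2, 5])
print(filter_results("alice", [1, 2, 3, 5]))  # -> [2, 5]
```

The memory cost is one bit per document per logged-in user, which is why this approach fits best when the number of concurrently logged-in users is modest.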
This revolution will be televised
There was a very professional-seeming video crew on hand, and organizers were assiduous in making sure that all the Q&A would be caught on microphone, so I expect high-quality video to be made available soon: it will be a great service to the community to disseminate the learning, and yes, the Stump the Chump session too.
Flash! Videos are now becoming available
Here’s the stump the chump session, which I’m pushing since I won third place: