The Making of Lux, the XML search engine
About two years ago, I conceived the idea to develop an open source XQuery (i.e. XML Query) search engine. For a lot of reasons, this was a terrible idea. But I did it anyway. This is the story of Lux: what it is, why I made it, what it’s good for, and a bit about how it works. I wanted to title it Fiat Lux, but as it took more than one day, that seemed a bit presumptuous.
(You have probably seen tl;dr meaning “too long, didn’t read,” but I prefer “top left; bottom right,” a quick summary for impatient readers.)
Lasciate ogne speranza, voi ch’intrate
Why was this a terrible idea? First, XML is no longer cool (it was, once, really, although it takes some digging to find the breathless prognostications of the past). The zenith of XML mania was probably almost 10 years ago. Google Trends doesn’t go back that far, but this graph gives a fair picture of the declining interest since 2004 in XML qua XML. Along the same lines, note that the somewhat-trendy Safari Flow topics list doesn’t even mention XML, and offers only a few books like this one whose topic is ostensibly XML.
Another reason this was not such a smart decision is that there were already several native XML databases in existence, including at least three that are open source. For a fuller picture of the landscape, check out Ronald Bourret’s summary; note, though, that his last update was in 2010. Wikipedia currently lists only four XML databases on its page. One might well ask what happened to all the others on Bourret’s page. A more prudent person might have taken that virtual valley of XML database death as an omen, and yea, abandoned all hope.
Finally, building this system was not going to be simple. It would take a few months even to get a reasonably usable prototype up and running.
Then why? Why take on a mind-bogglingly complicated project with little or no commercial potential and only a limited audience? I’ll explain, but first, a little context.
Searching XML in PubFactory
The idea for Lux sprang from a deficiency in Safari’s PubFactory publishing platform. The first PubFactory versions were based on the powerful and full-featured MarkLogic XML database. Later versions were extended to support alternate back-end search and storage. In particular, we had begun to deploy sites using the open source Solr/Lucene full text search engine, which lacks significant built-in XML indexing and search capabilities. This was a bit of a scary undertaking for us.
At first, we decided not to do it (on a really critical, high-visibility project with intense deadline pressure). Then we tried delivering a version of the platform on eXist, one of the open source XML databases. That project was a mixed success. We decided to use an unstable version of eXist (1.5, I think), I believe in order to get access to its new Lucene full text index, since we had stringent search requirements from our publisher clients. This was probably a mistake: the stable version always seemed to be just around the corner, and the unstable one suffered from some really painful issues.
Although that project left some scars among our developers and IT staff, it did prove that the database-neutral middleware API we had built enabled us to switch among databases fairly easily. We also came to see that the hardest parts of the search problem were being solved by Lucene, and in some cases eXist was actually making it harder for us to get Lucene to perform the searches we wanted. These realizations gave us the confidence to take the next step, which was to abandon native XML support in our data store altogether. We found that we had developed sufficient expertise with XML arcana that we were able to implement what was lacking in Solr ourselves.
Essentially we pushed all the XML-aware processing to the extremes of our system, the input and output layers, and performed searches using more basic tools, from an XML-awareness standpoint, in the middle. For the XML pieces, we used Saxon for XSLT, essentially the only open source game in town, and JDOM as an object model in our Java-based framework. But the search engine only spoke text, numbers and dates.
This approach has been successful for us, but as I switched back and forth between our MarkLogic and Solr projects, I began to realize that we had lost something. Our system was stripped down to the bare essentials, and some really nice programmer conveniences had gone missing. Aside from various enterprise features we had to cobble together ourselves (scheduled backup, just to name one), a key missing component was the ability to perform complex ad hoc queries. Sure, we could search using Solr’s interface, but it was sadly impoverished in comparison to what we had become accustomed to.
That was the germ of the idea for Lux: to provide an ad hoc query capability for XML documents stored in Solr/Lucene. There was a clearly perceived need within our company for such a thing, and other companies and groups were providing an analogous capability in different technical environments. It seemed to me that if we wanted this thing, surely other people would, too.
I said that interest in XML has been waning, and JSON is often seen as the successor to XML, at least for messaging and APIs. But there is still a lot of XML. Add JSON to the trend graph, and you’ll see that while its “trend score” is growing, it is still only half of XML’s. And it’s really no wonder that we don’t need a lot of books about XML. XML itself is not really changing any more (in spite of some recent efforts, like µXML). Even if there are only a few XML books, and no XML topic on Safari Flow, if you search for “XML”, almost every book is a match. There are slightly fewer matches for “XML” than for words like “server” or “network,” but twice as many as there are for “JSON”. Really the lack of an XML category simply indicates the sheer pervasiveness, and essential boringness, today, of something that we have simply come to take for granted. Of course there is XML.
It is especially true that there is still a lot of XML in publishing. Even direct-to-HTML advocates who envision producing books entirely in HTML5, like Sanders Kleinfeld, are still producing XML. In spite of some fundamental disagreements about how to deal with empty elements and other parser arcana, we can still treat HTML5 as XML by using an HTML5 parser (like validator.nu) to generate good old SAX events, and thence a tree of nodes. XML tools remain relevant in the brave new world of HTML5 and EPUB3.
As for the competition, there really was no other software that fit the bill we had in mind: XQuery for XML documents stored in Solr. Although the end result, Lux, offers many of the same features as the XML databases, it is really a much simpler thing, more like a query parser on steroids than a database.
And the complexity of the challenge was part of the attraction for me. I’ve always had a penchant for taking on projects that are possibly just a little too involved, and I was looking for something to do, since my kids were too old to be read to every night. Besides, it wasn’t as if I was planning to build the whole thing. The really tricky parts were already complete: the XQuery interpreter (Saxon), the search index (Lucene), and the robust server implementation (Solr). I would just be gluing them all together. Very early in the development I decided to re-use as much of these systems as possible, and to fit in to their way of doing things rather than to forge anew.
This decision was driven partly by necessity (working at home in the evenings and on the weekends, I simply didn’t have the bandwidth to build a search index or an XQuery interpreter from scratch) and partly by design: I knew that these tools were widely adopted and accepted, really simply the best available, at least in open source Java. The beauty of open source development is being able to draw from the best available, and to re-use the work of the brightest, most dedicated minds on a global scale.
Insert tab A in slot B
One key effort in this project was to convince two strangers, Solr/Lucene and Saxon, to coexist peacefully. We didn’t expect a perfect union, just some kind of detente mediated by a third party. There’s a real benefit to this strong decoupling between the language and the search engine. Lux can benefit from future innovations in both Saxon and Lucene while maintaining a relatively small footprint. There are of course efficiency costs to decoupling. For example the query optimization is a bit clumsier than it would be in a completely integrated system, but these costs are small relative to the benefits.
I made this diagram to give an idea of the complexity of this mediation project. It shows Lux’s largest components. I’ll describe just one of them here: the “lazy searcher,” since it represents a difference between the two systems that had to be mediated. In fact this component is called LuxSearcher in the code, but really should be called LazySearcher – it’s much more descriptive.
Some background is needed to properly frame the problem. I’ll first explain about lazy evaluation in XQuery, and a bit about document ordering. Then I’ll talk about the Lucene search() functions and how they are essentially incompatible with Saxon’s iterator-driven evaluation strategy (which implements lazy evaluation). Finally I’ll explain what we did to address the mismatch.
XQuery, the lazy language
The XQuery language is a functional language. It has no mutable state or side effects. In XQuery, variables, once set, cannot be changed, and the only effect of a function is to return a value. If you’re used to procedural languages, as most programmers are, writing programs like this can feel a little restrictive, but if you stick with it, you find that functions and recursion are powerful tools that give you a new way of thinking and often make otherwise complicated problems simple.
For the compiler writer, the immutability constraints offer the freedom to perform aggressive optimizations based on provable assumptions about code behavior. One key optimization is lazy evaluation. Lazy evaluation is a powerful technique for avoiding unnecessary work. It operates on the principle of doing nothing until absolutely required. Lazy evaluation is not really feasible in languages with side effects except in very limited scopes for which the compiler is able to prove that no side effects occur.
Note: there is a simple kind of lazy evaluation that shows up in almost every programming language called “short circuit” evaluation. Short circuit evaluation applies to Boolean expressions like (A and B), for which, when A evaluates to true, the evaluation of B is skipped. In a language with side effects, the language semantics must define that this is the behavior, since the program may behave differently depending on whether B is evaluated or not (it might involve a change to a global variable, for example).
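This familiar behavior is easy to demonstrate. The following is a minimal Java illustration (the class and method names are my own, invented for this example): because the left operand of `&&` is false, the right operand, which carries a side effect, is never evaluated.

```java
// Demonstrates short-circuit evaluation: sideEffect() is skipped entirely
// when the left operand of && already determines the result.
public class ShortCircuit {
    static int calls = 0;  // counts how often the side effect actually runs

    static boolean sideEffect() {
        calls++;           // the side effect: mutate a global counter
        return true;
    }

    public static void main(String[] args) {
        boolean a = false;
        // Since a is false, (a && anything) is false; sideEffect() is skipped.
        boolean result = a && sideEffect();
        System.out.println(result);  // false
        System.out.println(calls);   // 0 -- sideEffect() was never called
    }
}
```

The language specification has to promise this behavior, because a program with side effects could observe the difference.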
In a functional language, it’s not necessary to tell the programmer that this is going on, since there isn’t any way to tell from the behavior of the program. I think many XQuery programmers don’t understand the extent to which the evaluator is ignoring their code. I certainly didn’t. As an example, consider this code sample, drawn from the Connect4 programming challenge we wrote about on this blog last February. The idea is to compute whether the game is complete, i.e. whether there is a winner.
The main method here, c4:compute-winner, checks every cell on the board to see whether it forms part of a run of 4, either across or down.
declare function c4:compute-winner ($game as element(game))
{
  for $row in (1 to 7), $col in (1 to 6), $dir in $c4:dirs//dir
  let $cell := c4:get-cell ($game, $row, $col)
  return c4:check-cell ($game, $row, $col, $cell, 1, $dir)
};
c4:check-cell checks a single cell by testing whether the maximum length of consecutive cells of the same color is at least 4.
declare function c4:check-cell ($game as element(game), $row, $col, $cell, $len, $dir)
{
  if (c4:count-run($game, $row, $col, $cell, $len, $dir) >= 4) then $cell else ()
};
c4:count-run advances one cell in a given direction and checks whether the neighboring cell has the same color. If it does not, the length of the run is known and is returned; otherwise it adds one to the length and recurses to the next cell in the same direction.
declare function c4:count-run ($game as element(game), $row, $col, $cell, $len, $dir)
{
  let $y1 := $row + $dir/@y
  let $x1 := $col + $dir/@x
  let $nabor := c4:get-cell ($game, $y1, $x1)
  return
    if ($nabor = $cell and string($cell))
    then c4:count-run ($game, $y1, $x1, $nabor, $len + 1, $dir)
    else $len
};
declare function c4:get-cell ($game, $row, $col)
Ultimately c4:compute-winner is used in an expression like exists(c4:compute-winner(...)) simply to detect whether or not the game is over. There’s no need to compute the lengths of all the runs of cells for every cell on the board once we’ve found one whose length is 4. When we wrote this code, my partner Marc Moskowitz and I struggled for a while to come up with a way to terminate the deep recursion and iteration early. But this is not possible in XQuery 1.0: there is no break statement or early return, and no exception handling. We decided this didn’t matter, and our program ran fast enough, so we forgot about it and moved on.
Later on, I looked into this more deeply and discovered that – lo and behold – Saxon was actually performing the “early exit” for us, by virtue of lazy evaluation. This is kind of remarkable, at least if you come to the problem with a procedural mindset. It’s as if the expression down at the bottom – count-run(...) >= 4 – were able to call back up through 3 or 4 levels of function nesting, plus several intervening iterations (for expressions), and say “Hey you up there, we have what you need, you can stop calling us already!” As if that bottom-most expression somehow knew the uses to which its result was going to be put, and could intuit that none of them mattered any more now that the essential fact had been uncovered.
Of course none of that is what is happening. What is going on is that Saxon constructs an iterator for just about every expression that can return more than one item. Your program becomes a great big tree of iterators, and these iterators simply sit there ready to compute the next item in the sequence when they’re asked to. It’s lazy up, down and sideways.
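The shape of such an evaluator is easy to sketch in plain Java. This is a toy model, not Saxon’s actual classes: each “expression” is an iterator that produces an item only when asked, so a consumer that stops early never triggers the rest of the work.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// A toy model of an iterator-driven evaluator: expressions are composed as
// a tree of lazy iterators, and work happens only when next() is called.
public class LazyTree {
    static int computed = 0;  // counts how many items were actually produced

    // A lazy "range" expression: yields 1..n on demand.
    static Iterator<Integer> range(int n) {
        return new Iterator<Integer>() {
            int i = 0;
            public boolean hasNext() { return i < n; }
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                computed++;
                return ++i;
            }
        };
    }

    // A lazy "map" expression wrapping another iterator, like a nested FLWOR.
    static Iterator<Integer> square(Iterator<Integer> in) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return in.hasNext(); }
            public Integer next() { int v = in.next(); return v * v; }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> expr = square(range(1_000_000));
        // Pull only the first three items, then stop, as exists() or a [1]
        // predicate effectively would.
        for (int k = 0; k < 3 && expr.hasNext(); k++) expr.next();
        System.out.println(computed);  // 3, not 1000000
    }
}
```

Nothing in the tree is told to stop; the deeper iterators simply never get asked for more items.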
But what does any of this have to do with searching? Before I explain that, we need to add one more peculiarity of XQuery into the mix: document order.
Document order in XQuery
XQuery (following XPath) defines an ordering criterion called “document order.” Document order within a document is well-defined; it refers to the sequence of nodes in the tree as they would be found in a serialized (i.e. text) document. Many expressions returning nodes (in particular path expressions) are specified as returning nodes sorted in document order. This is even true for nodes returned from many documents, although document order among documents is not well-defined; it is left up to an implementation to define what this order means.
It’s a good idea for Lux to be able to declare to Saxon that document sequences are already in document order, since without that, many simple expressions such as collection()[1] (which returns essentially a random document) get interpreted as, roughly, “get all the documents, sort them, and return the first one.” This would be a very expensive operation indeed if Saxon were to carry it out, but it can be a very cheap one if Lucene does, because in order for Saxon to do any work with a document, it needs to retrieve it in full from the index and load it into memory.
Lux has to provide some reasonable default document ordering among its documents. The ordering may be arbitrary, but it must be stable, so that a single query that returns two lists of documents gets them back in the same order. Once this is taken care of, Saxon can use its lazy evaluation to retrieve only the documents it needs to produce the ultimate query result, without any need for additional caching or sorting. Absent any other considerations, we simply choose the order that is easiest for Lucene to produce, namely its internal document id order.
Searching in Lucene
To implement the XQuery collection() function in Lucene, we essentially need to provide the results of a MatchAllDocsQuery, a special query that simply matches all the documents. In Lucene, searching is performed using the IndexSearcher class, which offers several search() method variants.
For those of you not already familiar with Lucene’s internals, a very, very brief explanation of how searching works in Lucene: In order to return the N most relevant documents for a search results page, sorted by relevance, Lucene (roughly speaking) finds all documents matching a query, computes a relevance score for each one, and maintains a priority queue of length N, sorted by that score. It might sound expensive to consider all the documents, and in a way it is, but the beauty of Lucene is the impressive array of optimizations that it deploys to make searching tractable even for very large numbers of documents.
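Stripped of all of Lucene’s machinery, the core of that top-N computation can be sketched in a few lines of plain Java (this is my own much-simplified illustration, not Lucene’s implementation): keep a bounded priority queue whose head is the worst of the current top N candidates, and evict the head whenever the queue overflows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of top-N selection by score using a bounded min-heap: every
// matching document is considered, but only the N best are retained.
public class TopN {
    public static List<Integer> topDocs(float[] scores, int n) {
        // min-heap of doc ids; the head is the lowest-scoring of the top n
        PriorityQueue<Integer> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Integer d) -> scores[d]));
        for (int doc = 0; doc < scores.length; doc++) {
            queue.offer(doc);
            if (queue.size() > n) {
                queue.poll();  // evict the lowest-scoring candidate
            }
        }
        // drain the heap, best score first
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) result.add(0, queue.poll());
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.1f};
        System.out.println(topDocs(scores, 2));  // [1, 3]
    }
}
```

The point is that the cost of the queue is bounded by N, no matter how many documents match.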
Lucene’s API is designed for this use case: iterate over all the matching documents, sort them, and return the top N. In order to implement collection() using Lucene, we need to determine the value of N to use. In the case where our queries return results in document order, we don’t care what the order is as long as it’s stable, so we can simply use Lucene’s unordered search function. This returns documents in Lucene’s internal document id order and gives the best performance, since no sorting is required, but we still need to pick some fixed value for N, the size of the result set.
If you were paying careful attention, you probably realized that there is no good choice for N. Because of XQuery’s lazy evaluation, the collection() function implementation simply has no way of knowing how many documents are going to be requested. Even in the very common case of retrieving just the first few documents, there is no way, inside the lux:search function, to know how many that will be. There is simply not enough context available to make that kind of decision.
Lux has two different strategies for dealing with this. One is to pick a reasonable number, like 100, cache that many results (ids only, not the entire XML documents), and retrieve more as needed (re-executing the query and paging forward to get the next batch of ids). This strategy is used when the query results are sorted by relevance or by a field value. But when documents are returned unordered (in “document” order), Lux can do better. Its lazy IndexSearcher exposes an iterator-style interface that tracks a position in the Lucene reader hierarchy, and only advances it when the caller requests the next document.
Doing this required peeling back one layer of the Lucene class hierarchy and implementing our own versions of IndexSearcher and DocIdSetIterator. This felt very expert and low-level at the time, but in fact Lucene’s API is designed to be extensible in many different directions, and the code turned out to be pretty simple in the end.
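The essential idea of such an iterator-style searcher can be sketched without any Lucene dependency. This is a conceptual sketch, not Lux’s actual LuxSearcher: a cursor over the doc id space that advances to the next match only when the caller asks, with a predicate standing in for a real Lucene query.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;

// Conceptual sketch of a lazy searcher: no result set is materialized up
// front; the cursor advances only when the caller requests the next match.
public class LazyMatchIterator implements Iterator<Integer> {
    private final int maxDoc;         // one past the highest doc id
    private final IntPredicate match; // stands in for a Lucene query
    private int next = -1;            // next matching doc id, or -1 if unknown
    private int cursor = 0;

    public LazyMatchIterator(int maxDoc, IntPredicate match) {
        this.maxDoc = maxDoc;
        this.match = match;
    }

    private void advance() {
        while (next < 0 && cursor < maxDoc) {
            if (match.test(cursor)) next = cursor;
            cursor++;
        }
    }

    public boolean hasNext() { advance(); return next >= 0; }

    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        int doc = next;
        next = -1;
        return doc;  // doc ids come back in index order, i.e. "unordered"
    }

    public static void main(String[] args) {
        // match every third document out of a million; pull only two matches
        LazyMatchIterator it = new LazyMatchIterator(1_000_000, d -> d % 3 == 0);
        System.out.println(it.next());  // 0
        System.out.println(it.next());  // 3
        // the remaining matches are never visited
    }
}
```

A consumer that stops after a few documents pays only for the doc ids it actually scanned past, which is exactly the behavior Saxon’s lazy evaluation needs from the index.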
As an aside, we do also have a future optimization planned that should improve the performance of deep paging of queries that are sorted by field value. In this case, we should be able to cache the last sort key value for each page of results, and jump start the iteration of the next page at that position by adding an additional filter to the query.
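That planned optimization amounts to keyset-style pagination. Here is a sketch under that assumption (hypothetical, not code in Lux; the sorted list of keys stands in for field values in index order): instead of skipping over all the results of earlier pages, the next query filters on “key greater than the last key seen.”

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of keyset (seek-style) paging: resume from the last sort key of
// the previous page instead of re-scanning and discarding earlier results.
public class KeysetPaging {
    public static List<String> nextPage(List<String> sortedKeys,
                                        String lastKey, int pageSize) {
        return sortedKeys.stream()
            // keep only keys strictly after the previous page's last key
            .filter(k -> lastKey == null || k.compareTo(lastKey) > 0)
            .limit(pageSize)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("ant", "bee", "cat", "dog", "eel");
        List<String> page1 = nextPage(keys, null, 2);   // [ant, bee]
        List<String> page2 = nextPage(keys, "bee", 2);  // [cat, dog]
        System.out.println(page1 + " " + page2);
    }
}
```

In the real index, the filter would be expressed as an additional range clause on the sort field, letting Lucene skip directly to the start of the requested page.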
The strategy of ordering documents by docid (“unordered”) also works in Solr, at least for a single-core index. When we cross over into distributed (Cloud) land, documents may be spread across multiple cores, hosts, and Lucene indexes, so docid ordering is no longer sensible. Instead, we define an explicit document identifier (based on document URIs) that we also use for ordering.
Out we came and saw the stars again
I’ve described Lux, the XML search index based on Solr, Lucene and Saxon, the story of its creation, and the solution to one of the conundrums it presented. If you’ve read this far and are curious to learn more about Lux internals, see Lux’s documentation, and its code.