Chapter 4. Basic Searching

Text searching has two principal stages: specifying what text to match, and then putting the result set of matches into the desired shape, which includes filtering, grouping, and ordering. In this chapter, we discuss everything that Sphinx offers to support these goals: legacy matching modes, full-text query syntax, and additional non-full-text tools for processing result sets.

Matching Modes

In the beginning, there were keywords, and nothing but keywords, and no query syntax was supported, and Sphinx just matched all keywords, and that was good. But even in that innocent antediluvian age, diverse people were asking for various querying patterns, and ranking methods, and we heard them, and thus and so matching modes were cast upon Sphinx. And they were four, and accessible via SphinxAPI and its younger brother, SphinxSE they were.

Nowadays, matching modes are just a legacy. Even the very concept of a “matching mode” is already deprecated internally. But we still have to quickly cover them, as two out of three searching APIs (SphinxAPI and SphinxSE) support them and default to a certain legacy mode for compatibility reasons.

Legacy modes were a predefined combination of (very simple) query parsing rules, query-to-document matching rules, and a specific ranking method (called a ranker).

There are four legacy matching modes: ALL, ANY, PHRASE, and BOOLEAN. You could switch between modes using the SetMatchMode() call in SphinxAPI. For instance, the following call in PHP sets the PHRASE mode:

$client->SetMatchMode ( SPH_MATCH_PHRASE );

In ALL, ANY, and PHRASE modes, queries were interpreted as “bags of keywords” and then matched and ranked as specified by the mode. BOOLEAN, in addition, supported the basic Boolean operators (AND, OR, NOT, and parentheses).

ALL

Documents that match all of the keywords are returned. Documents are ranked by how closely the matched words resemble the query (phrase proximity to the query).

ANY

Documents that match any of the keywords are returned. Documents are ranked based on the degree of the phrase proximity to the query, and the number of unique matching keywords in every field.

PHRASE

Documents that match the query as an exact phrase are returned. Documents are ranked based on the fields in which the phrase occurs, and their respective user weights.

BOOLEAN

Documents that match a Boolean expression built from keywords, parentheses, and the AND, OR, and NOT operators are returned. Documents are not ranked. It was expected that you would sort them based on a criterion other than relevance.

In addition, there’s one nonlegacy matching mode:

EXTENDED

Documents that match an expression in Sphinx query syntax are returned. (Query syntax supports keywords, parentheses, Boolean operators, field limits, grouping keywords into phrases, proximity operators, and many more things that we will discuss in detail shortly.) Documents are ranked according to one of the available ranking functions that you can choose on the fly.

There were several problems with the legacy matching modes.

First, they were very limited. There was no way to do anything even slightly fancy, like, say, matching “Barack Obama” as an exact phrase and “senator” and “Illinois” as plain keywords at the same time.

Second, they essentially tightly coupled query syntax and a ranking function. So, for instance, when using the ALL mode, you could not ask Sphinx to just apply lightweight ranking and skip keyword positions for speed. In that mode, Sphinx always computes a rather expensive proximity rank. Or the other way around, if you liked the ranking that ANY yielded, you couldn’t get it while matching all words or matching a phrase, on the grounds that the ANY ranking function was nailed onto its matching mode with nine-inch titanium nails.

Third, once we introduced query syntax support, all the matching modes became just limited, particular subcases of that generic, all-encompassing syntax. That’s the course of progress and redundancy in the modern world. The milkman’s lot isn’t as sought after as it once was...

Last but not least, Sphinx used to have a different code path internally for every matching mode, and that was of little help when maintaining and improving it.

The EXTENDED mode fixes all of this. It decouples query syntax from ranking; you can choose a ranking function separately (using either the SetRankingMode() API call or the OPTION ranker=XXX SphinxQL clause). And adding new full-text querying features does not involve a new “matching mode” anymore; you just change your queries.
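For instance, either of the following picks the BM25 ranking function, via SphinxAPI and SphinxQL, respectively (a minimal sketch; the index name test1 is made up):

$client->SetRankingMode ( SPH_RANK_BM25 );

SELECT * FROM test1 WHERE MATCH('one two three') OPTION ranker=bm25;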

So, in version 0.9.9, we internally switched everything to use a unified matching engine, formerly exposed only under the EXTENDED matching mode. When you use one of the legacy modes, Sphinx internally converts the query to the appropriate new syntax and chooses the appropriate ranker. For instance, the query one two three will be internally rewritten as follows:

ALL

Query: one two three

Ranker: PROXIMITY

ANY

Query: "one two three"/1

Ranker: MATCHANY

PHRASE

Query: "one two three"

Ranker: PROXIMITY

BOOLEAN

Query: one two three

Ranker: NONE

Special characters such as quotes and slashes that are reserved in query syntax will also be escaped in rewritten queries.

For compatibility reasons, SphinxAPI and SphinxSE default to the ALL matching mode, so to use query syntax or fancier new ranking functions, you have to explicitly switch to EXTENDED mode:

$client->SetMatchMode ( SPH_MATCH_EXTENDED );

The MATCH() operator in SphinxQL always uses EXTENDED mode, so you don’t have to do anything there to get query syntax.
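For example, the following SphinxQL query freely mixes a field limit and a phrase, with no mode switching required (the index name test1 is, again, made up):

SELECT * FROM test1 WHERE MATCH('@title "hello world"');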

Full-Text Query Syntax

Sphinx text query syntax builds upon three major cornerstones:

  • Keywords

  • Operators

  • Modifiers

Keywords are just the words you search for, which are treated as atoms, the most basic query building blocks. The chief magic happens courtesy of operators that combine keywords in different ways. Combining keywords with Boolean AND and OR, limiting searching to a given field, and phrase and proximity matching are all operators in Sphinx’s book. Operators take keywords (and sometimes other expressions) as their arguments, and transform them into expressions. We might refer to those as full-text expressions to avoid ambiguity with arithmetic expressions such as 1+2/3. Finally, modifiers are attached to keywords, and affect the way keywords match.

Known Operators

At the time of this writing, Sphinx supports the following operators:

Operator AND

Default implicit operator. Matches when both of its two arguments match. Example (with three keywords and two implicit AND operators between them):

lettuce bacon tomato

Operator OR

Matches when either of its two arguments matches. Examples:

one | two
"gloomy Sunday" | "Scarborough fair"

Operator NOT

Matches when the first argument matches, but the second one does not. For compatibility reasons, both ! and - are recognized as NOT. Examples:

shaken !stirred
shaken -stirred

Grouping operator (parentheses)

Explicitly denotes the argument boundaries. Examples:

(red | green | blue) pixel
bond -(eurodollar bond)

Field limit operator

Matches when its entire argument expression matches within a specified field, or a part of a field, or a set of fields. The operator is @ and is followed by the field name (in the most basic version). Examples:

@title hello
@title[50] cruel world
@(title,content) one (two | three)
@* match me anywhere

Phrase operator

Matches when argument keywords match as an exact phrase. Takes only keywords as arguments. Examples:

"Richard of York gave battle in vain"
"All your base are belong to us"

Keyword proximity operator

Matches when all argument keywords are found within a given limited distance of each other. Takes only keywords as arguments. Examples:

"breakfast Tiffany"~5
"Achilles tortoise"~10

Quorum operator

Matches when at least N argument keywords match, where N is a given threshold. Takes only keywords as arguments. Examples:

"good fast cheap"/2
"single sane sexy smart"/3

Strict order operator (operator BEFORE)

Matches when its two arguments not only match, but also occur in exactly the same order as in the operator. Examples:

ladies << first
north << (east | west)

NEAR operator

Matches when its two arguments not only match, but also occur within a given limited distance from each other. Examples:

bill NEAR/5 monica
(red | black) NEAR/5 (hat | coat)

SENTENCE operator

Matches when its two arguments not only match, but also occur within the same sentence. Takes only keywords and phrases as arguments. Requires the sentence and paragraph indexing feature (the index_sp directive) to be enabled. Examples:

pizza SENTENCE anchovies
acquisitions SENTENCE "fiscal year"

PARAGRAPH operator

Matches when its two arguments not only match, but also occur within the same paragraph. Takes only keywords and phrases as arguments. Requires the sentence and paragraph indexing feature to be enabled. Examples:

light PARAGRAPH darkness
"harley davidson" PARAGRAPH "marlboro man"

ZONE limit operator

Matches when its entire argument expression matches within a specified document zone, or a set of zones. Requires the zone indexing feature (the index_zones directive) to be enabled. Examples:

ZONE:h1 john doe
ZONE:(h1,h2) jane doe

More operators might be implemented over time, so this list isn’t carved in stone, and you should refer to the documentation for the most recent version of Sphinx for updates.

Escaping Special Characters

Query syntax reserves several characters as special operators: parentheses, braces, quotes, vertical pipe, dash, exclamation point, and slash, among others. But sometimes you’ll want the query parser to treat them as ordinary characters in keywords. For example, say you want to index tweets and still be able to search for @sphinxsearch in them. How do you avoid a conflict with the field limit operator?

The answer is the standard one used in Unix utilities and generally in programming languages: escape the characters using backslashes.

@tweetdata \@sphinxsearch rocks

In this example, the first @ is a bona fide field operator, whereas the second is treated as a character to search for in the document. (And Sphinx discards the backslash itself.)

But watch out: you have to be careful lest your environment or programming language consume the escaping character. Depending on what environments your query passes through, you might need extra backslashes. For instance, when you run this in your MySQL client:

mysql> SELECT * FROM test1 WHERE MATCH('hey \@sphinxsearch');

the MySQL client actually processes the escaping backslash itself, leaving Sphinx to receive the query without the backslash and to think that you wanted to reference a field, resulting in an error:

ERROR 1064 (42000): index test1: query error: no field 'sphinxsearch'
found in schema

So, you need to have two backslashes, one for the MySQL client and one for Sphinx:

mysql> SELECT * FROM test1 WHERE MATCH('hey \\@sphinxsearch');

And speaking of programming environments, SphinxAPI provides a special call named EscapeString() that escapes all known special characters. What it does is nothing more than a straightforward string replacement, so you can reimplement it yourself if needed, but notice that the version in SphinxAPI gets continuously updated as more features and special characters are added.
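Here is what using it might look like in PHP (a minimal sketch; the index name is made up):

// escape reserved characters, then search for them literally
$raw     = "hey @sphinxsearch!";
$escaped = $client->EscapeString ( $raw ); // hey \@sphinxsearch\!
$result  = $client->Query ( $escaped, "test1" );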

AND and OR Operators and a Notorious Precedence Trap

As you can see, most of the operators allow not just keywords, but also expressions as their arguments. Parentheses can be arbitrarily nested, phrases or quorums can be put into NEAR or BEFORE operators, and so on.

This immediately brings up the question of operator precedence. For instance, is the query (one two | three) going to be interpreted as ((one two) | three) or as (one (two | three))? The correct answer is the latter, because OR has a higher priority than AND. The list of operators in the order of their precedence levels (highest to lowest) is:

  1. SENTENCE, PARAGRAPH, phrase, proximity, quorum, parentheses

  2. OR

  3. NOT

  4. BEFORE, NEAR

  5. AND

One particularly important implication of this is how OR and AND work together. It’s a frequent mistake to have a query such as:

turtle doves | French hens | calling birds | golden rings

that will, because of higher OR priority, disambiguate to:

turtle (doves|French) (hens|calling) (birds|golden) rings

Not quite the expected result! We definitely didn’t have any “turtle rings” in mind, combined with either “hens” or “calling.” The right way to express that query would be to use parentheses and explicitly group word pairs together:

(turtle doves) | (French hens) | (calling birds) | (golden rings)

Along the same lines:

(stray SENTENCE cat|dog)

would, in fact, be interpreted as:

((stray SENTENCE cat) | dog)

In this case, the “stronger” SENTENCE operator wins over the “weaker” OR operator, exactly as the “stronger” OR won over the “weaker” AND in the preceding example. So, to match “stray” in the same sentence with either “cat” or “dog”, one would need to use:

(stray SENTENCE (cat | dog))

NOT Operator

The NOT operator likes company. So much, in fact, that it can’t bear being alone, and querying for just !alone will result in an error message mentioning that the “query is non-computable.” (There are several different flavors of this message, but they all basically mean the same thing.)

A more appropriate (but less readable) name for this operator would be AND NOT, because that’s how it actually works. The problem with handling a single-argument NOT X is that matching all documents that do not match X means matching all documents in the first place, then omitting those that do match X from the list. And that can be a huge number of documents. Although Sphinx can do this, such queries are most likely the result of human error, so Sphinx chooses to protect you from wasting a lot of server effort matching a gazillion documents by mistake.

In case you want exactly that behavior, there’s an easy workaround. Just attach and index a magic keyword to every document, and query for it explicitly:

allthedocuments !X

Another subtlety is that NOT comes in two flavors. Both the hyphen (-) and the exclamation point (!) can be used as NOT interchangeably. Well, almost. There’s a difference in behavior between the two when NOT’s special character occurs in the middle of a word. Hyphens sometimes do occur in keywords, whereas exclamation points normally do not. So to follow the principle of least confusion, Sphinx never treats a hyphen within a keyword as an operator. However, it always handles an exclamation point as such.

For example, let’s say you’re querying for foo-bar. Sphinx will treat the hyphen as keyword data. Thus, when the dash is in charset_table, Sphinx will simply process this entire query as a single keyword with a dash. And when it is not, it will replace the dash with whitespace, resulting in a query equivalent to (foo bar), with two keywords. On the other hand, foo!bar will always be handled as two keywords with a NOT, equivalent to (foo !bar).
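For reference, here is how a dash can be added to charset_table so that foo-bar stays one keyword (a sketch showing only the relevant lines; U+2D is the character code for the hyphen):

index test1
{
    # ... other settings here ...
    # U+2D (hyphen) is now a valid keyword character
    charset_table = 0..9, A..Z->a..z, _, a..z, U+2D
}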

Field Limit Operator

The field limit operator is a commonly used operator, because people frequently want to limit their searches to an email subject header, or a forum thread title, or a more complicated set of fields.

The syntax is @fieldname—and it was conceived before Twitter. (Well, maybe not before the first public release of Twttr by Odeo, but definitely before everyone and his dog was surprised to discover himself on Twitter by Twitter Inc.). If we could do it over, we might have gone with something that wouldn’t conflict with the suddenly popular use of the at sign. Or not.

Formal precedence rules for this operator are rather cumbersome. Luckily, they’re very easy to explain informally:

The field limit operator constrains everything that follows up to either the next field operator or the next right parenthesis.

The operator comes in four different forms:

@fieldname

Simplest form that constrains matching to a single field

@fieldname[n]

Where n must be an integer, limits matching to the first n keywords within a given field

@(field1,field2,...)

Field set form, which limits matching to any of the fields in a comma-separated list

@*

Resets field matching and removes all limits

But what do I mean by “constrains everything” in the definition? This is perhaps best explained with an example that combines field limits and OR. Consider this query:

Holmes | @author Doyle | Watson | Lestrade

What will this match? As a matter of fact, it will match documents that either mention “Holmes” anywhere, or mention “Doyle”, “Watson”, or “Lestrade” in the author field. Thus, the field limit operator affects everything after the field name (or names in field set form), and the only thing that can stop it from doing so is the right parenthesis. So, if our true intent is to ransack data for documents that either were written by Sir Doyle, or mention any of our three illustrious imaginary friends anywhere, but are not required to appear to be written by the latter two, it is our understanding that it would be in our best interests to alter the query in one of the following ways, sir:

Holmes | (@author Doyle) | Watson | Lestrade
Holmes | @author Doyle | @* Watson | Lestrade
Holmes | Watson | Lestrade | @author Doyle

The first variant uses the grouping operator (parentheses) to limit the scope of the field limit operator. The second one resets field matching back to matching all fields. And the last one just puts the field limit operator at the end of the query so that there’s nothing left for it to constrain.

Phrase Operator

The phrase operator lets you match exact phrases: verbatim quotes, names of persons and locations, and other word sequences of that ilk. Both the operator and its syntax are de facto standard across modern search systems. The syntax is double quotes that enclose a set of keywords:

"to be or not to be"
"Albert Einstein"
"New York"

Only keywords are allowed within quote marks, and any special characters are ignored. For instance, "(red|blue) car" is equivalent to "red blue car". Neither grouping nor OR (nor any other operator) happens inside the quotes.

The phrase operator works with keyword positions, and any peculiarities that you might experience always boil down to positions of keywords either in the index or in the query. In the previous chapter, we discussed how "Microsoft Office" and "Microsoft in the office" queries produce different results when “in” and “the” are stop words that do not get indexed but still influence positions. The latter query actually matches "Microsoft ? ? office" where any keyword is allowed to appear in place of the ? placeholder. As a reminder, setting the stopword_step directive to 0 when indexing makes the two queries equivalent again.
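For reference, that setting goes into the index definition along with the stop word list itself (a sketch; the stop word file name is made up):

index test1
{
    # ... other settings here ...
    stopwords     = /usr/local/sphinx/stopwords.txt
    stopword_step = 0
}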

But the phrase operator involves a few more complications, related again to the settings in effect when creating the index.

Sphinx defaults to ignoring the punctuation, flat out: periods, hyphens, question marks, exclamation points, what have you. This means a "punctuation flat" query will match the preceding sentence, because its keywords are adjacent to each other. They are separated with a comma in the text, but that comma gets removed and ignored, and does not affect assigned keyword positions. But there are two indexing-time directives that change that.

First are the phrase_boundary and phrase_boundary_step directives. They work together to give punctuation some “positional” influence that you can choose. phrase_boundary takes, as its value, a list of characters in a format charset_table uses, while phrase_boundary_step takes a number. Characters specified in phrase_boundary incur a position increment as specified in phrase_boundary_step when indexing. In hopefully plainer English, it’s as though we insert several dummy words every time we notice a phrase boundary (a dot, comma, exclamation point, etc.). After you put common punctuation characters in phrase_boundary and rebuild the index, keywords separated by those characters obtain some personal space, and "punctuation flat" stops matching the previous paragraph:

index test1
{
    # ... other settings here ...
    phrase_boundary      = ., ?, !, U+2C # 2C is a code for comma
    phrase_boundary_step = 3
}

phrase_boundary is very straightforward and merely increments the position counter in the index every time it sees a boundary character, without paying any attention to context. Consequently, a document containing “John D. Doe” gets indexed with that extra position increment between “D” and “Doe”.

Second, there is an index_sp directive that enables some smarter sentence and paragraph indexing code that properly handles a few exceptions to the “separator is a separator is always a separator” rule when it comes to handling periods. A period will not be considered a boundary in the following situations:

  • In the midst of an abbreviation (e.g., “the U.S.A. and Canada”)

  • At the end of an abbreviation (e.g., “Yoyodine Inc. has existed since 1800”)

  • After a middle initial (e.g., “John D. Doe”)

More empirical rules will likely be added in the future to handle further edge cases, but these three yield surprisingly good results. If nothing else, a "john d doe" query starts to deliver a phrase match again with the index_sp feature enabled.

Last but not least, phrase matches can never span multiple fields. Full keyword positions are, in fact, composed of a field ID and a position within that field, so a change in field is a major change in overall position. To emulate matching over field boundaries, you would need to concatenate fields when fetching data:

sql_query = SELECT id, CONCAT(first_name, ' ', last_name) name ...
            FROM documents

Keyword Proximity Operator

The keyword proximity operator matches groups of words that are not necessarily in exactly the specified order, but happen to be in close proximity to each other. The unit for counting proximity is keywords, so everything about keyword positions that we just discussed applies to the proximity operator as well. Truth be told, the original reason for implementing the phrase boundary feature was to emulate sentence-level matching with the proximity operator.

Like phrases, the keyword proximity operator works only on sets of keywords, so its syntax builds upon the phrase syntax and adds a proximity threshold with a tilde:

"Achilles tortoise"~3

This matches if both “Achilles” and “tortoise” appear in the document, and if there are no more than two words between them. For instance, “Achilles catches the tortoise” matches, and so does “Tortoise caught by Achilles”. If the proximity threshold is n, the document does not match when there are n or more extra words between the matching keywords. The document matches if and only if fewer than n keywords “dilute” the matching span.

Here are some rules regarding proximity matching:

  • The order of the keywords within the operator does not impact matching. It can, however, affect ranking; the degree of phrase matching is generally computed using the order of keywords in the query.

  • All words—stop words as well as keywords—contribute to the count. For instance, “Achilles never catches the tortoise” has one word too many to match the previous query, because “the” is part of the count even though it’s a stop word and is not in the index.

  • When more than two words are part of the proximity match, the count applies to all words, and not each group of two. For example, "rock paper scissors"~1 will match any permutation of the three keywords, but will not match “rock, paper and scissors.” One extra “and” in this document is one filler word too many.

So, the rule of thumb is ~1 allows any permutation of keywords but absolutely no extra words inserted between them, ~2 allows fewer than two extra words between keywords, ~3 allows fewer than three extra words between keywords, and so on.

Quorum Operator

A quorum specifies how many keywords must be present in a group to qualify for a match. A quorum looks syntactically similar to the phrase and proximity operators, working like they do with a group of keywords, but unlike them, it does not care about keyword positions in the document. It only requires that “enough” distinct keywords are found there.

The syntax is:

"good fast cheap"/2

The preceding code matches documents that have at least two out of the three keywords. Documents that have all three keywords, of course, also match. Documents with just one keyword do not match.

Just like the proximity operator, keyword order does not affect matching, but can affect ranking when the ranking function uses phrase proximity as a factor.

A quorum with a threshold of 1 is completely equivalent to an OR of all keywords.

Quorum matching is particularly useful when matching all keywords does not return any results, but matching any of the keywords with an OR can return too many matches and can be too slow. Using a quorum operator with a threshold greater than 1 provides a nice trade-off for that case.
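For instance, a simple fallback strategy might first demand all keywords and then retry with a quorum (a hedged PHP sketch; the index name is made up):

// try matching all keywords first
$result = $client->Query ( "good fast cheap", "products" );

// if nothing matched, retry with a 2-out-of-3 quorum
if ( $result && $result["total_found"]==0 )
    $result = $client->Query ( '"good fast cheap"/2', "products" );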

Strict Order (BEFORE) Operator

The strict order operator, also sometimes referred to as the BEFORE operator, works on keyword positions, but does not care how many words come between the specified keywords. It matches when its arguments occur anywhere in a single field, so long as they occur in the order specified.

The syntax is two consecutive less-than signs, and arbitrary subexpressions are allowed as arguments. A single less-than sign would not be recognized as an operator and would be silently ignored:

ladies << first
north << (east | west)

To match, both arguments must be in the same field. A document that has “north” in the title field and “east” in the content field does not match north << east, even if the title field preceded the content field in sql_query when indexing. (Technically, Sphinx retains the ordering of fields, but taking that into account is rather error-prone.) To match that query, the document needs both “north” and “east” in one field, and in the right order. A document titled “North America and Far East” would match. “To the east of North 19th Street,” however, would not.

The precedence of BEFORE is lower than that of OR, just like AND precedence is, so be careful about the same notorious catch:

turtle << doves | French << hens | calling << birds

The preceding query is, because of the higher precedence of OR, equivalent to:

turtle << (doves | French) << (hens | calling) << birds

That’s obviously not as expected, and you should use parentheses to group the (turtle << doves) parts together. AND, however, has a lower precedence than BEFORE. So the following query:

partridge << turtle doves << French hens

is perhaps unexpectedly equal to:

(partridge << turtle) (doves << French) hens

and should be fixed using parentheses or double quotes:

partridge << (turtle doves) << (French hens)
partridge << "turtle doves" << "French hens"

It might help to think of AND as being like addition, BEFORE like multiplication, and OR like exponentiation. In an expression such as 1+2*3^4, you raise 3 to the power of 4 first, then multiply that by 2, then finally add 1. Similarly, in a query such as this:

one two << three | four

you compute the OR first, then plug the result into the BEFORE, and then finally do the AND:

(one & (two << (three | four)))

NEAR Operator

The NEAR operator is a generalized version of the keyword proximity operator that works with two arguments, which can be arbitrary expressions. (The proximity operator accepts only plain old keywords.)

The syntax is NEAR/n, where n is an integer representing the maximum distance allowed. The syntax is case-sensitive and space-sensitive; that is, near or Near is recognized as a keyword rather than an operator, and NEAR / 3 (with any whitespace around the slash) is parsed as the keyword NEAR followed by stray punctuation, not as a valid operator.

Distance is counted in keywords and works exactly as in keyword proximity, that is, NEAR/1 allows any permutation of the arguments but no keywords between them, NEAR/3 allows fewer than three keywords between them, and so on.

Because phrases with multiple words may appear in NEAR, think of n as the maximum distance allowed between the end of the leftmost argument’s occurrence and the beginning of the rightmost one. So, "Saturday night" NEAR/3 "Big prizes" will match “Big prizes given away Saturday night” because there are fewer than three words between the end of “Big prizes” and the beginning of “Saturday night”.

Even though NEAR is a generalized version of proximity, we need to note how the two behave differently. Specifically, the query one NEAR/4 two NEAR/4 three is not equivalent to "one two three"~4 because of differences in gap handling. Indeed, the first query allows up to three keywords between “one” and “two” and then up to three more between “two” and “three”. So a document such as “one x x x two y y y three” matches it. In contrast, the second query just allows up to three filler words between all of the matched keywords. The sample document we just considered has six, and thus will not match the second query.

SENTENCE and PARAGRAPH Operators

The SENTENCE and PARAGRAPH operators require the full-text index to be built with sentence and paragraph detection enabled, using the index_sp=1 directive. Paragraph boundaries are detected by the HTML stripper, so they additionally require the html_strip=1 directive. Without the proper indexing options, the resultant index will fail to store sentence or paragraph location information, and these operators will be reduced to an AND.
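The relevant index settings would look along these lines (a sketch showing only the relevant lines):

index test1
{
    # ... other settings here ...
    html_strip = 1
    index_sp   = 1
}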

The syntax is:

pizza SENTENCE anchovies

SENTENCE and PARAGRAPH must be written in uppercase. Otherwise, they’re handled as keywords to search for instead of as query syntax. Our example would match documents in which “pizza” and “anchovies” occur within a sentence.

Operator arguments are limited to individual keywords, phrases, and instances of the same operator. So the following two queries are valid:

wall SENTENCE "red right hand"
apples PARAGRAPH oranges PARAGRAPH bananas

But these two are not:

(red | black) PARAGRAPH tree
apples SENTENCE oranges PARAGRAPH bananas

Sentence and paragraph boundaries get detected at indexing time using a number of predefined rules. To catch sentence boundaries, punctuation is processed in the tokenizer using the following rules:

  • The question mark (?) and exclamation point (!) always indicate a boundary.

  • A dot (.) is usually a boundary, with a few exceptions, which I mentioned earlier when discussing phrase boundaries. A dot is not considered the end of a sentence:

    • In the midst of an abbreviation, as in “the U.S.A. and Canada”; defined as a single inline dot followed by a capital letter

    • At the end of an abbreviation, as in “Yoyodine Inc. has existed since 1800”; defined as a single inline dot followed by whitespace and a lowercase letter

    • After a middle initial, as in “John D. Doe”; defined as a single capital letter with whitespace to the left and a dot and whitespace to the right

Every HTML tag defined as block-level in the standard triggers a paragraph boundary. In HTML 4, those tags are ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Keyword positions get incremented on a boundary: that is, no keyword shares a position with a boundary. Sphinx then stores boundary positions in the index and uses those at query time to check whether there was a separating position between any keywords.

ZONE Limit Operator

Zone indexing and searching essentially adds support for hierarchical document structure, as opposed to the linear structure imposed by text fields.

Zones exist within fields, and map to HTML or XML markup elements. A zone might be everything between <TITLE> and </TITLE>, or <H1> and </H1>, or any other XML tags. Zones can be nested, as the following XML sample illustrates:

<chapter>
<title>Compressing Inverted Files</title>
This chapter discusses a variety of <emphasis>compression
techniques</emphasis>.
</chapter>
<intermezzo>Text in the middle!</intermezzo>
<chapter>
Another chapter content.
</chapter>

Note that it isn’t necessary to declare all tags as zones. You can choose, say, to index chapter and title as zones, but not intermezzo and emphasis.

Zones are named. Valid identifiers are accepted as names, with a limit of 127 bytes per name. Every collection and every document can contain an arbitrary number of different zones, and an arbitrary number of occurrences (spans) of any given zone. Sphinx cares only about the start and end of each span, so they can nest arbitrarily, and, technically, nothing prevents them from overlapping (which is forbidden in valid HTML and XML), as long as all open spans get closed.

You define what tags are to be indexed as zones in the configuration file using the index_zones directive. Once indexed, zones can be used to limit matching to specified zones only, just like fields.
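For the XML sample just shown, the configuration might look as follows (a sketch showing only the relevant lines; note that zone indexing also relies on the HTML stripper):

index test1
{
    # ... other settings here ...
    html_strip  = 1
    index_zones = chapter, title
}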

Unlike full-text fields, zones can overlap and nest, and are not limited in number. However, thanks to their simple fixed structure, fields map ideally to SQL columns and also are generally much more efficient to process. So, zones complement fields but don’t render them obsolete.

Two supported variants of zone limit syntax are:

ZONE:h1 only in header
ZONE:(h1,h2) only in header

Syntax is case-sensitive and whitespace-sensitive, so ZONE must be in uppercase, and spaces are forbidden.

Precisely like searches with fields, searches with zones can be limited to either a single zone or several zones at once, and the ZONE operator affects the rest of the query, until either another ZONE operator or a closing right parenthesis occurs. The ZONE limit operator behavior mimics the field limit operator in syntax as closely as possible.

Searches within a zone match everything in any individual span of that zone, including anything that is in a nested subzone. Consider the example document earlier in this section, indexed with chapter and title defined as zones. Querying for ZONE:chapter inverted matches because even though the most enclosing zone for “inverted” is title, that keyword is nevertheless enclosed by the parent chapter zone as well. Querying for ZONE:chapter another variety also matches. Although “another” and “variety” occur in different instances of the chapter zone, they both occur in some chapter, and therefore match the query. Finally, querying for ZONE:chapter middle does not match because none of the keyword’s parent zones are chapters.

Keyword Modifiers

In addition to operators, Sphinx also supports the notion of keyword modifiers. Some of the full-text operators (notably phrase, proximity, and quorum) allow only keywords for arguments, and exclude other operators. For instance, parentheses are not allowed within a quorum. Modifiers, however, can appear on keywords everywhere, including within a quorum, a phrase, or a SENTENCE operator. There are three such modifiers:

Exact form modifier (=)

Matches if the keyword occurs in that exact form, as opposed to matching stems. Requires both stemming and index_exact_words=1 enabled in the index settings; has no effect otherwise. Example:

=runs

Field start modifier (^)

Matches if the keyword occurs at the very start of a text field. Example:

^hello

Field end modifier ($)

Matches if the keyword occurs at the very end of a text field. Example:

world$

The exact form modifier comes into play when you run your words through any morphology processing—say, through stemming that replaces a keyword with its root form. By default, a stemmed index does not store the original keywords, so you would also have to explicitly set index_exact_words=1 and rebuild the index to enable the exact form modifier to work. In an index with exact words, querying for =runs matches only documents containing that particular word, whereas querying for runs without a modifier would still match any form that reduces to the same root, be it “running”, “runs”, or “run”. In a sense, the exact form modifier means “skip search-time stemming for this keyword.”
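The index settings needed for the =runs example to work might look like this (a sketch showing only the relevant lines):

index test1
{
    # ... other settings here ...
    morphology        = stem_en
    index_exact_words = 1
}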

Result Set Contents and Limits

We now know pretty much everything about full-text query syntax, but what exactly is Sphinx going to return from a query?

Sphinx’s output is called the result set, and it comes in two parts: matched documents (a.k.a. row data) and metadata. Matched documents are indeed just rows from the Sphinx database. These results always include the document ID and weight, and might also include additional attributes stored in the index and expressions computed on the fly. Metadata provides a few extra things of interest about the result set in general—the number of total matches, per-keyword frequency statistics, and so on.

When you send a query to Sphinx using a programming API, the result combines row data and metadata into a single structure. The specific structure used varies depending on the language you’re using (an associative array in PHP, Perl, and Python; a struct in pure C; a class in Java; etc.), but the structure of member names and their meanings stay the same across APIs. For instance, a result set dump in PHP looks like this:

Array
(
    [error] =>
    [warning] =>
    [status] => 0
    [fields] => Array
        (
            [0] => title
            [1] => content
        )
    [attrs] => Array
        (
            [group_id] => 1
            [date_added] => 2
        )
    [matches] => Array
        (
            [0] => Array
                (
                    [id] => 123
                    [weight] => 201
                    [attrs] => Array
                        (
                            [group_id] => 1
                            [date_added] => 1293982753
                        )
                )
        )
    [total] => 1
    [total_found] => 1
    [time] => 0.002
    [words] => Array
        (
            [test] => Array
                (
                    [docs] => 3
                    [hits] => 5
                )
            [one] => Array
                (
                    [docs] => 1
                    [hits] => 2
                )
        )
)

Typically, you would walk through matches and process data from them, as the following PHP pretty-printer snippet does. Don’t forget to handle errors, though. Reacting to errors is important.

// PHP SphinxAPI specific, return matches as a plain array
// (as opposed to an array indexed with document IDs)
$client->SetArrayResult ( true );

// do query
$result = $client->Query ( "my test query", "indexname" );

if ( !$result )
{
    // handle errors
    print "ERROR: " . $client->GetLastError();
} else
{
    // query OK, pretty-print the result set
    // begin with general statistics
    $got = count ( $result["matches"] );
    print "Query matched $result[total_found] documents total.\n";
    print "Showing matches 1 to $got of $result[total] accessible.\n";

    // print out matches themselves now
    $n = 1;
    foreach ( $result["matches"] as $match )
    {
        // print number, document ID, and weight
        print "$n. id=$match[id], weight=$match[weight], ";
        $n++;

        // print group_id attribute value
        print "group_id=$match[attrs][group_id]\n";
    }
}

We can see quite a number of things in the result besides the match data. Let’s cover them:

error

Error message for this result set. Meaningful only for multiqueries. The standalone Query() API call has a different convention (on error, it returns an empty result set, and the error message will be available through the GetLastError() call).

warning

Warning message for this result set. Meaningful only when using multiqueries.

status

searchd status code. Can take one of the following constant values:

  • SEARCHD_OK, meaning everything went fine

  • SEARCHD_ERROR, meaning there was an error processing this query, and no valid result set was returned

  • SEARCHD_WARNING, meaning the query completed and a valid result set was returned, but with warnings

  • SEARCHD_RETRY, meaning there was a temporary error handling the query, and the client should retry the query later

fields

A list of full-text fields in the queried index.

attrs

A list of attributes (columns) returned in this result set, along with their associated type numbers.

This list can be different from the list of attributes stored in the index, because we might choose in our query to have Sphinx not fetch some of the attributes, compute things on the fly and return them as attributes, and so on.

The numbers attached to attribute names, such as 1 and 2 in the sample dump shown earlier, are attribute type identifiers taken from the SPH_ATTR_xxx family of constants (SPH_ATTR_INTEGER, SPH_ATTR_TIMESTAMP, etc.). They don’t have to be sequential, although by coincidence they appear that way in the dump shown.

matches

A list of matches. Each match has an ID, a weight, and a list of values for the attributes specified in attrs.

total

The total number of accessible matches. (See the upcoming discussion of totals and limits.)

total_found

The total number of matches found in the index. (Also discussed shortly.)

time

Elapsed time, in seconds, with millisecond precision.

words

Per-keyword statistics. This is a list of keywords extracted from the query, along with the total number of documents that match each keyword (docs) and the total number of keyword occurrences in those documents (hits).

Row data is pretty much covered by attrs and matches. Those are the essentials of the search result. The earlier example contained two attributes, group_id and date_added, and their respective types, which are SPH_ATTR_INTEGER and SPH_ATTR_TIMESTAMP. There is just one match (to keep the example concise), with a document ID of 123, an assigned relevance weight of 201, and some attribute values.

All attributes defined in the index are returned by default. When you only need a few of those, use the SetSelect() API call to specify just the ones you need. It takes a single string argument, whose syntax is identical to an SQL select list clause (i.e., everything between SELECT and FROM). For example:

$client->SetSelect ( "author_id, year" );

Restricting the attributes to just what you need is useful not only to avoid clutter, but for client-side performance reasons as well. Fetching just 20 rows with 100 redundant attributes per row means unpacking 2,000 extra values and putting them into a result set. And in a slower scripting language such as PHP, Perl, Python, or Ruby, that results in a very noticeable performance impact.

Frequently, either you know the attribute type up front, or your language can dynamically convert between the types, so you just access the attributes and refer to them without further effort. If you need to figure out the type of the attribute dynamically, you can check the attribute type data in attrs, enabling you to write a generalized result set handler, or just verify your type assumptions.

Everything else besides attributes and matches is metadata. You can check for errors through the error, warning, and status members of the metadata. The fields member is rarely (if ever) used in practice, but is still provided for reference and debugging purposes.

total and total_found are the trickiest part of the metadata. Formally, total is defined as a number of accessible matches, that is, matches that you can actually request and receive from searchd; and total_found is defined as the total number of matches found in the index(es) searched, or in other words, a grand total count of all the matching documents that searchd just processed. And neither of these is the number of matches just returned in matches.

Consider, for the sake of example, the following (real-world) result set: total is 1,000, total_found is 36,123, and matches only contains 20 entries. How do these numbers fit together, really? Are they arbitrary? No. But they depend on a couple more options that we’ve been sort of hiding up our sleeve so far: query limits. The limits can be set using the following API call:

function SetLimits ( $offset, $limit, $max_matches=0, $cutoff=0 )

The limits’ defaults come into play here. offset and limit are the offset into the result set on the searchd side and the number of matches to pull from there to the application side, and they default to 0 and 20, respectively. (Think of the LIMIT offset, limit clause in MySQL.) Coming up next, max_matches is what controls the result set size on the searchd side. It defaults to 1,000, meaning searchd will keep track of the best 1,000 matches at all times, but never a single match more. cutoff is beyond the scope of the current example, but for the sake of completeness, it’s a threshold that lets you stop searching once it finds that many matches.

So, here’s what happens in the previous example. searchd runs the query, and finds and honestly processes 36,123 matches. That is reflected in total_found. However, as required by the max_matches setting, it only keeps, at most, 1,000 current-best matches in the server-side result set at all times. So, it can’t return a match number of 3,000; it just does not have it. This is reflected in total. Finally, offset and limit default to 0 and 20, which is why only 20 rows are returned in matches in the client-side result set.

You might be wondering why we even have those defaults—that confusing max_matches setting that effectively caps matching at 1,000 matches unless you explicitly bump it, and then an equivalent of LIMIT 0,20 on top. We have the defaults for performance reasons and to ensure clear memory usage constraints. It’s enormously more efficient to work with the top 1,000 matches than to process 10 million matches, keep them all in memory, and then throw them away because the query only wanted to show the first 10 matches on that first search results page. In our choice of a default value of 1,000, we were mimicking Google, which never lets you page past the 1,000th search result. (Sphinx lets you bump that limit easily, though. Just keep in mind that the limit needs to be bumped both in sphinx.conf and in the SetLimits() API call. Otherwise, a server-wide constraint will take precedence over the API call.)
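For instance, to let users page through up to 5,000 matches, you might bump the limit in both places (a sketch; the 5,000 figure is arbitrary):

# sphinx.conf, searchd section
max_matches = 5000

// SphinxAPI, PHP: page 1, 20 matches per page, 5,000 accessible matches
$client->SetLimits ( 0, 20, 5000 );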

When should you bump max_matches and what are the pros and cons of doing so? Our general recommendation is to keep max_matches within the 1,000 to 10,000 range. If you set it much higher (e.g., a range of 100,000 to 1 million matches), not only will this result in an immediate performance impact because Sphinx needs to preinitialize a result set with more matches, but it’s also an indication that you are highly likely to be doing something, ahem, suboptimal in your application. Most users only ever need the first few pages of search results, and nobody is ever going to actually sit and page through 1 million results. So there’s definitely no need to set max_matches higher than a few thousand when it’s a real, live person who’ll be consuming the search results.

But what if it’s not a person, but a computer program that needs to additionally process those search results? In our consulting practice, chances are still high that you can do better than merely raising max_matches through the roof. Result set processing on the Sphinx side, discussed in detail shortly, is sophisticated enough to either fully replace or at least significantly offload application-side processing. To supply you with an exaggerated example, there’s absolutely no sense in pulling 100,000 matches that match ipod and sorting them on the application side by price, as Sphinx can sort them for you much faster.

That being said, there still are viable scenarios in which you do have to pull very many matches. Mostly, those arise from data mining tasks, when Sphinx is essentially only used as a low-level keyword searching tool and complex data processing happens outside of it, in the application. Nevertheless, in many cases Sphinx can do everything you need to that result set. And, even with those bulky data mining tasks just mentioned, you can frequently at least do a rough check or preprocessing pass on the Sphinx side and reduce the number of rows that absolutely have to travel to the application.

Back to simple little things, time is the query time elapsed in Sphinx, with millisecond precision, exactly as logged into the query log. So the sample search earlier took 0.002 seconds (which is actually slow for a trivial search that matches one row, but Sphinx was warming up, the machine was busy swapping a bit, and I needed something other than zero for the example). Note that it does not include the network round-trip time spent to send the query to Sphinx and send the result set back. So, time as measured on the client application may and will vary.

Finally, there are per-keyword statistics in the words member of a result set. We can deduce that our query had two keywords, test and one, and that in our full-text indexed document collection, test occurs five times in three different documents, and one occurs two times but in just one document. That’s just general prerecorded statistics for the entire index taken from the dictionary. (And, for reference, you can very quickly extract those without actually doing any searching by using the BuildKeywords() API call.) The numbers of matched occurrences in matched documents only aren’t provided for performance reasons.
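For reference, pulling those dictionary statistics without searching might look like this in PHP (the index name is made up):

// third argument set to true requests per-keyword docs and hits counts
$words = $client->BuildKeywords ( "test one", "test1", true );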

Keyword statistics are there for fun and profit. It might be fun to display them along with search results. And then it’s useful to automatically adjust and rerun queries based on these statistics—say, remove a keyword that matches nothing to prevent a query from matching nothing and making the end user upset. We’ll discuss a few of those query rewriting techniques later.

Both results and metadata are, of course, available via SphinxQL as well. Attributes and rows are returned from the query itself:

mysql> SELECT * FROM test1 WHERE MATCH ('test one');
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |   3595 |        1 | 1293982753 |
+------+--------+----------+------------+
1 row in set (0.00 sec)

The SQL result set naturally contains a list of attributes (columns) and matches data (rows), but can’t include the metadata. So, you have to run an additional query to fetch it:

mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 1     |
| total_found   | 1     |
| time          | 0.001 |
| keyword[0]    | test  |
| docs[0]       | 3     |
| hits[0]       | 5     |
| keyword[1]    | one   |
| docs[1]       | 1     |
| hits[1]       | 2     |
+---------------+-------+
9 rows in set (0.00 sec)

The metadata is kept until the next search query (i.e., SELECT), so you can request it several times as needed. SHOW META itself is thus very quick, but normally results in an extra network round trip, obviously. However, if your MySQL client library allows for multiqueries and multiple result sets, you can send the SELECT and SHOW META in one batch, and eliminate that round trip. Recent enough versions of MySQLi in PHP and DBI in Perl are known to support that.
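With MySQLi in PHP, such a batch might look as follows (a minimal sketch; host and port are the usual SphinxQL defaults):

$link = mysqli_connect ( "127.0.0.1", "", "", "", 9306 );
$link->multi_query ( "SELECT * FROM test1 WHERE MATCH('test one'); SHOW META" );
do
{
    if ( $res = $link->store_result() )
    {
        while ( $row = $res->fetch_assoc() )
            print_r ( $row );
        $res->free();
    }
} while ( $link->more_results() && $link->next_result() );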

Searching Multiple Indexes

Under most circumstances, you will at some point need to maintain multiple indexes but search through all of them simultaneously. Otherwise, you’d have to store everything in a single, possibly huge, index. And that can only work well in a scenario with a few very specific conditions: when the document collection does not get updated on a daily basis; when it’s OK to utilize a single core for every given search; when you don’t need to combine multiple entity types when searching; and so on. Most real-world tasks are different, and you will likely need more frequent index updates (counted in minutes rather than weeks), scaling across multiple cores, and so forth. Both updates and scaling, as well as a few fancier tasks, require that you be able to search through multiple indexes and combine (aggregate) results. So, let’s look at how that works.

Searching through multiple indexes can be explicit, when you enumerate several indexes in your query call:

$client->Query ( "John Doe", "index1 index2 index3" );

Separators in the index list are ignored, so you can use spaces, commas, semicolons, or anything else.
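The SphinxQL equivalent is enumerating several indexes in the FROM clause:

SELECT * FROM index1, index2, index3 WHERE MATCH('John Doe');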

Sphinx will internally query every index independently, create a server-side result set (the top N best matches from each index, where N equals max_matches), and then combine the obtained sets, sort the combined matches once again (to restore the order you requested), and pick the top N best matches from all the indexes. This “combination” phase is, by default, very quick, unless you set max_matches rather high and there are many actual matches. Sorting several thousand matches in RAM is pretty quick.

The order of indexes in the query is important, however, because it can affect searching results under certain occasions. That’s a nonissue when no rows are shared among indexes, that is, every document ID is unique and only occurs in exactly one index. But when a document ID is duplicated and occurs in both result sets—a case that likely would involve different weights and attribute values!—we have to pick a single version of that document. Sphinx picks the “newer” version from the latter index in the list. For instance, if John Doe matches document 123 in both index1 and index3, and both matches make it into the respective result sets, the data from index3 wins. Note, however, that when document 123 isn’t in the intermediate result set for index3, the final combined result set will still contain data from index1, even if document 123 was actually matched. So, in a sense, matching documents from indexes specified later in the index list replace “older” matches. Therefore, in case of a conflicting duplicate row, you always get a “newer” weight and attribute data in a combined result set.

In made-up pseudo-SQL syntax, this process of eliminating duplicates and combining results can be described as follows:

CREATE TEMPORARY TABLE tmp ...

INSERT INTO tmp SELECT * FROM <index1> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>

REPLACE INTO tmp SELECT * FROM <index2> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>

REPLACE INTO tmp SELECT * FROM <index3> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>
...

SELECT * FROM tmp ORDER BY <order-condition> LIMIT <max-matches>

Internal index search order isn’t specified. In theory, Sphinx can decide to rearrange actual searches in whatever way it deems necessary. The final result set, however, is deterministic and guaranteed to stay the same.

But what does this have to do with quicker updates, scaling in general, and everyday use? The thing is, when using the disk-based indexing backend, partitioning data into multiple indexes is essentially the way to achieve both goals.

Basically, to speed up indexing updates, you put most of the data in a rarely updated “main” archive index (or index set) that only needs to be reindexed once in a while, and you put the tiny “dynamic” fraction of the data that changes actively into a separate “delta” index that can then be rebuilt (very) frequently. Then you search through both the “main” and “delta” indexes.
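In configuration terms, the scheme might look along these lines (a simplified sketch; the sph_counter table, the logic that maintains it, and duplicate handling are all omitted):

source main
{
    # ... db connection settings here ...
    sql_query = SELECT id, title, content FROM documents \
        WHERE id<=(SELECT max_doc_id FROM sph_counter)
}

source delta : main
{
    sql_query = SELECT id, title, content FROM documents \
        WHERE id>(SELECT max_doc_id FROM sph_counter)
}

Searching then goes through both indexes in one call, exactly as shown earlier: $client->Query ( "John Doe", "main delta" ).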

As for scaling, searches against a single index are single-threaded, so you have to set up several indexes to take advantage of multiple cores, CPUs, and disks, and you can search through all those indexes in one go just as well.

So, in one way or another, sooner or later you are going to divide and conquer and search more than one index in one go and have Sphinx combine the results via the routine we just discussed.

Result Set Processing

Result set processing is among the most powerful of Sphinx’s features. Interestingly, it doesn’t have anything to do with full-text searching. However, it has everything to do with the searching results format that the application sees.

Despite the advertised power, and inevitable tiny devils hidden in numerous details ensuing from said power, it’s still eerily simple to explain. Sphinx supports SELECT. Literally. Almost all of the SQL stuff, with a few Sphinx-specific extensions, too. That’s the definition of how Sphinx can process the result set for you. Admittedly, it’s too general and rather vague, but in a sense it’s complete. Now, for all those details...

Functionality-wise, there are these five cornerstones:

Expressions

When querying, you can access document attributes, compute arbitrary arithmetic expressions, and use the resultant values for filtering, grouping, or sorting purposes.

Filtering (WHERE clause)

The result set can be limited to matches that satisfy a certain condition.

Grouping and aggregates (GROUP BY clause)

The result set can be grouped by a given column. That is, a group of rows that shares a common value in that column can be replaced with a single row representing the group.

Sorting (ORDER BY clause)

The result set can be ordered by a given column or a set of columns, in either ascending or descending order.

Miscellaneous querying options (limits, ranking weights, etc.)

These options let you request different slices of the result set, use different ranking functions, early-stop query processing, and so on.

All of the preceding applies to full-text matches. So, on top of the core text searching operation, you can also add arbitrary arithmetic, filtering, grouping, aggregate functions (MIN, MAX, AVG, SUM), and ordering: pretty much everything SQL allows, and then some.

Expressions

The classic SQL SELECT lets you enumerate columns and calculate things, and so does Sphinx. In SphinxQL, you also use SELECT, as usual:

SELECT *, price_usd*1.35 AS price_eur FROM products ...

In SphinxAPI, you would have to use a SetSelect() call that takes everything you’d put between SELECT and FROM in SQL:

$client->SetSelect ( "*, price_usd*1.35 AS price_eur" );

Expressions can use the document ID, weight (relevance value), and attributes as their arguments. The four rules of arithmetic, standard comparison operators, Boolean and bitwise operators, and a few standard mathematical functions are all supported:

  • Arithmetic: +, -, *, /

  • Comparison: =, <>, >, <, >=, <=

  • Boolean: AND, OR, NOT

  • Bitwise integer: &, |

  • Standard mathematical functions: ABS, CEIL, FLOOR, SIN, COS, LN, LOG2, LOG10, EXP, SQRT, MIN, MAX, POW

Comparison operators are valid in a scalar context (as we don’t really have a Boolean one). So, (a>b)+3 is legal syntax that returns 4 when the two attributes are equal, or 3 otherwise. The equality and inequality comparisons (= and <>, respectively) on floats come with a feature that is rather unique to Sphinx. They compare values with a small threshold of 1e-6 (that’s approximately how much float precision is actually there when the absolute value is close to one). So, when a is 1e-7 and b is 1e-8, (a=b)+3 will return 4, even though a/b will return 10. This might be inconvenient. On the other hand, sqrt(3)*sqrt(3)=3 returns 1, and without the threshold it would return 0. This might be convenient. So, be careful about that if you’re working with extremely small 32-bit float values. (And in case you absolutely need bitwise comparison for those, IF() can help, as we will discuss shortly.)
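
You can observe the thresholded comparison firsthand with a constant-only expression; this sketch merely borrows the bundled test1 index as a row source:

# returns 1 thanks to the 1e-6 comparison threshold
SELECT *, (sqrt(3)*sqrt(3)=3) AS eq_check FROM test1 LIMIT 1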

The result type (and evaluation mode) is automatically deduced based on argument types and operations and can be a signed 32-bit or 64-bit integer, or a 32-bit floating-point value. That’s loosely based on how expressions work in the C language. However, Sphinx 32-bit sql_attr_uint attributes are (historically) unsigned integers. When evaluated in 32-bit integer mode, their values will be implicitly converted to signed when the operation works on signed, loosely mimicking C again. Then they will be converted back to unsigned when passing data back to the client. So, you might need to reinterpret them as signed values on the application side. Also, if you’re storing 32-bit unsigned integer values that actually utilize the most significant bit (that’s values over 2^31-1, or 2147483647) and do not want those to wrap around zero in the calculations, you might need to forcibly convert them to a signed 64-bit type using the BIGINT() function:

BIGINT(3123456789)*10

There’s one more conversion function called SINT() that converts its argument (an unsigned 32-bit integer) to a signed integer, returning a 64-bit value to make sure large values are preserved:

SINT(1-2)

There’s also one sort of “anti-conversion” function. Unlike C (but like SQL), integer division such as 3/5 is forcibly computed in floats (and returns 0.6 instead of the 0 that almost no one but a C programmer would expect). But returning the truncated integer 0 can also sometimes be necessary (to please C guys, if nothing else). So Sphinx supports an IDIV() function that takes two integer arguments and divides them as integers:

IDIV(3,5)
IDIV(mytimestamp,86400)
IDIV(mydatecode,1000)

Sphinx supports a few more functions that do something beyond fourth grade math, too. Some of them (such as IF(), IN(), and INTERVAL()) are modeled after MySQL and should be familiar to MySQL users, but might come with Sphinx-specific quirks. Other functions, such as GEODIST(), are entirely Sphinx-specific.

IF(cond,iftrue,iffalse)

IF() takes three arguments and returns the second one if the first one is nonzero, or the third one if the first argument is zero. For a floating-point first argument, though, IF(), unlike equality operators, operates through a simple bit comparison instead of using thresholds. So, these two lines will yield different results:

IF ( sqrt(3)*sqrt(3)-3<>0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )

The first one uses a “thresholded” inequality comparison, which tolerates and eliminates the slight floating-point inequality, hence IF returns a. The second one makes IF itself do the bitwise comparison to zero, and because of limited floating-point precision and round-off errors, the argument isn’t exactly zero. So, that IF returns b instead.

IN(expr,val1,val2,...)

IN() takes two or more arguments, and returns 1 if expr is found in the subsequent list of values, or 0 otherwise. The first argument can be an arbitrary expression, including a multivalue attribute (MVA), but the list values must be integer constants. Sphinx presorts the list of constants and does a binary search, so even with a huge list these checks are pretty quick. When the first argument is an MVA, IN() returns 1 when any of its values matches.

IN(year,2001,2009,1997)
IN(friend_ids_mva,30755,288,27353,19614,29567,3916,23445,5290)
IN(expr,@uservar)

Value lists can occasionally grow to be huge (up to millions of values), making sending them to searchd on every request a costly overhead—interestingly, (much) more expensive than the filtering itself. Sphinx lets you set a server-global user variable once (via SphinxQL, using SET @uservar=(1,2,3) syntax) and reuse it later. Global variables are shared among different connections but are not saved between searchd runs, and their contents will be lost on shutdown.
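
For example, with made-up attribute and variable names, a friends list could be uploaded once and then reused across many queries:

# upload the (potentially huge) value list once
SET @friends=(30755,288,27353,19614,29567)

# ...then reuse it in any number of subsequent queries
SELECT *, IN(author_id,@friends) AS is_friend FROM posts
WHERE MATCH('sphinx') AND is_friend=1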

INTERVAL(expr,point1,point2,...)

This returns the index of the last turning point that does not exceed the expression in the first argument; that is, INTERVAL() returns 0 when expr<point1, returns 1 when point1<=expr<point2, and so on. The turning point values must be in ascending order (point1 < point2 < ... < pointN) for the function to work properly. This function is useful for partitioning values into buckets and has a few applications. For instance, creating "facets" for a price bracket becomes trivial:

SELECT *, INTERVAL(price,30,100,300,1000) AS pricegroup
FROM products
GROUP BY pricegroup

NOW()

This takes no arguments and returns a Unix timestamp representing the moment the query began to execute. (So, it’s a kind of named constant, because it gets calculated only once per query for performance reasons, and doesn’t change from row to row.)
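
Combined with IDIV(), NOW() makes age computations one-liners; for instance, assuming a date_added timestamp attribute:

# document age in whole days at query time
SELECT *, IDIV(NOW()-date_added,86400) AS age_days FROM books
ORDER BY age_days ASC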

BITDOT(intval,val0,val1,...)

This interprets the first integer argument as a bit set, and sums all the arguments whose corresponding bit is 1 in that bit set. (The second argument corresponds to bit 0, the third to bit 1, etc.) For instance, BITDOT(5,a,b,c,d) will return a+c. Any of the function's arguments can be an arbitrary expression, but the first argument must be an integer type. The function, in a sense, is "just" syntax sugar, because theoretically, it could be emulated with bitwise and comparison operators, resulting in something awkward such as this:

(intval&1)*val0 + ((intval&2)<>0)*val1 + ((intval&4)<>0)*val2 + ...

BITDOT() can be useful for ranking when combined with a so-called FIELDMASK ranker function (discussed in a later chapter) that creates a bit set of matched fields. Another example is when you have object flags stored as bits and want to attach different weight boosts based on flag presence.
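
For instance, assuming a flags attribute where bit 0 marks in-stock items, bit 1 featured ones, and bit 2 discounted ones, a sketch of such a boost could look like this (the boost values are arbitrary):

# add 100, 50, and 10 points for bits 0, 1, and 2 respectively
SELECT *, @weight + BITDOT(flags,100,50,10) AS myrank FROM products
WHERE MATCH('ipod') ORDER BY myrank DESC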

GEODIST(lat1,long1,lat2,long2)

This computes a geosphere distance between the two points defined by their latitudes and longitudes, using the WGS84 model. Coordinates need to be in radians and the resultant distance is in meters. Any of the four input coordinates can be an arbitrary expression, and Sphinx internally optimizes them when any of the coordinate pairs is constant.
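
For example, assuming lat_rad and long_rad attributes that already store coordinates in radians, a 10 km radius search around a fixed point (the constants are sample radian values) might look like this:

# distance from a fixed point, in meters
SELECT *, GEODIST(lat_rad,long_rad,0.6569,-2.1366) AS dist FROM shops
WHERE MATCH('pizza') AND dist<10000.0
ORDER BY dist ASC

Should your version refuse the alias comparison in WHERE, the filter-on-expression trick shown in the next section works just as well.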

Filtering

A very common application requirement is to narrow down search results: for instance, to display books published in the past 10 years, or retrieve friends' blog posts, or list products available in the local neighborhood. With a text-only search engine that does not support storing user attributes and working with them, you would have to fetch all matches and pick the ones you need in the application. But in all honesty, that's an approach coming from the Land of Slow, south of Inefficient Mountains, just north of Terminally Crawling River. It can make for an especially unpleasant realization as you deploy from testing to production: a query that seemingly works A-OK with just a fistful of matches suddenly has to cope with millions.

So, what you really want instead, and what Sphinx lets you easily do, is have the search server filter the full-text matches based on a condition you specify. Not only does that save on sending gazillions of matches to the application, but it also empowers Sphinx to short-circuit searching as it goes. For one thing, documents that do not satisfy the filtering criteria are not relevance-ranked, and relevance ranking is a rather expensive effort. The bottom line: never filter on the application side. Always get the attribute data to Sphinx, and have Sphinx do the filtering.

SphinxAPI exposes the following three calls to perform filtering:

function SetFilter ( $attribute, $values, $exclude=false )
function SetFilterRange ( $attribute, $min, $max, $exclude=false )
function SetFilterFloatRange ( $attribute, $min, $max, $exclude=false )

The SetFilter() call is the API’s primary filtering workhorse. It lets you perform equality or presence checks on integer attributes and MVAs. Specify the name of the attribute (or computed expression) to check in the attribute parameter, an array of reference constant values to check against in values, and an optional exclude flag to tell Sphinx whether to include or exclude matches that pass the check. (By default, the results are included.)

Here are a few specific examples that illustrate different kinds of SetFilter() syntax and the equivalent WHERE condition syntax in SphinxQL:

# equality check
$client->SetFilter ( "year", array(2001) );
SELECT ... WHERE year=2001

# non-equality check
$client->SetFilter ( "year", array(2001), true );
SELECT ... WHERE year<>2001

# in-set presence check
$client->SetFilter ( "year", array(1997,2001,2009) );
SELECT ... WHERE year IN (1997,2001,2009)

# in-set absence check
$client->SetFilter ( "year", array(1997,2001,2009), true );
SELECT ... WHERE year NOT IN (1997,2001,2009)

The other two calls, SetFilterRange() and SetFilterFloatRange(), let you do comparisons instead of just equality checks. That is, they check whether the attribute value falls within an allowed range (as opposed to occurring in an allowed set of values). They are very similar, the only difference being the expected type of the min and max arguments: SetFilterRange() expects integers (either 32-bit or 64-bit), while SetFilterFloatRange() works with floating-point values. (Having two methods instead of one is mostly a host language restriction. If there were a reliable way to tell an integer value from a floating-point value in each and every popular scripting language that the API is ported to, a separate SetFilterFloatRange() method would be redundant.)

The attribute parameter is the name of an attribute or an expression again, and min and max are the allowed boundaries, inclusive. For instance, this is how you would check that a book was published in the 2000s and that its price is $50 or less:

$client->SetFilterRange ( "year", 2000, 2009 );
$client->SetFilterFloatRange ( "price_usd", 0, 50 );

SELECT ... WHERE year>=2000 AND year<=2009
    AND price_usd>=0 AND price_usd<=50

This example brings us to the question of what happens when you issue multiple SetFilter() calls. The answer is that all of them apply. In other words, all filters that you set via the API are ANDed together. There’s no way to OR filters.

Also, the filters get appended to any existing set of filters; they never replace previous filters. Therefore, this snippet (taken from real-world buggy code) would effectively never match anything:

$client->SetFilter ( "attr", array(1) );
$client->SetFilter ( "attr", array(2) );

As the two filters stack up, the snippet is equivalent to WHERE attr=1 AND attr=2, and that condition never holds true. If the developers intended to check that attr equals 2, the second SetFilter() call alone would have sufficed. To check that attr equals either 1 or 2, they should have enumerated both values in one call:

$client->SetFilter ( "attr", array(1,2) );

Enforced ANDing of filters can, at a glance, seem like a showstopper for queries that, say, need to match either top-rated or recent enough books. But, in fact, that’s a minor inconvenience at most; you still can do that. Remember that filters can be applied not just to the prerecorded document attributes, but to expressions computed on the fly as well. Hence, nothing prevents us from computing a condition expression, and filtering on that:

$client->SetSelect ( "rating>=8.0 OR year>=2000 AS mycond" );
$client->SetFilter ( "mycond", array(1) );

That’s the “official” way to perform complex Boolean filtering via SphinxAPI. In classic SQL, the trick we just used is equivalent to this:

SELECT *, rating>=8.0 OR year>=2000 AS mycond
FROM books WHERE mycond=1

And that’s exactly the syntax SphinxQL currently supports, too. However, it’s a bit clunky and pollutes the result set with a redundant mycond column that always equals 1. It would be cleaner to simply put the condition in the WHERE clause where it belongs:

SELECT * FROM books WHERE rating>=8.0 OR year>=2000

At the time of this writing, that syntax is not supported in SphinxQL, but eventually it will be. In the meantime (or if you are somehow locked to one of the earlier versions) you can always use the “filter on expression” approach shown before.

Going back to range filters, there’s another semisubtlety with the API methods that you may have spotted by now. Minimum and maximum range boundaries always come in pairs in API calls, and are inclusive. So, how would one check for a mere attr>3 condition via SphinxAPI? That depends on the attr type. When it’s an integer (we’re concentrating on integers because such precise boundary conditions don’t usually arise in floating-point situations), you can just replace “greater than 3” with “greater than or equal to 4”, and attach a redundant “is less than or equal to a maximum value of integer” condition:

$client->SetFilterRange ( "attr", 4, 4294967295 );

Alternatively, you can go with the “filter on expressions” approach again.
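
That is, compute the condition as an expression and require it to hold, sidestepping the boundary juggling entirely:

$client->SetSelect ( "*, attr>3 AS mycond" );
$client->SetFilter ( "mycond", array(1) );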

Sorting

An amazingly large number of questions in our current reality have more than just one answer, sometimes very many more, and search queries are no exception. And so—what mischievous little creatures they are!—they can very well return more than just one search result. Therefore, we can pose another question: how are those results to be sorted?

Oops, our very question also has more than one answer. The results of a search for a specific product are, beyond a doubt, to be sorted by price—ideally, shipping and handling and taxes included. Sorting news reports, on the other hand, should at least account for how recent the reports are, and if not, they should just sort by day posted. General web search results need to be sorted by relevance. And so on.

Shipping a million matches from Sphinx to the application does not magically become any less expensive just because we intend to sort them as opposed to filtering them, so Sphinx supports sorting on its side as well. The appropriate SphinxAPI method, dubbed SetSortMode(), comes with a few legacy modes of its own:

function SetSortMode ( $mode, $sortby="" )

The up-to-date approach to sorting is to use the SPH_SORT_EXTENDED mode and pass the sorting condition in its $sortby argument. That’s equivalent to an ORDER BY clause in SphinxQL:

$client->SetSortMode ( SPH_SORT_EXTENDED, "year DESC, @weight DESC" );
... ORDER BY year DESC, @weight DESC

One can use several attribute or expression names in this sorting condition, following each with a DESC or ASC order specification.

There are also five historic modes that can now be replaced with respective “extended” clauses:

SPH_SORT_RELEVANCE

The default sorting mode. Sorts by relevance, with the most relevant documents first. Equivalent to:

ORDER BY @weight DESC, @id ASC

SPH_SORT_ATTR_DESC

Sorts by the attribute specified in $sortby, in descending order. Equivalent to:

ORDER BY $sortby DESC, @id ASC

SPH_SORT_ATTR_ASC

Sorts by the attribute specified in $sortby, in ascending order. Equivalent to:

ORDER BY $sortby ASC, @id ASC

SPH_SORT_TIME_SEGMENTS

Sorts by a so-called time segment computed from an attribute specified in $sortby and the current time. Equivalent to:

SELECT *, INTERVAL($sortby, NOW()-90*86400, NOW()-30*86400,
    NOW()-7*86400, NOW()-86400, NOW()-3600) AS time_seg
...
ORDER BY time_seg DESC, @weight DESC, @id ASC

Time segments were introduced to sort documents by a combination of freshness and relevance. They split matched documents into six buckets: documents posted less than one hour ago, one hour to one day ago, one day to one week ago, one week to 30 days ago, 30 days to 90 days ago, and more than 90 days ago. Matches are then sorted by bucket, and by relevance within the bucket, so that documents posted in the past hour always rank higher than documents posted last week, but within that hour, day, or week, more relevant documents win.

With the advent of expressions, it’s now possible to replace that hardcoded time segment computation with an INTERVAL() call and customize the buckets to your liking.
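
For instance, here's a sketch of a coarser three-way split into "this week," "this month," and "older" buckets (the boundaries are picked arbitrarily):

# 2 = posted within 7 days, 1 = within 30 days, 0 = older
SELECT *, INTERVAL(date_added, NOW()-30*86400, NOW()-7*86400) AS time_seg
FROM books ORDER BY time_seg DESC, @weight DESC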

Just for the sake of completeness, there's a fifth mode, SPH_SORT_EXPR, which lets you sort by a C function hardcoded at build time in sphinxcustomsort.inl. That mode was introduced before expressions, to leave some room for "easy" customization. Now that runtime expressions are in place, it is highly unlikely to yield noticeably better performance, it is much harder to maintain, it is deprecated, and it is going to be removed some day. In short, never use it; always use runtime expressions instead.

And last but not least, on the subject of sorting, do you remember the max_matches setting that controls how many matches searchd keeps in memory? It's important to understand that this setting has no effect on sorting. No matter what max_matches is set to, it's guaranteed that the very best match (according to your sorting criteria) will be number 1 in the result set, the next best will be number 2, and so on. So, when you sort by relevance, the most relevant document among the gazillion matched ones is always guaranteed to be number 1, no matter whether max_matches is set to just 10, to 20,000, or to a gazillion.

Grouping

The very first thing that springs to mind when talking of grouping, as in SQL’s GROUP BY clause, is the different kinds of reports—how many site users registered that year, how many sales we generated each week, the peak login hours throughout the past month, et cetera, ad infinitum. So, support for grouping on the search engine side might, at first glance, seem peculiar.

But a number of search-related tasks require grouping as well. What’s the average offer price for “ipod” within a 1-mile, 10-mile, or 100-mile radius? How briskly did people blog about Haiti every day last year? How do you display news entries in clusters centered on the same topic and date? The answers involve grouping “raw” matches by this or that, and just as with filtering and sorting, it’s generally much more efficient to have Sphinx process a million matches than drag them outside and work through them.

Grouping via SphinxAPI is provided by the following two calls:

function SetGroupBy ( $attr, $func, $groupsort="@groupby desc" )
function SetGroupDistinct ( $attr2 )

SetGroupBy() tells Sphinx to group rows by the value of a function $func computed from an attribute $attr, and then sort the resultant grouped rows by a $groupsort condition. In SphinxQL, it's equivalent to this clause:

GROUP BY $func($attr) ORDER BY $groupsort

SetGroupDistinct() makes Sphinx count the number of distinct values of attribute $attr2 while grouping by some other attribute. The equivalent SphinxQL is (you bet) adding COUNT(DISTINCT $attr2) to the SELECT expressions list.
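
Put together, counting distinct sellers per city (the attribute names here are illustrative) would look as follows in both dialects:

$client->SetGroupBy ( "city_id", SPH_GROUPBY_ATTR );
$client->SetGroupDistinct ( "seller_id" );

SELECT *, COUNT(DISTINCT seller_id) FROM products GROUP BY city_id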

The calls add a few magic columns to the result set. SetGroupBy() adds an @groupby column that contains the value used for grouping and an @count that contains the number of rows in that group. SetGroupDistinct() returns the number of distinct $attr2 values in a magic @distinct column.

Here’s what the bundled test.php application displays when “just” searching and then when grouping (by a group_id attribute):

$ php test.php -i test1 test
Query 'test' retrieved 3 of 3 matches in 0.000 sec.
Query stats:
    'test' found 5 times in 3 documents

Matches:
1. doc_id=1, weight=101, group_id=1, date_added=2011-01-02 18:39:13
2. doc_id=2, weight=101, group_id=1, date_added=2011-01-02 18:39:13
3. doc_id=4, weight=1, group_id=2, date_added=2011-01-02 18:39:13

$ php test.php -i test1 -g group_id test
Query 'test' retrieved 2 of 2 matches in 0.000 sec.
Query stats:
    'test' found 5 times in 3 documents

Matches:
1. doc_id=4, weight=1, group_id=2, date_added=2011-01-02 18:39:13,
   @groupby=2, @count=1
2. doc_id=1, weight=101, group_id=1, date_added=2011-01-02 18:39:13,
   @groupby=1, @count=2

You can see how, with grouping enabled, we get two groups instead of three documents, along with the associated group-by keys and counts.

But hey, why aren’t the results in ascending relevance (weight) order anymore?

Remember that the SetGroupBy() call maps to both GROUP BY and ORDER BY clauses in SphinxQL. By default, it orders the groups that make it into the result set by the group-by key. The match sorting from the previous section (set with the SetSortMode() call, defaulting to relevance order) does not get left out either: when grouping is in effect, it's used to pick the one row that will represent the group in the final result set. In SQL terms:

  • When SetGroupBy() is enabled, SetSortMode() is equivalent to the WITHIN GROUP ORDER BY clause in the SphinxQL dialect.

  • Otherwise, SetSortMode() is equivalent to the ORDER BY clause.

WITHIN GROUP ORDER BY is a SphinxQL-specific extension. The SQL standard does not specify what representative row to pick for a group, and does not introduce any syntax to control the choice. So technically, an SQL database can even return a random row every time you repeat the query without breaking the standard. Our extension lets you specify which row to choose. For instance, as the default sorting mode is essentially @weight DESC, @id ASC, the most relevant row within each group will be picked by default. (If more than one row has the same top relevance value, the document with a smaller document ID wins. This explains why document 1 was returned in the earlier example.) But you can override that and, say, pick the most recently added row instead:

SELECT * FROM test1
GROUP BY group_id
WITHIN GROUP ORDER BY date_added DESC
ORDER BY @weight DESC

Unobtrusively switching from SphinxAPI to SphinxQL now, grouping isn't only about removing "duplicate" rows that share a common value in a given column; it's also about computing aggregate functions over such groups of rows. Two examples we've covered are COUNT(*) and COUNT(DISTINCT attr), but Sphinx supports more. Currently, you can use MIN(), MAX(), AVG(), and SUM(), which covers the ANSI SQL'92 standard. A few more sophisticated aggregate functions (e.g., bitwise operations, standard deviation and variance) may be added in the future, so refer to the current documentation for the most recent list.

SELECT *, AVG(price) FROM products
WHERE MATCH('ipod')
GROUP BY city_id

An important side note is that Sphinx's current implementation of aggregate functions can be intentionally imprecise in favor of performance and constrained RAM use. The degree of precision depends on the value of the max_matches option. Sphinx will only keep track of the 4*max_matches best groups (as per the sorting criteria) at any given time. Thus, when the result set contains fewer than that many groups, the aggregate values are guaranteed to be precise. Otherwise, the values can be (slightly) off.
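
So, when precise aggregates over many groups matter more than the last bit of performance, you can raise the limit on a per-query basis; in SphinxQL that's done via the OPTION clause (SetLimits() carries a max_matches argument in SphinxAPI). A sketch, reusing the earlier example:

# track up to 4*10000 groups for better aggregate precision
SELECT *, AVG(price) FROM products WHERE MATCH('ipod')
GROUP BY city_id OPTION max_matches=10000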
