QueryParser

You’ve now been introduced to all the different types of queries available in Ferret, and you’ve learned how to build different queries by hand. Some of it probably seems like a lot of work, and it’s certainly not something you’d ask a user to do. Luckily, we can leave most of the work to the Ferret QueryParser. You’ve already seen many examples of the Ferret Query Language (FQL) in the previous section (“Building Queries”), and you’ll have noticed that most of the queries you can build in code can be described much more easily in FQL. In this section, we’ll talk about setting up the QueryParser, and then we’ll go into more detail about FQL.

Setting Up the QueryParser

The QueryParser has a number of parameters, as shown in Table 4-1.

Table 4-1. QueryParser parameters

Parameter	Default	Short description
`:default_field`	`:*`	The default field to be searched; it can also be an array.
`:analyzer`	`StandardAnalyzer`	Analyzer used by the query parser to parse query terms.
`:wild_card_downcase`	`true`	Specifies whether wildcard queries should be downcased or not, since they are not analyzed by the parser.
`:fields`	`[]`	Lets the query parser know what fields are available for searching, particularly when the `:*` is specified as the search field.
`:validate_fields`	`false`	Set to `true` if you want an exception to be raised if there is an attempt to search a nonexistent field.
`:or_default`	`true`	Use `OR` as the default Boolean operator.
`:default_slop`	0	Default slop to use in `PhraseQueries`.
`:handle_parse_errors`	`true`	`QueryParser` will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to `false`.
`:clean_string`	`true`	`QueryParser` will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped.
`:max_clauses`	512	The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries.

The first thing you need to think about when setting up the QueryParser is which analyzer to use. Preferably, you should use the same analyzer you used to tokenize your documents during indexing. This analyzer will be used to analyze all terms before they are added to queries, except in the case of wildcard queries, since they’ll contain * and ?, which many analyzers won’t accept. Because of this, you’ll probably need to lowercase the wildcard query if the analyzer you used was a lowercasing analyzer. The exception to this rule is the use of wildcard queries on fields that are untokenized, in which case you might want to leave them as case-sensitive. To specify whether or not wildcard queries are lowercased, you need to set the parameter :wild_card_downcase. It is set to true by default.

The next thing you need to worry about is document fields. First of all, which fields are available to be searched? When the user specifies the field he wants to search, he can use an * to search all fields. For this to work, you need to set up the QueryParser so that it knows which fields are available. Simply set the parameter :fields to an array of field names. You can get the list of available field names from an IndexReader:

parser = QueryParser.new(:fields => reader.fields,
                         :tokenized_fields => reader.tokenized_fields)

The :fields parameter can be either a Symbol or an Array of Symbols. You can also set the QueryParser to validate all queries that use these fields. That is, each time a user selects a field to search, the query parser will check that the field is present in the @fields attribute, and if it isn’t, it will raise an exception. So, if your index has a :title and a :content field and the user tries to search in a :contetn field (note the misspelling), the QueryParser will raise an exception. To make the QueryParser validate fields, you need to set the :validate_fields parameter to true. It is set to false by default.

Once you have specified which fields are available, you need to designate which of those fields you want to be searched by default. Simply set the parameter :default_field to a single field name or an array of field names. You can even set it to the symbol :*, which will specify that you want to search all fields by default. :* is in fact the default value.

Next, you must decide if you want Boolean queries to be OR or AND by default. This involves setting :or_default to true or false. By default, it is set to true, but if you want to make your search more like a regular search engine, you should set it to false.

The QueryParser handles parse errors for you by default. It does this by trying to parse the query according to the grammar. If that fails, it tries to parse it into a simple term or Boolean query, ignoring all query language tokens. If it still can’t do that, it will return an empty Boolean query. No exception will be thrown and the user will just see an empty set of results. If you’d like to handle the parse errors yourself, you can set the parameter :handle_parse_errors to false. You can then let the user know that the query she entered was invalid.
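
One way to report invalid queries back to the user is to wrap the call to parse. The following is a minimal sketch, assuming a parser created with :handle_parse_errors set to false. The helper name parse_or_report is hypothetical, and it rescues StandardError only because this section does not name the exception class the parser raises:

```ruby
# Hypothetical helper: returns [query, nil] on success, or [nil, message]
# when the parser raises. `parser` is expected to be a QueryParser created
# with :handle_parse_errors => false; anything that responds to #parse works.
def parse_or_report(parser, query_string)
  [parser.parse(query_string), nil]
rescue StandardError => e
  # Assumption: the parse error is a StandardError subclass.
  [nil, "Sorry, we couldn't understand that query: #{e.message}"]
end
```

The caller can then run the search when a query comes back, or show the message next to the search box when it doesn't.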

To make it more robust, the QueryParser also has a clean_string method that makes sure brackets and quotes match up and that all special characters within phrase strings are properly escaped. For example, the following query:

(city:Braidwood AND shop:(Torpy's OR "Pig & Whistle

will be cleaned up as:

(city:Braidwood AND shop:(Torpy's OR "Pig \& Whistle"))

Perhaps you want to clean the query strings yourself or you would prefer to have an exception raised if the query can’t be parsed. To do this, set the :clean_string parameter to false.

Because MultiTermQueries have a :max_terms property, you can set the default value used for :max_terms by the query parser by setting its :max_clauses parameter. This will also affect the maximum number of clauses you can add to a BooleanQuery.
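
Putting the parameters together, a configuration for a typical site search might look like the sketch below. The option names are the ones discussed in this section; the values, the :title and :content fields, and the parser_options method name are illustrative assumptions:

```ruby
# Illustrative settings only; the keys are the parameter names from this section.
def parser_options
  {
    :default_field       => [:title, :content], # searched when no field is given
    :fields              => [:title, :content], # the fields that * expands to
    :validate_fields     => true,   # raise on unknown fields such as :contetn
    :or_default          => false,  # AND terms together, like a web search engine
    :wild_card_downcase  => true,   # the index was built with a lowercasing analyzer
    :handle_parse_errors => true,   # fall back quietly on malformed queries
    :clean_string        => true,   # balance quotes and brackets before parsing
    :max_clauses         => 512     # cap on Boolean clauses and expanded terms
  }
end

# With the ferret gem loaded, the hash would be passed straight to the parser:
#   parser = Ferret::QueryParser.new(parser_options)
```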

Ferret Query Language

The Ferret Query Language allows you to build most of the queries that you can build with Ruby code using just a simple query string. For simple queries, it matches what users have come to expect from years of using different search engines. But FQL allows you to build much more diverse queries than the usual search engine queries allow. FQL aims to be as concise as possible while still being readable and hopefully obvious to most users.

Query String Tokenization

Before we get into specific query types, we should first look at the way query strings are parsed. The following characters have special meaning in QueryParser:

            \ & : ( ) [ ] { } ! " ~ ^ | < > = * ? + -

Each of these characters, as well as whitespace, can be escaped with a backslash (\) to include it in a search term. Search terms are strings of characters separated by whitespace or by one of the special characters. Each search term will be further processed by the QueryParser’s analyzer, as described earlier in this chapter (unless it contains unescaped wildcard characters, such as * or ?). Here is an example of how a query string gets tokenized:

            'title:Shawshank\ Redemption +date:<20000604 +topic:(jail friendship)'

            => [
                 'title', ':', 'Shawshank Redemption', '+', 'date', ':', '<',
                 '20000604', '+', 'topic', ':', '(', 'jail', 'friendship', ')'
               ]
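
If you build query strings from raw user input, you may want to escape these characters yourself. The following is a minimal sketch, not part of Ferret: escape_fql is a hypothetical helper that puts a backslash in front of whitespace and every special character listed above:

```ruby
# Hypothetical helper: backslash-escape whitespace and every character the
# text lists as special in FQL, so the input is treated as a single term.
def escape_fql(text)
  text.gsub(/[\\&:()\[\]{}!"~^|<>=*?+\-\s]/) { |ch| "\\#{ch}" }
end

puts escape_fql('Shawshank Redemption')
# prints: Shawshank\ Redemption
puts "title:#{escape_fql('C++ (2nd ed.)')}"
# prints: title:C\+\+\ \(2nd\ ed.\)
```

Note that escaping * and ? this way also turns off wildcard matching for that term, which is usually what you want for text typed into a plain search box.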

TermQuery

To express the simplest of all queries in FQL, simply type the term you wish to find:

            'ferret'

When parsed by the QueryParser, this string will be translated to a query that will search for the term “ferret” in all fields specified with the :default_field parameter.

To constrain your search to a field other than the field(s) specified by :default_field, prefix your search with the field name followed by a colon. For example, if you want to search the :title field for the term “Ruby”, you would do so like this:

            'title:Ruby'

Searching multiple fields is easy, too. Simply separate field names with a | character. So, to search :title and :content fields for “Ruby”, you would type:

            'title|content:Ruby'

You can match all fields with the * character:

            '*:Ruby'

That’s all there is to specifying the field to search. If you want to search for documents that contain the term Ruby in both the :title and :content fields, you will need to use a Boolean query.

BooleanQuery

Most readers have used Boolean queries in search applications before. The most common syntax makes use of the + and - characters, + indicating terms that must occur and - indicating terms that must not occur. So, to search for documents on “Ferret” that preferably have the term “Ruby” and must not have the term “pet”, you would type the following query:

            '+Ferret Ruby -pet'

+ and - can also be rewritten as “REQ” and “NOT”, respectively. Ferret also supports the “AND” and “OR” keywords. “AND” has precedence over “OR”, but this can be overridden with the use of parentheses: ( and ). So, to search for a chocolate or caramel sundae, you’d type:

            '(chocolate OR caramel) AND sundae'

This could also be written as:

            '+(chocolate caramel) +sundae'

It’s just a matter of personal preference.

Field constraints can be applied to individual terms or whole Boolean queries wrapped in brackets:

            '+flavour:(chocolate caramel) +name:sundae'

Inner field constraints override outer field constraints, so the following is equivalent to the previous query:

            'name:(flavour:(chocolate caramel) AND sundae)'
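
When the clauses come from separate form inputs rather than a single search box, the + and - syntax is easy to generate. The following is a minimal sketch; boolean_fql is a hypothetical helper, not part of Ferret, and it assumes the terms have already been escaped where necessary:

```ruby
# Hypothetical helper: assemble a Boolean FQL string from three lists of
# clauses. Required clauses get "+", prohibited clauses get "-", and
# optional clauses are left bare.
def boolean_fql(must: [], should: [], must_not: [])
  clauses  = must.map { |c| "+#{c}" }
  clauses += should
  clauses += must_not.map { |c| "-#{c}" }
  clauses.join(' ')
end

puts boolean_fql(must: ['Ferret'], should: ['Ruby'], must_not: ['pet'])
# prints: +Ferret Ruby -pet
puts boolean_fql(must: ['flavour:(chocolate caramel)', 'name:sundae'])
# prints: +flavour:(chocolate caramel) +name:sundae
```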

PhraseQuery

As you would expect, in FQL phrase queries are identified by " characters. So, to search for the phrase “quick brown fox”, your query would be just that:

            '"quick brown fox"'

But Ferret phrase queries offer a lot more. You can specify a list of options for a term in a phrase. Let’s say we don’t care if the fox is “red”, “orange”, or “brown”. You could search for the following phrase:

            '"quick red|orange|brown fox"'

We could even accept absolutely anything in a term’s position. For example, the following would match “quick hungry fox”:

            '"quick <> fox"'

In the “PhraseQuery” section earlier in this chapter, we also discussed sloppy phrase queries. The phrase slop can be indicated using the ~ character followed by an integer slop value. For example, the following query would match the phrase “quick brown and white fox”:

            '"quick fox"~3'

As with other types of query, phrase queries can have field constraints applied to them:

            'content|title:"quick fox"~3'
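
The phrase syntax can be assembled programmatically as well. The following is a minimal sketch, assuming you hold the phrase as a list of positions; phrase_fql and its keyword arguments are hypothetical names, not part of Ferret:

```ruby
# Hypothetical helper. Each position is a String, an Array of alternatives,
# or nil to accept any term in that position (<>).
def phrase_fql(positions, slop: nil, fields: nil)
  words = positions.map do |pos|
    case pos
    when nil   then '<>'
    when Array then pos.join('|')
    else            pos.to_s
    end
  end
  phrase = %("#{words.join(' ')}")
  phrase += "~#{slop}" if slop                                   # slop follows the closing quote
  phrase = "#{Array(fields).join('|')}:#{phrase}" if fields      # field constraint comes first
  phrase
end

puts phrase_fql(['quick', %w[red orange brown], 'fox'])
# prints: "quick red|orange|brown fox"
puts phrase_fql(['quick', nil, 'fox'])
# prints: "quick <> fox"
puts phrase_fql(%w[quick fox], slop: 3, fields: [:content, :title])
# prints: content|title:"quick fox"~3
```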

RangeQuery

Range queries can be specified in a couple of different ways. The [] and {} brackets represent inclusive and exclusive limits, respectively. This syntax is inherited from the Apache Lucene query syntax. Let’s say you want all documents created on or after the 25th of April, 2006, and before the 11th of November (but not on that day). You would specify the query like this:

            'created_on:[20060425 20061111}'

In FQL, you can also express range queries that are bounded on only one side. The open bounds are identified by the > and < tokens. For example, if you want all documents created on or after the 25th of July, 2006, you would write the query like this:

            'created_on:[20060725>'

To find all documents created before that date, you could type this:

            'created_on:<20060725}'

Alternatively, you can use the >, <, >=, and <= tokens to specify singly bounded range queries. The previous two queries would be, respectively:

            'created_on:>= 20060725'

            'created_on:< 20060725'
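
The dates in these examples are written as YYYYMMDD so that comparing them as strings gives the same order as comparing them as dates. The following is a minimal sketch of generating such ranges from Ruby Date objects, assuming the :created_on field was indexed in that format; fql_date and created_between are hypothetical helpers:

```ruby
require 'date'

# Hypothetical helper: format a Date as YYYYMMDD, as in the examples above.
def fql_date(date)
  date.strftime('%Y%m%d')
end

# Inclusive lower bound "[" and exclusive upper bound "}".
def created_between(from, to)
  "created_on:[#{fql_date(from)} #{fql_date(to)}}"
end

puts created_between(Date.new(2006, 4, 25), Date.new(2006, 11, 11))
# prints: created_on:[20060425 20061111}
puts "created_on:>= #{fql_date(Date.new(2006, 7, 25))}"
# prints: created_on:>= 20060725
```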

WildcardQuery

Wildcard queries in Ferret make use of the * and ? characters. Just to reiterate what we covered in the “WildcardQuery” section earlier in this chapter, * will match any number of characters, whereas ? will match a single character only. So, the following query will match all documents with the terms “lend”, “legend”, or “lead” in the :content field:

            'content:l*e?d'
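
To get a feel for which terms a pattern accepts, you can translate it into a Ruby regular expression. This is an illustration of the wildcard semantics only, not a description of how Ferret evaluates wildcard queries; wildcard_to_regexp is a hypothetical helper:

```ruby
# Illustration only: turn a wildcard pattern into an anchored Regexp.
def wildcard_to_regexp(pattern)
  body = pattern.each_char.map do |ch|
    case ch
    when '*' then '.*'  # any number of characters, including none
    when '?' then '.'   # exactly one character
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z")
end

re = wildcard_to_regexp('l*e?d')
%w[lend legend lead led].each do |term|
  puts "#{term}: #{re.match?(term)}"
end
# prints "lend: true", "legend: true", "lead: true", and "led: false",
# one per line; "led" fails because ? needs a character between e and d.
```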

We can also use wildcard query syntax to create a few other types of queries. For example, to create a MatchAllQuery, you would type:

            '*'

Note that it makes no difference if we add a field constraint to this query. To find all documents with a price field, you might be tempted to type the following:

            'price:*'

But this will match all documents. Instead, you need to type the following:

            'price:?*'

You can also create prefix queries using wildcard syntax. Simply type the prefix and append * to the end. For example:

            'category:/programming/ruby/*'

This query will be optimized into a PrefixQuery by the QueryParser.

FuzzyQuery

In the “FuzzyQuery” section earlier in this chapter, we said that FuzzyQueries are to TermQueries as sloppy PhraseQueries are to standard PhraseQueries, so it should come as no surprise that FuzzyQueries use the same syntax as sloppy PhraseQueries. Instead of a “slop” integer, however, we have a “similarity” float, which must be between 0.0 and 1.0. Another difference is that FuzzyQueries have a default similarity of 0.5, so you don’t need to specify a similarity value at all. Let’s say, for example, that we wish to find all documents containing the commonly misspelled word “mischievous”:

            'mischievous~'

Or we could make the query more strict by increasing the similarity value, like this:

            'mischievous~0.8'

Remember that FuzzyQueries are expensive queries to use on a large index, so you may want to set the default prefix length as described at the end of the “FuzzyQuery” section.

Boosting a Query in FQL

Boosting queries in FQL is a simple matter of appending ^ and a boost value to the query (see the “Boosting Queries” section earlier in this chapter). For example, let’s go back to our Boolean search for “Ferret” where the results included the term “Ruby” but not the term “pet”. In this case, “Ferret” is the most important term, so we should boost it:

            '+Ferret^10.0 Ruby^0.1 -pet'

Note that it makes no sense to boost negative clauses in a Boolean query. We should also note that the boost comes after the slop in a sloppy PhraseQuery and the similarity in a FuzzyQuery:

            '"quick brown fox"~5^10.0 AND date:>=20060601'

            'mischievous~0.8^10.0 AND date:>=20060601'
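
Because the order of the suffixes matters, it can help to append them in one place. The following is a minimal sketch; with_modifiers is a hypothetical helper, not part of Ferret, that always emits the ~ value (slop or similarity) before the ^ boost:

```ruby
# Hypothetical helper: append "~value" first and "^boost" last.
def with_modifiers(query, tilde: nil, boost: nil)
  query += "~#{tilde}" if tilde
  query += "^#{boost}" if boost
  query
end

puts with_modifiers('mischievous', tilde: 0.8, boost: 10.0)
# prints: mischievous~0.8^10.0
puts with_modifiers('"quick brown fox"', tilde: 5, boost: 10.0)
# prints: "quick brown fox"~5^10.0
puts with_modifiers('Ferret', boost: 10.0)
# prints: Ferret^10.0
```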

Ferret by David Balmain