Highlighting Query Results

Query highlighting, like excerpting, is one of the newer features in Ferret, added in version 0.10. Highlighting takes a query and returns the data from a document field with all of the matches in the field highlighted. Excerpting, on the other hand, takes excerpts from the field, preferably ones that contain matching terms, and highlights the terms in those excerpts. Both the Ferret::Search::Searcher and Ferret::Index::Index classes have a highlight method. In this section, we’ll look at Index#highlight because it allows us to pass string queries instead of having to build Query objects. Otherwise, the two methods are essentially the same. To use the highlight method, you must supply a query and the ID of the document you wish to highlight. A number of optional parameters, listed in Table 4-3, describe exactly how you want to highlight the field.

Table 4-3. Index#highlight parameters

Parameter  Description
:field Defaults to @options[:default_field]. The highlighter only works on one field at a time, so you need to specify which field it is you want to highlight. If you want to highlight multiple fields, you'll need to call this method multiple times.
:excerpt_length Defaults to 150 bytes. This parameter specifies the length of excerpt to show. The algorithm for extracting excerpts attempts to fit as many matched terms into each excerpt as possible. If you’d simply like the complete field back with all matches highlighted, set this parameter to :all.
:num_excerpts Specifies the number of excerpts you wish to retrieve. This defaults to 2, unless :excerpt_length is set to :all, in which case :num_excerpts is automatically set to 1.
:pre_tag To highlight matches, you need to specify short strings to place before and after each match. :pre_tag defaults to <b>, which is fine when printing HTML, but if you are printing results to the console, we recommend using something like \033[36m, the ANSI escape sequence for cyan text.
:post_tag Defaults to </b>. This tag should close whatever you specified in :pre_tag. For console applications, try \033[m, which resets the text attributes.
:ellipsis Defaults, funnily enough, to "...". This string is added to the beginning and end of an excerpt wherever the excerpt breaks in the middle of a field. Alternatively, you may want to use the HTML entity &#8230; or the UTF-8 string \342\200\246.
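
To make the defaults in Table 4-3 concrete, here is a minimal plain-Ruby sketch of how they combine with caller-supplied options. It does not use Ferret at all: DEFAULT_HIGHLIGHT_OPTIONS and resolve_highlight_options are hypothetical names introduced only for illustration, and the :field default is left out because it depends on the index's own @options[:default_field]. The sketch also assumes that :excerpt_length => :all overrides any :num_excerpts the caller passed, which is one reading of "automatically set to 1".

```ruby
# Defaults as listed in Table 4-3. These names are hypothetical and are
# not part of Ferret's API.
DEFAULT_HIGHLIGHT_OPTIONS = {
  :excerpt_length => 150,
  :num_excerpts   => 2,
  :pre_tag        => "<b>",
  :post_tag       => "</b>",
  :ellipsis       => "..."
}.freeze

# Merge caller options over the defaults, then apply the :all rule.
def resolve_highlight_options(user_options = {})
  options = DEFAULT_HIGHLIGHT_OPTIONS.merge(user_options)
  # Assumption: asking for the whole field forces a single excerpt.
  options[:num_excerpts] = 1 if options[:excerpt_length] == :all
  options
end

# Console-friendly tags, everything else left at its default.
console = resolve_highlight_options(:field    => :content,
                                    :pre_tag  => "\033[36m",
                                    :post_tag => "\033[m")

# Whole field with matches highlighted.
whole = resolve_highlight_options(:excerpt_length => :all)
```

Keeping the options in one hash like this is the same pattern the example later in this section uses to avoid repeating parameters on every call.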

The highlight method returns an array of strings, each string being one extracted excerpt. Example 4-1 demonstrates the flexibility of Ferret’s highlighting. We store the optional parameters in a hash to avoid specifying them for each call to the highlight method. We also use a stemming analyzer to demonstrate that phrases don’t need to be exact to match. Don’t worry about how this works just yet. You’ll learn more about analysis in the next chapter.

Example 4-1. Query highlighter

require 'rubygems'
require 'ferret'

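# A StandardAnalyzer whose token stream is additionally passed through a
# StemFilter, so that terms are reduced to their stems before matching.
class MyAnalyzer < Ferret::Analysis::StandardAnalyzer
  def token_stream(field, input)
    Ferret::Analysis::StemFilter.new(super)
  end
end

# Ferret::I is shorthand for Ferret::Index::Index.
index = Ferret::I.new(:analyzer => MyAnalyzer.new)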

index << {
  :title => "Mark Twain Excerpts",
  :content => <<-EOF
 If it had not been for him, with his incendiary "Early to bed and
 early to rise," and all that sort of foolishness, I wouldn't have
 been so harried and worried and raked out of bed at such unseemly
 hours when I was young. The late Franklin was well enough in his
 way; but it would have looked more dignified in him to have gone on
 making candles and letting other people get up when they wanted to.
 - Letter from Mark Twain, San Francisco Alta California, July 25, 1869 

 When one receives a letter from a great man for the first time in
 his life, it is a large event to him, as all of you know by your own
 experience. You never can receive letters enough from famous men
 afterward to obliterate that one, or dim the memory of the pleasant
 surprise it was, and the gratification it gave you.
 - Mark Twain's Speeches, "Unconscious Plagiarism"
EOF
}

options = {
  :field => :content,
  :pre_tag => "\033[36m",
  :post_tag => "\033[m",
  :ellipsis => " \342\200\246 "
}
# <> matches any single term inside a phrase; ~1 allows a phrase slop of 1.
query = '"Early <> Bed" "receive letter"~1 Twain early'

puts "_" * 60 + "\n\t*** Extract two excerpts ***\n\n"
puts index.highlight(query, 0, options)

puts "_" * 60 + "\n\t*** Extract four smaller excerpts ***\n\n"
options[:num_excerpts] = 4
options[:excerpt_length] = 50
puts index.highlight(query, 0, options)

puts "_" * 60 + "\n\t*** Highlight the entire field ***\n\n"
options[:excerpt_length] = :all
puts index.highlight(query, 0, options)

You’ll notice that the second call, which asks for four excerpts of 50 bytes each, actually returns two excerpts of 50 bytes and one of 100 bytes. The excerpting algorithm works by attempting to place the excerpts so that the maximum number of matched terms is shown. If it can concatenate two or more excerpts without reducing the number of matched terms shown, it will.
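
The concatenation behavior can be pictured with a toy model in plain Ruby. This is only an illustration of the idea, not Ferret's actual excerpting algorithm: an excerpt is modeled as a [start, length] pair of byte offsets, merge_adjacent_excerpts is a hypothetical helper, and the offsets are made up for the example. The term-counting step that decides where excerpts are placed is left out entirely.

```ruby
# Toy model only: each excerpt is a [start, length] pair of byte offsets.
# merge_adjacent_excerpts is a hypothetical helper, not a Ferret method.
def merge_adjacent_excerpts(excerpts)
  sorted = excerpts.sort_by { |start, _length| start }
  sorted.each_with_object([]) do |(start, length), merged|
    last = merged.last
    if last && last[0] + last[1] >= start
      # Touching or overlapping: extend the previous excerpt instead of
      # starting a new one.
      last_end = [last[0] + last[1], start + length].max
      last[1] = last_end - last[0]
    else
      merged << [start, length]
    end
  end
end

# Four 50-byte excerpts, two of which sit back to back (made-up offsets).
windows = [[0, 50], [200, 50], [250, 50], [400, 50]]
merged  = merge_adjacent_excerpts(windows)
# => [[0, 50], [200, 100], [400, 50]]
```

In this model, four requested 50-byte excerpts collapse into two of 50 bytes and one of 100 bytes, the same shape as the output described above.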
