Custom Analysis

The Tokenizers and Analyzers that come with Ferret cover most needs. Occasionally, though, Ferret’s standard Analysis classes fall short and you need to implement your own. In the following example, we’ll show you how to build an Analyzer that automatically pads numbers to a fixed width so that they sort correctly when used with a RangeQuery or RangeFilter:

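require 'ferret'

module Ferret::Analysis
  # Emits a single token: the input number zero-padded to a fixed width.
  class IntegerTokenizer
    def initialize(num, width)
      @num = num.to_i
      @width = width
    end

    # Returns the padded token on the first call and nil afterward,
    # which signals the end of the token stream.
    def next
      token = Token.new("%0#{@width}d" % @num, 0, @width) if @num
      @num = nil
      return token
    end

    # Resets the tokenizer with new input so the stream can be reused.
    def text=(text)
      @num = text.to_i
    end
  end

  # Analyzer that hands every input to an IntegerTokenizer.
  class IntegerAnalyzer
    def initialize(width)
      @width = width
    end

    def token_stream(field, input)
      return IntegerTokenizer.new(input, @width)
    end
  end
end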

include Ferret::Analysis

# Use StandardAnalyzer by default, but IntegerAnalyzer for the :padded field.
analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
analyzer[:padded] = IntegerAnalyzer.new(5)

# Index the same numbers in a padded and an unpadded field.
index = Ferret::I.new(:analyzer => analyzer)
[5, 50, 500, 5000, 50000].each do |number|
  index << {:padded => number, :unpadded => number}
end

# Run the same range query against both fields.
puts "padded: " + index.search('padded:[10 10000]').to_s
puts "unpadded: " + index.search('unpadded:[10 10000]').to_s

If you run this example, you’ll see that the RangeQuery worked on the :padded field, even though you didn’t explicitly pad the numbers as you added them to the field or the numbers in the RangeQuery itself. The :unpadded field query, on the other hand, failed miserably.
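
To see why padding makes the difference, it helps to reproduce the comparison outside of Ferret. The following is a minimal sketch in plain Ruby, assuming only that range bounds and indexed terms are compared as strings rather than as numbers. The helper names pad and in_range are made up for this illustration and are not part of Ferret:

```ruby
# Illustrative helpers only; neither method is part of Ferret.
WIDTH = 5

# Zero-pad a number to a fixed width, as IntegerTokenizer does.
def pad(number, width)
  "%0#{width}d" % number.to_i
end

# Keep the terms that fall inside [lower, upper] using string comparison.
def in_range(terms, lower, upper)
  terms.select { |term| term >= lower && term <= upper }
end

numbers = [5, 50, 500, 5000, 50000]

unpadded_terms = numbers.map(&:to_s)
padded_terms   = numbers.map { |n| pad(n, WIDTH) }

# Every unpadded term starts with "5", which sorts after "10000".
unpadded_hits = in_range(unpadded_terms, "10", "10000")

# Padded terms and padded bounds sort in numeric order.
padded_hits = in_range(padded_terms, pad(10, WIDTH), pad(10000, WIDTH))

puts "unpadded: #{unpadded_hits.inspect}"  # => []
puts "padded:   #{padded_hits.inspect}"    # => ["00050", "00500", "05000"]
```

Note that padding only helps while every number fits within the chosen width. With a width of 5, pad(123456, 5) returns "123456", which sorts before "50000" as a string, so pick a width large enough for the biggest value you expect to index.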

In this chapter, we have covered ...
