Analyzer

An Analyzer is essentially a TokenStream factory: it builds a TokenStream tailored to each field. To implement an Analyzer, you only need to supply a token_stream method that accepts the following parameters:

field_name

The name of the field to be tokenized (as a Symbol)

input

The text to be tokenized
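
In other words, an Analyzer is any object that responds to token_stream. The sketch below shows the contract in plain Ruby; the CommaAnalyzer and ArrayTokenStream classes are illustrative stand-ins, not part of Ferret's API:

```ruby
# A toy analyzer honoring the token_stream(field_name, input) contract.
# ArrayTokenStream is a stand-in for a real TokenStream.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens
  end

  # Returns the next token, or nil once the stream is exhausted.
  def next
    @tokens.shift
  end
end

class CommaAnalyzer
  # field_name is a Symbol naming the field; input is the raw text.
  # A real analyzer could vary its tokenization based on field_name.
  def token_stream(field_name, input)
    ArrayTokenStream.new(input.split(",").map(&:strip))
  end
end

ts = CommaAnalyzer.new.token_stream(:tags, "ruby, search, ferret")
# yields "ruby", "search", "ferret", then nil
```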

Ferret comes with a number of built-in Analyzers:

  • WhiteSpaceAnalyzer (and AsciiWhiteSpaceAnalyzer)

  • LetterAnalyzer (and AsciiLetterAnalyzer)

  • RegExpAnalyzer

  • StandardAnalyzer (and AsciiStandardAnalyzer)

  • PerFieldAnalyzer

The first three are essentially wrappers around their respective tokenizers, so we won't cover them here except to note that they all lowercase tokens by default.

StandardAnalyzer

The easiest way to describe StandardAnalyzer is to show what it would look like in Ruby code:

module Ferret::Analysis
  class StandardAnalyzer
    def initialize(stop_words = ENGLISH_STOP_WORDS, lower = true)
      @lower = lower
      @stop_words = stop_words
    end

    def token_stream(field, str)
      ts = StandardTokenizer.new(str)        # split str into tokens
      ts = LowerCaseFilter.new(ts) if @lower # downcase each token
      ts = StopFilter.new(ts, @stop_words)   # drop stop words
      ts = HyphenFilter.new(ts)              # handle hyphenated tokens
    end
  end
end
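
You can see what each stage of this chain contributes with simplified stand-ins for the tokenizer and filters. The classes below imitate Ferret's of the same names but are not its real implementations (HyphenFilter is omitted for brevity):

```ruby
# Simplified stand-ins showing how the StandardAnalyzer chain composes.
class StandardTokenizer
  def initialize(str)
    @tokens = str.scan(/[[:alnum:]]+/) # split on non-alphanumerics
  end

  def next
    @tokens.shift # nil once exhausted
  end
end

class LowerCaseFilter
  def initialize(ts)
    @ts = ts
  end

  def next
    token = @ts.next
    token && token.downcase
  end
end

class StopFilter
  def initialize(ts, stop_words)
    @ts = ts
    @stop_words = stop_words
  end

  def next
    while (token = @ts.next)
      return token unless @stop_words.include?(token)
    end
    nil
  end
end

# Compose the chain the same way token_stream does.
ts = StandardTokenizer.new("The Quick Ferret")
ts = LowerCaseFilter.new(ts)
ts = StopFilter.new(ts, %w[the a an])
# yields "quick", then "ferret", then nil: "The" is lowercased
# and then dropped by the stop filter.
```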

The first thing you will probably notice is that StandardAnalyzer uses an English stop-word list by default. If you are indexing documents in another language, you will probably want to supply a different list. If you don't want to use a StopFilter at all, you can either build a custom Analyzer or pass an empty Array as the stop_words parameter:

analyzer = StandardAnalyzer.new([])
