Analyzer

An Analyzer is essentially a TokenStream factory: it builds a TokenStream tailored to each field. To implement an Analyzer, you only need to supply a token_stream method that accepts the following parameters:

field_name

The name of the field to be tokenized (as a Symbol)

input

The text to be tokenized
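
In other words, an Analyzer is any object that responds to token_stream. The sketch below shows the contract in plain Ruby; the CommaAnalyzer and ArrayTokenStream classes are illustrative stand-ins, not part of Ferret's API:

```ruby
# A toy analyzer honoring the token_stream(field_name, input) contract.
# ArrayTokenStream is a stand-in for a real TokenStream.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens
  end

  # Returns the next token, or nil once the stream is exhausted.
  def next
    @tokens.shift
  end
end

class CommaAnalyzer
  # field_name is a Symbol naming the field; input is the raw text.
  # A real analyzer could vary its tokenization based on field_name.
  def token_stream(field_name, input)
    ArrayTokenStream.new(input.split(",").map(&:strip))
  end
end

ts = CommaAnalyzer.new.token_stream(:tags, "ruby, search, ferret")
# yields "ruby", "search", "ferret", then nil
```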

Ferret comes with a number of built-in Analyzers:

  • WhiteSpaceAnalyzer (and AsciiWhiteSpaceAnalyzer)

  • LetterAnalyzer (and AsciiLetterAnalyzer)

  • RegExpAnalyzer

  • StandardAnalyzer (and AsciiStandardAnalyzer)

  • PerFieldAnalyzer

The first three are essentially wrappers around their respective tokenizers, so we won't cover them here except to note that they all lowercase tokens by default.

StandardAnalyzer

The easiest way to describe StandardAnalyzer is to show what it would look like in Ruby code:

module Ferret::Analysis
  class StandardAnalyzer
    def initialize(stop_words = ENGLISH_STOP_WORDS, lower = true)
      @lower = lower
      @stop_words = stop_words
    end

    def token_stream(field, str)
      ts = StandardTokenizer.new(str)        # split str into tokens
      ts = LowerCaseFilter.new(ts) if @lower # downcase each token
      ts = StopFilter.new(ts, @stop_words)   # drop stop words
      ts = HyphenFilter.new(ts)              # handle hyphenated tokens
    end
  end
end
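
You can see what each stage of this chain contributes with simplified stand-ins for the tokenizer and filters. The classes below imitate Ferret's of the same names but are not its real implementations (HyphenFilter is omitted for brevity):

```ruby
# Simplified stand-ins showing how the StandardAnalyzer chain composes.
class StandardTokenizer
  def initialize(str)
    @tokens = str.scan(/[[:alnum:]]+/) # split on non-alphanumerics
  end

  def next
    @tokens.shift # nil once exhausted
  end
end

class LowerCaseFilter
  def initialize(ts)
    @ts = ts
  end

  def next
    token = @ts.next
    token && token.downcase
  end
end

class StopFilter
  def initialize(ts, stop_words)
    @ts = ts
    @stop_words = stop_words
  end

  def next
    while (token = @ts.next)
      return token unless @stop_words.include?(token)
    end
    nil
  end
end

# Compose the chain the same way token_stream does.
ts = StandardTokenizer.new("The Quick Ferret")
ts = LowerCaseFilter.new(ts)
ts = StopFilter.new(ts, %w[the a an])
# yields "quick", then "ferret", then nil: "The" is lowercased
# and then dropped by the stop filter.
```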

The first thing you will probably notice is that StandardAnalyzer uses an English stop-word list by default. If you are indexing documents in another language, you will probably want to supply a different list. If you don't want to use a StopFilter at all, you can either build a custom Analyzer or pass an empty Array as the stop_words parameter:

analyzer = StandardAnalyzer.new([])
