TokenStream

A TokenStream takes a field and turns it into a list of tokens. To implement a TokenStream, you need to supply two methods: TokenStream#next, which returns the stream's tokens one at a time, in order, and nil when no tokens remain; and TokenStream#text=, which sets the text that the TokenStream will analyze. In Ferret, there are two types of TokenStreams: Tokenizers and TokenFilters.
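
For example, here is a minimal sketch of a hand-rolled TokenStream that treats each comma-separated value as a token. The class name CSVTokenStream is hypothetical, and the sketch assumes Token.new(text, start, end) builds a token from its text and its start and end offsets; any object that supplies next and text= will do.

require 'rubygems'
require 'ferret'

# Hypothetical TokenStream: emits one token per comma-separated value.
class CSVTokenStream
  # Set the text to analyze and reset the stream to the beginning.
  def text=(text)
    @text = text
    @pos = 0
    text
  end

  # Return the next token, or nil when the text is exhausted.
  def next
    return nil if @text.nil? || @pos >= @text.length
    start = @pos
    comma = @text.index(',', start) || @text.length
    @pos = comma + 1
    Ferret::Analysis::Token.new(@text[start...comma], start, comma)
  end
end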

In the next two sections, we'll use the following helper to test each TokenStream, printing its tokens as a table:

# Print each token in the stream as a row showing its start offset,
# end offset, position increment, and text.
def test_token_stream(token_stream)
  puts "\033[32mStart | End | PosInc | Text\033[m" # header row, in green
  while t = token_stream.next
    puts "%5d |%4d |%5d   | %s" % [t.start, t.end, t.pos_inc, t.text]
  end
end
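
For instance, handing the helper a WhiteSpaceTokenizer (covered in the next section) should print something like the following; the offsets are byte positions into the input, and the output shown is an assumption based on the format string above rather than captured output:

require 'rubygems'
require 'ferret'
include Ferret::Analysis

test_token_stream(WhiteSpaceTokenizer.new("the quick brown fox"))
# Start | End | PosInc | Text
#     0 |   3 |    1   | the
#     4 |   9 |    1   | quick
#    10 |  15 |    1   | brown
#    16 |  19 |    1   | fox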

Tokenizer

Tokenizers take the raw text data from a field and turn it into a list of tokens. Ferret comes with a number of tokenizer implementations (a comparison sketch follows the list), including:

  • WhiteSpaceTokenizer (and AsciiWhiteSpaceTokenizer)

  • LetterTokenizer (and AsciiLetterTokenizer)

  • StandardTokenizer (and AsciiStandardTokenizer)

  • RegExpTokenizer
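
To get a feel for how these differ, here is a sketch comparing three of them on the same input. The tokens in the comments are what each tokenizer should produce by design (whitespace-delimited runs, maximal runs of letters, and letter runs plus special handling for constructs like email addresses); they are assumptions, not output captured from a particular Ferret version:

require 'rubygems'
require 'ferret'
include Ferret::Analysis

input = "email: dave@ferret.com"

# Split on whitespace only; punctuation stays attached to tokens.
test_token_stream(WhiteSpaceTokenizer.new(input))
# => "email:", "dave@ferret.com"

# A token is a maximal run of letters, so punctuation splits tokens.
test_token_stream(LetterTokenizer.new(input))
# => "email", "dave", "ferret", "com"

# Like LetterTokenizer, but with extra rules for constructs such as
# email addresses and URLs.
test_token_stream(StandardTokenizer.new(input))
# => "email", "dave@ferret.com"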

Where an ASCII tokenizer exists, the non-ASCII tokenizer is locale-sensitive: it recognizes letters, numbers, and whitespace as specified by your locale, so if your locale is set to UTF-8, the tokenizer will recognize UTF-8 characters. This means you need to make sure the data you feed Ferret is encoded according to your locale; otherwise, you may run into some strange errors. The ASCII tokenizers, by contrast, ignore your locale and recognize only ASCII characters.
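
Here is a sketch of the difference, assuming your locale uses a UTF-8 encoding; the token boundaries in the comments follow from the definitions above rather than from captured output:

require 'rubygems'
require 'ferret'
include Ferret::Analysis

input = "résumé café"

# Locale-sensitive: under a UTF-8 locale, accented characters count as
# letters, so each word survives intact.
test_token_stream(LetterTokenizer.new(input))
# => "résumé", "café"

# ASCII-only: é is not an ASCII letter, so it splits the words apart.
test_token_stream(AsciiLetterTokenizer.new(input))
# => "r", "sum", "caf"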
