Analyzer
An Analyzer is basically a TokenStream factory that builds TokenStreams specific to each field. To implement an Analyzer, you simply need to supply a token_stream method that accepts the following parameters:
- field_name: the name of the field to be tokenized (as a symbol)
- input: the text to be tokenized
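To make the pattern concrete, here is a minimal self-contained sketch of a custom Analyzer. The SimpleTokenizer and DowncaseFilter classes are illustrative stand-ins written for this example, not Ferret's actual classes; only the token_stream method signature follows the description above:

```ruby
# Illustrative tokenizer: splits the input on whitespace and hands
# tokens back one at a time, returning nil when exhausted.
class SimpleTokenizer
  def initialize(str)
    @tokens = str.split
  end

  def next
    @tokens.shift
  end
end

# Illustrative filter: wraps another token stream and lowercases
# each token it yields.
class DowncaseFilter
  def initialize(token_stream)
    @ts = token_stream
  end

  def next
    token = @ts.next
    token && token.downcase
  end
end

# The Analyzer itself is just a factory: given a field name and the
# raw text, it returns a chain of token streams for that field.
class MyAnalyzer
  def token_stream(field_name, input)
    DowncaseFilter.new(SimpleTokenizer.new(input))
  end
end

ts = MyAnalyzer.new.token_stream(:title, "Hello Ferret World")
tokens = []
while (t = ts.next)
  tokens << t
end
tokens # => ["hello", "ferret", "world"]
```

The field_name parameter is ignored here, but a real analyzer could use it to pick a different tokenizer chain per field, which is exactly what PerFieldAnalyzer does.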
Ferret comes with a number of built-in Analyzers:

- WhiteSpaceAnalyzer (and AsciiWhiteSpaceAnalyzer)
- LetterAnalyzer (and AsciiLetterAnalyzer)
- RegExpAnalyzer
- StandardAnalyzer (and AsciiStandardAnalyzer)
- PerFieldAnalyzer

The first three are basically wrappers for their respective tokenizers, so we won't cover them here except to mention that they all lowercase by default.
StandardAnalyzer
The easiest way to describe StandardAnalyzer is to show what it would look like in Ruby code:
```ruby
module Ferret::Analysis
  class StandardAnalyzer
    def initialize(stop_words = ENGLISH_STOP_WORDS, lower = true)
      @lower = lower
      @stop_words = stop_words
    end

    def token_stream(field, str)
      ts = StandardTokenizer.new(str)
      ts = LowerCaseFilter.new(ts) if @lower
      ts = StopFilter.new(ts, @stop_words)
      ts = HyphenFilter.new(ts)
    end
  end
end
```
The first thing you will probably notice is that StandardAnalyzer uses an English list of stop words. If you are indexing in a different language, then you will probably want to use a different list of stop words. If you don't want to use a StopFilter at all, you can either build a custom Analyzer or pass an empty Array as the stop words parameter:
analyzer = StandardAnalyzer ...
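The effect of an empty stop list can be sketched in plain Ruby. SimpleStopFilter below is a simplified stand-in written for this example, not Ferret's actual StopFilter, but it shows why an empty Array disables stopping: there is nothing to match against, so every token passes through:

```ruby
# Simplified stand-in for a stop filter: drops any token that
# appears in the stop list, passes everything else through.
class SimpleStopFilter
  def initialize(tokens, stop_words)
    @tokens = tokens
    @stop_words = stop_words
  end

  def to_a
    @tokens.reject { |t| @stop_words.include?(t) }
  end
end

words = ["the", "quick", "brown", "fox"]
filtered   = SimpleStopFilter.new(words, ["the"]).to_a # => ["quick", "brown", "fox"]
unfiltered = SimpleStopFilter.new(words, []).to_a      # => all four tokens survive
```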