Chapter 5. Analysis

Analysis is the foundation of any search library. It is the process of taking an input field and breaking it up into tokens to be added to the inverted index. So, why did we wait until now to cover this important subject? Most of the time, Ferret's standard analyzer will do exactly what you need it to do. However, when it doesn't, Ferret's analysis API is very easy to extend to your needs. To understand the analysis API, you need to know about three classes:

Token
TokenStream
Analyzer

Token

The Token is the basic datatype in analysis. It is essentially a Struct with four attributes:

Text
Start offset
End offset
Position increment
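
As a rough illustration of the four attributes listed above, a plain Ruby Struct with the same shape might look like the following. The name SketchToken and the member names are assumptions made for this sketch; Ferret provides its own Token class, and the position increment value used here is only a placeholder.

```ruby
# Hypothetical stand-in for illustration only; not Ferret's own Token class.
# Member names are assumptions chosen to mirror the four attributes above.
SketchToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# The position increment of 1 is a placeholder value for this sketch.
token = SketchToken.new("Old", 4, 7, 1)

puts token.text
puts token.end_offset - token.start_offset
```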

The text attribute is a String holding the token's text. Ferret allows tokens up to 255 bytes long; any text longer than that is truncated to that length.
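
The byte-based truncation described above can be approximated in plain Ruby. This is a minimal sketch, not Ferret's implementation: the helper name truncate_token_text and the constant MAX_TOKEN_BYTES are hypothetical, and only the 255-byte limit comes from the text.

```ruby
# Limit taken from the text above; the constant and helper names are
# hypothetical and exist only for this sketch.
MAX_TOKEN_BYTES = 255

# Returns the text unchanged when it fits, otherwise its first `limit` bytes.
def truncate_token_text(text, limit = MAX_TOKEN_BYTES)
  return text if text.bytesize <= limit

  text.byteslice(0, limit)
end

puts truncate_token_text("a" * 300).bytesize
puts truncate_token_text("sea")
```

Because the limit is counted in bytes rather than characters, a cut of this kind can land in the middle of a multibyte character. How Ferret handles that case is not stated here, so the sketch makes no attempt to address it.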

The start and end offsets hold the byte positions of the start and end of the token in the original field, the end being the byte immediately after the last byte in the token. For example, in the string "The Old Man and the Sea", the "Old" token has a start offset of 4 and an end offset of 7. The difference between the start offset and the end offset is usually equal to the length of the token's text, but not always. For example, Ferret's standard analyzer strips possessives ('s). In the field "Jamie's Kitchen", for instance, the first token will be "Jamie" but the start and end offsets will be 0 and 7, respectively, also encompassing ...
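
The offset arithmetic in the two examples above can be reproduced with a toy tokenizer. This is a sketch under stated assumptions: it splits on whitespace and strips a trailing 's, which is far simpler than Ferret's standard analyzer, and the names ToyToken and toy_tokens are hypothetical.

```ruby
# Hypothetical stand-in for a token; Ferret provides its own Token class.
ToyToken = Struct.new(:text, :start_offset, :end_offset)

# Toy whitespace tokenizer (NOT Ferret's standard analyzer).
# Offsets are byte positions in the original field; the end offset is the
# byte immediately after the last byte of the matched word.
def toy_tokens(field)
  tokens = []
  field.scan(/\S+/) do |word|
    start_offset = Regexp.last_match.pre_match.bytesize
    end_offset = start_offset + word.bytesize
    # Strip a trailing possessive from the text only; the offsets still
    # span the full word as it appeared in the field.
    tokens << ToyToken.new(word.sub(/'s\z/, ""), start_offset, end_offset)
  end
  tokens
end

toy_tokens("The Old Man and the Sea").each { |t| p t.to_a }
toy_tokens("Jamie's Kitchen").each { |t| p t.to_a }
```

For the second field, the first token's text is five bytes long while its offsets span seven bytes, which is the mismatch between text length and offset difference described above.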

Ferret by David Balmain
