Chapter 3. Advanced Indexing

So far, we've taken a black-box approach to Ferret. This chapter explains what is really going on during indexing and, in the process, shows how to tune your index for maximum performance. We conclude by explaining how locking works. It is crucial that you understand this, particularly if you want to run Ferret in a multithreaded or multiprocess environment.

How the Indexing Process Works

We are now going to show how a source document (such as an HTML document from the Web, a row from a database, or an image from your personal image collection) becomes a Ferret document stored in the index. Ferret is agnostic about the source document's type. Whether you are indexing an MP3 file, a text document, or a product from your store, Ferret treats it as a collection of string fields. So, the first step is to turn source documents into Documents. This is pretty easy with plain-text documents. With other text document types, such as PDF or HTML, you'll need to write a parser/reader that extracts the searchable text from the documents. For an image file, you might have a parser that extracts EXIF tags. Database rows usually map pretty easily to Documents. See Chapter 6 for a framework for doing exactly this.
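To make the "collection of string fields" idea concrete, here is a minimal sketch in plain Ruby. It deliberately does not call Ferret itself: the helper names `row_to_fields` and `html_to_fields`, the field names, and the crude regex-based tag stripping are all assumptions made for illustration, not part of Ferret's API. A real HTML reader would use a proper parser.

```ruby
# Hypothetical helper (not part of Ferret): flatten a database-style row
# into a hash of string fields, skipping columns that are NULL.
def row_to_fields(row)
  row.each_with_object({}) do |(name, value), fields|
    next if value.nil?
    fields[name.to_sym] = value.to_s
  end
end

# Hypothetical helper (not part of Ferret): pull searchable text out of an
# HTML string. The regexes are a crude stand-in for a real HTML parser.
def html_to_fields(html)
  title = html[/<title>(.*?)<\/title>/mi, 1].to_s.strip
  # Replace every tag with a space, then collapse runs of whitespace.
  # Note that the title text also ends up in :content.
  content = html.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  { title: title, content: content }
end

product = row_to_fields("id" => 42, "name" => "Widget", "price" => 9.5, "notes" => nil)
page    = html_to_fields("<html><head><title>Ferret</title></head>" \
                         "<body><p>Fast search</p></body></html>")
```

Whatever the source type, the result has the same shape: field names mapped to strings. That uniform shape is what lets the indexing stage ignore where a document came from.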

Once you have a Document, you add it to an IndexWriter. This is where the magic begins. The Document's fields are passed through an analyzer (if they are set to be tokenized) that breaks up the fields into searchable tokens however ...
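The tokenizing step can be pictured with a toy sketch in plain Ruby. This is not Ferret's analyzer: `toy_analyze`, `analyze_fields`, and the `tokenized:` option are hypothetical names, and the lowercase-and-split rule is just one simple assumption about what an analyzer might do.

```ruby
# Toy stand-in for an analyzer (an assumption, not Ferret's implementation):
# lowercase the text and keep runs of ASCII letters and digits as tokens.
def toy_analyze(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Hypothetical helper: tokenize only the fields listed in `tokenized`;
# every other field is kept whole as a single term.
def analyze_fields(fields, tokenized: [])
  fields.each_with_object({}) do |(name, value), out|
    out[name] = tokenized.include?(name) ? toy_analyze(value) : [value]
  end
end

terms = analyze_fields(
  { title: "Advanced Indexing", id: "DOC-42" },
  tokenized: [:title]
)
```

In this sketch the `:title` field becomes separate lowercase tokens, while the untokenized `:id` field stays a single exact-match term.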


Ferret by David Balmain
