Creating an analyzer

An analyzer's job is to analyze text. It enforces configured policies (IndexWriterConfig) on how index terms are extracted and tokenized from raw text input. The output of an Analyzer is a stream of indexable tokens ready to be processed by the indexer. This step is necessary to keep the indexed data and the search functionality consistent with each other. Also, note that Lucene only accepts plain text. Whatever your data type might be, whether XML, HTML, or PDF, you need to parse these documents into text before handing them over to Lucene.

Imagine you have this piece of text: "Lucene is an information retrieval library written in Java." An analyzer will tokenize this text, manipulate the data to conform to a certain data formatting policy (for ...
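To make the idea of analysis concrete, here is a minimal conceptual sketch in plain Java. It does not use the Lucene Analyzer API. The class name AnalyzerSketch, the analyze method, the lowercasing step, and the small stop-word list are all assumptions made for illustration, standing in for whatever formatting policy a real analyzer is configured with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual sketch only: plain Java, NOT the Lucene Analyzer API.
class AnalyzerSketch {

    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "an", "in"));

    // Turns raw text into a list of normalized tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: tokenize by splitting on runs of characters
        // that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split can yield an empty first element
            }
            // Step 2: normalize (lowercasing is an assumed policy).
            String token = raw.toLowerCase(Locale.ROOT);
            // Step 3: filter out the assumed stop words.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene is an information retrieval library written in Java.";
        // prints [lucene, information, retrieval, library, written, java]
        System.out.println(analyze(text));
    }
}
```

The sketch shows the general shape of the process: raw text goes in, and a sequence of normalized tokens comes out. A real Lucene analyzer produces its tokens as a stream and applies whichever tokenizer and filters it was built with, so the exact output may differ from this illustration.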
