Defining custom tokenizers

Although Lucene ships with several excellent built-in tokenizers, you may still find that you need one to behave slightly differently. In that case, you will have to build a custom Tokenizer. Lucene provides a character-based tokenizer, CharTokenizer, that should suit most tokenization needs: you override its isTokenChar method to decide which characters are part of a token and which are delimiters. Note that both LetterTokenizer and WhitespaceTokenizer extend CharTokenizer.
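To make the isTokenChar contract concrete, here is a minimal, self-contained sketch in plain Java (no Lucene dependency). The tokenize helper and class name are ours, not Lucene API; the loop mimics how a CharTokenizer-style tokenizer consumes input, keeping runs of "token characters" and discarding everything else:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class CharTokenizerSketch {

    // Splits input into tokens: maximal runs of characters for which
    // isTokenChar returns true. Any other character acts as a delimiter.
    static List<String> tokenize(String input, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int c = input.codePointAt(i);
            if (isTokenChar.test(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene 4.x rocks!";
        // LetterTokenizer-style predicate: only letters are token characters.
        System.out.println(tokenize(text, Character::isLetter));
        // → [Lucene, x, rocks]
        // WhitespaceTokenizer-style predicate: anything non-whitespace.
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)));
        // → [Lucene, 4.x, rocks!]
    }
}
```

Swapping the predicate is all it takes to change the tokenizer's behavior, which is why LetterTokenizer and WhitespaceTokenizer can both be thin subclasses of CharTokenizer.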

How to do it…

In this example, we will create our own tokenizer that splits text on the space character only. It is similar to WhitespaceTokenizer, but this one treats other whitespace, such as tabs and newlines, as part of a token rather than as a delimiter.
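In a CharTokenizer subclass, the entire recipe boils down to an isTokenChar override returning `c != ' '` (shown in the comment below). Since that subclass needs the Lucene jars to compile, here is a self-contained plain-Java demonstration of the resulting behavior, using java.util.StringTokenizer with the space character as the only delimiter as a stand-in; the class and method names are ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class SpaceOnlyDemo {

    // In the Lucene 4 subclass, the override would read:
    //   @Override
    //   protected boolean isTokenChar(int c) {
    //       return c != ' ';   // only the plain space delimits tokens
    //   }
    static List<String> spaceOnlyTokens(String text) {
        List<String> tokens = new ArrayList<>();
        // " " is the sole delimiter, so tabs and newlines stay in tokens.
        StringTokenizer st = new StringTokenizer(text, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The tab survives inside the first token;
        // WhitespaceTokenizer would have split on it.
        System.out.println(spaceOnlyTokens("red\tblue green"));
        // → [red	blue, green]
    }
}
```

The sketch shows the key difference from WhitespaceTokenizer: "red\tblue" comes back as a single token because only the space character is treated as a delimiter.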
