Tokenizers

A tokenizer breaks input text into tokens, where each token is (usually) a sub-sequence of the characters in the text. You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, which is a child of <analyzer>, for example:

<fieldType name="text" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
    </analyzer>
</fieldType>

In the preceding example, the class attribute names a factory class that instantiates a tokenizer object when needed. Tokenizer factory classes implement org.apache.solr.analysis.TokenizerFactory. You can pass arguments to tokenizer factories by setting attributes in ...
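
As a rough sketch of passing an argument to a tokenizer factory, the fragment below sets the maxTokenLength attribute on the <tokenizer> element of the standard tokenizer. The field type name text_limited and the value 100 are illustrative assumptions, not taken from the original text.

```xml
<!-- Hypothetical field type; the name and the limit are examples only -->
<fieldType name="text_limited" class="solr.TextField">
    <analyzer type="index">
        <!-- maxTokenLength is passed to the factory as an attribute;
             tokens longer than this limit are not emitted whole -->
        <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>
    </analyzer>
</fieldType>
```

Each attribute other than class is handed to the factory as a named argument, so the set of valid attributes depends on the tokenizer factory you choose.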
