Chapter 20. Normalizing Tokens

Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word. Would you search for déjà vu, or just for deja vu?

This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job. Each receives the new token stream as output by the token filter before it.

In That Case

The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX! 1
1

Emits tokens the, quick, brown, fox

It doesn’t matter whether users search for fox or FOX, as long as the same analysis process is applied at query time and at search time. The lowercase filter will transform a query for FOX into a query for fox, which is the same token that we have stored in our inverted index.

To use token filters as part of the analysis process, we

Get Elasticsearch: The Definitive Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.