What's text normalization?

Text normalization is the process of transforming text into a common form. This is necessary in order to remove insignificant differences between words that should be treated as identical.

Let's take the word déjà-vu as an example.

The word deja-vu is not equal to déjà-vu in a string comparison. Even Déjà-vu is not equal to déjà-vu. Similarly, Michè'le is not equal to Michèle. All these words (that is, tokens) are unequal because Elasticsearch compares them at the byte level. This means that two tokens are considered the same only if they consist of exactly the same bytes.

However, these words have similar meanings. In other words, the same thing is being sought when a user is searching for the word déjà-vu ...
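
The byte-level comparison described above, and one way a normalization step can reconcile such tokens, can be sketched in Python. This is only an illustration and not how Elasticsearch is implemented: the helper name normalize_token is hypothetical, and the choice of lowercasing followed by stripping combining accents is an assumption made for this example. In Elasticsearch itself, this kind of work is typically handled by token filters such as lowercase and asciifolding.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Hypothetical helper: lowercase a token and strip its accents.

    Assumption for this sketch: "common form" means lowercase text
    with combining accent marks removed.
    """
    # Lowercase first so that "Déjà-vu" and "déjà-vu" agree on case.
    lowered = token.lower()
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


tokens = ["deja-vu", "déjà-vu", "Déjà-vu"]

# Before normalization: the UTF-8 byte sequences all differ,
# so a byte-level comparison treats them as three distinct tokens.
raw_forms = {t.encode("utf-8") for t in tokens}

# After normalization: every token collapses to the same form.
normalized_forms = {normalize_token(t) for t in tokens}

print(len(raw_forms))         # number of distinct byte sequences
print(len(normalized_forms))  # number of distinct normalized tokens
```

With this sketch, the three spellings produce three different byte sequences but a single normalized token, so a search for any one of them could match the others. The apostrophe in Michè'le is not handled here; removing punctuation would be a separate normalization step.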