Stemming and lemmatization

Text documents can contain words in different forms, such as play, playing, and played. They are similar and they have a common root.

Stemming and lemmatization are techniques that are used to find these common roots. Finding the roots will help us count, play, playing, and played as a single entity as all the words talk about play.

Stemming is more of a crude form of arriving at the root of a word; so, in the case of the preceding example, playing would be reduced to play. Lemmatization brings into context words, such as worse and bad, that can have a common bad root.

Stemming

Stemming is a process of reducing a word to its root form. The root form is not a word by itself, but words can be formed by adding the right suffix ...

Get Mastering Python for Data Science now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.