Stemming

Stemming refers to the technique of reducing words to a common base or stem. For example, the words "connection", "connections", "connective", "connected", and "connecting" can all be reduced to their common stem of "connect". Stemming is not a perfect process, and stemming algorithms are liable to make mistakes. However, for the purposes of reducing the size of a dataset in order to train a machine learning model, it is a valuable technique. Using our previous example, our filtered list of stems would be ["Machin", "Learn", "Apach", "Spark"].

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.