Removing stop words

Stop words are words that occur very frequently in text but carry little meaning for the analysis, so they should be excluded from the input. Leaving them in adds noise for the algorithm: they contribute almost no contextual meaning, and they increase the dimensionality of your term vectors. It is therefore important to remove them for better model accuracy. Examples of stop words are I, am, is, and the. One way to remove stop words is to keep a precompiled list of them and filter those words out of the document (the text used to train the model).
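
The precompiled-list approach can be sketched in a few lines of plain Python. This is only a minimal illustration, not Spark code: the STOP_WORDS set and the remove_stop_words helper are hypothetical names, and the list holds just the four example words mentioned above, whereas a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the four examples from the text).
# A production list would contain many more entries.
STOP_WORDS = frozenset({"i", "am", "is", "and", "the"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase and tokenize the text, then drop every token in stop_words."""
    # Simple tokenizer (an assumption): runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("I am learning Spark and the model is training")
print(filtered)  # ['learning', 'spark', 'model', 'training']
```

Storing the list as a set keeps each membership check at constant time, which matters when the same filter runs over every token in a large corpus.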

With Spark, we can use the StopWordsRemover transformer, which has its own list of default stop ...
