Text preprocessing

Preprocessing the data is the process of cleaning and preparing the text for classification and derivation of meaning. Since our data may have a lot of noise, uninformative parts, such as HTML tags, need to be eliminated or re-aligned. At the word level, there might be many words that do not make much impact on the overall semantic of the textual context. Text preprocessing involves a few steps, such as extraction, tokenization, stop words removal, text enrichment, and normalization with stemming and lemmatization. In addition to these, some of the basic and generic techniques that improve accuracy involve converting the text to lower case, removing numbers (based on the context), removing punctuation, stripping white spaces ...

Get Artificial Intelligence for Big Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.