Preprocessing pipeline

One of the basic tasks involved in NLP is the preprocessing of your documents in an attempt to standardize the text from different sources as much as possible. Not only does preprocessing help us to standardize text, it often reduces the size of the raw text, thereby reducing the computational complexity of subsequent processes and models. The following subsections describe common preprocessing techniques that may constitute a typical ordered preprocessing pipeline.

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.