Normalization

Finally, normalization refers to a wide variety of common techniques that are used to standardize text. Typical normalization techniques include converting all text to lowercase, removing selected characters, punctuation and other sequences of characters (typically using regular expressions), and expanding abbreviations by applying language-specific dictionaries of common abbreviations and slang terms.

Figure 6.1 illustrates a typical ordered preprocessing pipeline that is used to standardize raw written text:

Figure 6.1: Typical preprocessing pipeline

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.