Finally, normalization refers to a broad family of techniques used to standardize text. Typical normalization techniques include converting all text to lowercase; removing selected characters, punctuation, and other character sequences (typically with regular expressions); and expanding abbreviations and slang terms by looking them up in language-specific dictionaries.
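As a minimal sketch of these three techniques, the Python function below lowercases the text, strips punctuation with a regular expression, and expands abbreviations from a lookup table. The function name `normalize` and the small `ABBREVIATIONS` dictionary are illustrative assumptions, not part of any particular library; a real system would use a much larger, language-specific dictionary.

```python
import re

# Hypothetical, deliberately tiny English abbreviation/slang dictionary.
ABBREVIATIONS = {
    "btw": "by the way",
    "thx": "thanks",
    "u": "you",
}


def normalize(text, abbreviations=ABBREVIATIONS):
    """Lowercase, strip punctuation, and expand known abbreviations."""
    # 1. Convert all text to lowercase.
    text = text.lower()
    # 2. Replace every character that is not a letter, digit, underscore,
    #    or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # 3. Expand abbreviations token by token. split() with no arguments
    #    also collapses runs of whitespace left behind by step 2.
    tokens = [abbreviations.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(normalize("BTW, thx for the help!!! U rock."))
# prints: by the way thanks for the help you rock
```

In this sketch punctuation is replaced with a space rather than deleted, so a hyphenated form such as "e-mail" becomes two tokens instead of being fused into one; whether that is desirable depends on the application.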
Figure 6.1 illustrates a typical ordered preprocessing pipeline that is used to standardize raw written text: