Removing punctuation and symbols can be a difficult problem. While punctuation adds no value to many searches, it sometimes needs to be kept as part of a token. Consider searching a job site for C# programming positions, as in this recipe's example. Tokenization splits C# into two tokens:
>>> word_tokenize("C#")
['C', '#']
We actually have two problems here. With C and # separated into distinct tokens, we have lost the knowledge that C# appeared in the source content. And if we then strip the # from the token stream, the information is gone for good, since we can no longer reconstruct C# from adjacent tokens.
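One way around this is to tokenize with a custom pattern that keeps symbol characters attached to the word they belong to. The sketch below uses a plain regular expression for illustration; the pattern (letters and digits optionally followed by `#` or `+`, so that tokens like C# and C++ survive intact) is an assumption of this example, not a rule built into `word_tokenize`.

```python
import re

# Illustrative pattern (an assumption, not NLTK's default behavior):
# a run of word characters may be followed by trailing '#' or '+' signs,
# so "C#" and "C++" stay single tokens; any other non-space character
# becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9_]+(?:[#+]+)?|\S")

def tokenize(text):
    """Tokenize text while preserving symbols attached to words."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Senior C# developer"))  # ['Senior', 'C#', 'developer']
```

NLTK's `RegexpTokenizer` accepts a pattern in the same spirit, so the same idea can be expressed with `RegexpTokenizer(r"[A-Za-z0-9_]+(?:[#+]+)?|\S")` if you prefer to stay within the NLTK API.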