Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.
The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:
>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone ... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very ... well without--Maybe it's always pepper that makes people hot-tempered,'..."""
We could split this raw text on whitespace using raw.split().
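As a quick illustration (output abbreviated), str.split() with no arguments splits on any run of whitespace, so calling it on the raw string above yields:

>>> raw.split()
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
...]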
To do the same using a regular expression, it is not enough to match any space characters in the string, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines:
>>> import re
>>> re.split(r' ', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
...]
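Matching any number of spaces, tabs, or newlines can be done with the character class [ \t\n] and the + quantifier; here is a quick sketch against the same raw string (output abbreviated):

>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
...]

Notice that the tokens 'tone' and 'though),' are now correctly separated. The regular expression r'\s+' behaves the same way here, since \s is shorthand for any whitespace character.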