This section discusses more advanced concepts, which you may prefer to skip on a first reading of this chapter.
Tokenization is an instance of a more general problem of segmentation. In this section, we will look at two other instances of this problem, which use radically different techniques to the ones we have seen so far in this chapter.
Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:
>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922
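To make clear what this ratio measures, here is a minimal sketch of the same computation on a small hand-built list of tokenized sentences, which needs no corpus download. The function name `average_sentence_length` and the toy data are illustrative assumptions and are not part of NLTK.

```python
def average_sentence_length(sents):
    """Return the mean number of tokens per sentence.

    `sents` is a list of sentences, each of which is a list of tokens,
    the same shape that a corpus reader's sents() method returns.
    """
    if not sents:
        raise ValueError("need at least one sentence")
    total_tokens = sum(len(sent) for sent in sents)
    return total_tokens / len(sents)


# Hypothetical toy data: three sentences of 4, 3, and 5 tokens.
toy_sents = [
    ['The', 'cat', 'sat', '.'],
    ['It', 'purred', '.'],
    ['We', 'left', 'quietly', 'afterwards', '.'],
]

print(average_sentence_length(toy_sents))  # 4.0
```

Punctuation marks count as tokens here, as they do in the Brown Corpus figure, so the result is an average of tokens rather than of words in the everyday sense.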
In other cases, the text is available only as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006). Here is an example of its use in segmenting the text of a novel. (Note that if the segmenter’s internal data has been updated by the time you read this, you will see different output.)
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['"Nonsense!',
 '" said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks ...