If you have a large corpus of text, you may want to divide it into categories. This helps with organization, and it is a prerequisite for text classification, which is covered in Chapter 7, Text Classification. The Brown corpus, for example, is split into a number of categories, as shown in the following code:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
In this recipe, we'll learn how to create our own categorized text corpus.
The easiest way to categorize a corpus is to have one file for each category. The following ...
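As a rough illustration of the one-file-per-category layout, the sketch below writes two small files to a temporary directory and reads them with NLTK's CategorizedPlaintextCorpusReader. The file names (movie_pos.txt, movie_neg.txt), their contents, and the pattern that pulls the category out of each file name are all assumptions made for this example, not part of the original recipe text.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical sample data: one file per category, with the category
# name embedded in the file name (pos, neg).
samples = {
    "movie_pos.txt": "A thrilling and heartfelt film.\n",
    "movie_neg.txt": "A dull and predictable film.\n",
}

# Write the sample files into a throwaway corpus directory.
root = tempfile.mkdtemp()
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(text)

# The second argument selects which files belong to the corpus.
# cat_pattern is a regular expression whose first group is taken as
# the category of each matching file.
reader = CategorizedPlaintextCorpusReader(
    root,
    r"movie_.*\.txt",
    cat_pattern=r"movie_(\w+)\.txt",
)

categories = reader.categories()
neg_files = reader.fileids(categories=["neg"])
pos_words = list(reader.words(categories=["pos"]))

print(categories)  # ['neg', 'pos']
print(neg_files)   # ['movie_neg.txt']
print(pos_words)
```

Deriving the category from the file name keeps the corpus self-describing: adding a new category only requires adding a new file that matches the pattern.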