Setting up a custom corpus
A corpus is a collection of text documents; corpora is the plural. The word comes from the Latin for body, in this case a body of text. So a custom corpus is really just a set of text files in a directory, often alongside many other directories of text files.
Getting ready
You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, and Mac OS X.
How to do it...
NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so that NLTK can find them. In order to avoid conflict with the ...
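As a rough illustration of the search-path idea, the sketch below reads nltk.data.path and checks whether a directory falls under one of those paths. The helper names (nltk_search_paths, is_under_search_path, make_corpus_dir) and the corpus name used later are hypothetical and not part of NLTK. The corpora subdirectory layout mirrors how the NLTK data package is organised and is an assumption here, not something this recipe has stated.

```python
import os


def nltk_search_paths():
    """Return a copy of NLTK's data search paths (requires nltk)."""
    # Imported lazily so the helpers below can be used without nltk.
    import nltk.data
    return list(nltk.data.path)


def is_under_search_path(directory, search_paths):
    """Return True if `directory` equals, or is nested inside, a search path."""
    target = os.path.realpath(directory)
    for base in search_paths:
        base = os.path.realpath(base)
        try:
            if os.path.commonpath([target, base]) == base:
                return True
        except ValueError:
            # Raised on Windows when the two paths are on different drives.
            continue
    return False


def make_corpus_dir(data_dir, corpus_name):
    """Create <data_dir>/corpora/<corpus_name> and return its path.

    The 'corpora' subdirectory is an assumed layout, following the
    convention of the NLTK data package.
    """
    path = os.path.join(data_dir, "corpora", corpus_name)
    os.makedirs(path, exist_ok=True)
    return path
```

With NLTK installed, calling is_under_search_path(make_corpus_dir(data_dir, "my_corpus"), nltk_search_paths()) for some data_dir of your choosing would report whether the new directory is somewhere NLTK will look; "my_corpus" is only a placeholder name.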