Treebank construction

The nltk.corpus package consists of a number of corpus reader classes that can be used to obtain the contents of various corpora.

The Treebank corpus can also be accessed through nltk.corpus. The identifiers of its files can be obtained using fileids():

>>> import nltk
>>> import nltk.corpus
>>> print(str(nltk.corpus.treebank).replace('\\\\','/'))
<BracketParseCorpusReader in 'C:/nltk_data/corpora/treebank/combined'>
>>> nltk.corpus.treebank.fileids()
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg',
'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg',
'wsj_0009.mrg', 'wsj_0010.mrg', 'wsj_0011.mrg', 'wsj_0012.mrg',
'wsj_0013.mrg', 'wsj_0014.mrg', 'wsj_0015.mrg', 'wsj_0016.mrg',
'wsj_0017.mrg', 'wsj_0018.mrg', 'wsj_0019.mrg', ...
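
The reader class shown in the output above, BracketParseCorpusReader, can also be pointed at a directory of your own bracketed parse files. The following is a minimal sketch of that idea, not the book's own example: the file name sample_0001.mrg, the one-sentence tree, and the helper build_sample_reader are all hypothetical and exist only for illustration. It writes a tiny file into a temporary directory, so it does not depend on the Treebank data being installed.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# Hypothetical one-sentence treebank in Penn-style bracketed notation.
SAMPLE = "(S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .))\n"


def build_sample_reader():
    """Write SAMPLE to a temporary directory and return a reader over it."""
    root = tempfile.mkdtemp()
    path = os.path.join(root, "sample_0001.mrg")  # illustrative file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    # The second argument is a regular expression selecting the corpus files.
    return BracketParseCorpusReader(root, r".*\.mrg")


reader = build_sample_reader()
file_ids = reader.fileids()        # identifiers of the files found under root
words = list(reader.words())       # leaf tokens across all files
tree = reader.parsed_sents()[0]    # first sentence as an nltk.Tree

print(file_ids)
print(words)
print(tree.label())
```

As with nltk.corpus.treebank, fileids() lists the files the reader found, while words() and parsed_sents() expose the tokens and the parse trees they contain.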
