Summary

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.

  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.

  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. Conditional frequency distributions can be used to count word frequencies for a given context or genre.

  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.

  • Python functions permit you to associate a name with a particular block of code, and reuse that code as often as necessary.

  • Some functions, known as “methods,” are associated with an object, and we give the object name followed by a period followed by the method name, like this: x.funct(y), e.g., word.isalpha().

  • To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.

  • WordNet is a semantically oriented dictionary of English, consisting of synonym sets—or synsets—and organized into a network.

  • Some functions are not available by default, but must be accessed using Python’s import statement.
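
As a rough illustration of several points above (a conditional frequency distribution, a named function, method calls, and an import statement), here is a minimal sketch that uses only the Python standard library. It is not NLTK's own conditional frequency distribution class; the function name conditional_freq and the toy data in samples are hypothetical and stand in for a categorized corpus.

```python
# The import statement makes Counter and defaultdict available by name.
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Count how often each word occurs under each condition.

    `pairs` is an iterable of (condition, word) tuples.
    This is a hypothetical helper, not part of NLTK.
    """
    cfd = defaultdict(Counter)  # one frequency distribution per condition
    for condition, word in pairs:
        # Method call: object name, then a period, then the method name.
        if word.isalpha():
            cfd[condition][word.lower()] += 1
    return cfd


# Hypothetical toy data: (genre, word) pairs instead of a real corpus.
samples = [
    ("news", "The"), ("news", "jury"), ("news", "said"), ("news", "the"),
    ("romance", "She"), ("romance", "could"), ("romance", "the"), ("romance", "!"),
]

cfd = conditional_freq(samples)
print(cfd["news"]["the"])     # count of "the" under the "news" condition
print(cfd["romance"]["the"])  # count of "the" under the "romance" condition
```

Because the counting logic lives in a function, the same code can be reused on any other sequence of (condition, word) pairs. Saved to a file with a .py extension, it could also be loaded from another program with an import statement.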
