Natural Language Toolkit (NLTK)

NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table 2 lists the most important NLTK modules.

Table 2. Language processing tasks and corresponding NLTK modules with examples of functionality

| Language processing task   | NLTK modules               | Functionality                                               |
|----------------------------|----------------------------|-------------------------------------------------------------|
| Accessing corpora          | nltk.corpus                | Standardized interfaces to corpora and lexicons             |
| String processing          | nltk.tokenize, nltk.stem   | Tokenizers, sentence tokenizers, stemmers                   |
| Collocation discovery      | nltk.collocations          | t-test, chi-squared, point-wise mutual information          |
| Part-of-speech tagging     | nltk.tag                   | n-gram, backoff, Brill, HMM, TnT                            |
| Classification             | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means   |
| Chunking                   | nltk.chunk                 | Regular expression, n-gram, named entity                    |
| Parsing                    | nltk.parse                 | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation    | nltk.sem, nltk.inference   | Lambda calculus, first-order logic, model checking          |
| Evaluation metrics         | nltk.metrics               | Precision, recall, agreement coefficients                   |
| Probability and estimation | nltk.probability           | Frequency distributions, smoothed probability distributions |
| Applications               | nltk.app, nltk.chat        | Graphical concordancer, parsers, WordNet browser, chatbots  |
| Linguistic fieldwork       | nltk.toolbox               | Manipulate data in SIL Toolbox format                       |
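
As a minimal sketch of how a few of these modules fit together, the following interpreter session exercises nltk.corpus, nltk.tokenize, and nltk.probability. It assumes the gutenberg corpus and the punkt tokenizer models have already been fetched with nltk.download().

>>> from nltk.corpus import gutenberg          # standardized corpus interface
>>> from nltk import word_tokenize, FreqDist   # from nltk.tokenize and nltk.probability
>>> emma = gutenberg.words('austen-emma.txt')  # word list for one Gutenberg text
>>> tokens = word_tokenize("NLTK provides tokenizers, stemmers, and taggers.")
>>> fd = FreqDist(w.lower() for w in emma)     # frequency distribution over tokens
>>> top = fd.most_common(5)                    # five most frequent tokens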

NLTK was designed with four primary goals in mind:

Simplicity

To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data.

Consistency

To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names (a small sketch follows this list).

Extensibility

To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.

Modularity

To provide components that can be used independently without needing to understand the rest of the toolkit.
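
The consistency goal is easiest to see in the corpus readers: every reader in nltk.corpus exposes the same guessable methods, so code written against one corpus carries over to another. A small sketch, assuming the brown and gutenberg data packages have been downloaded:

>>> from nltk.corpus import brown, gutenberg
>>> words = brown.words()                       # same method on every corpus reader
>>> sents = gutenberg.sents('austen-emma.txt')  # likewise for sentence access
>>> tagged = brown.tagged_words()               # tagged corpora add tagged_words()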

Contrasting with these goals are three non-requirements—potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.
