NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table 2 lists the most important NLTK modules.
Table 2. Language processing tasks and corresponding NLTK modules with examples of functionality
| Language processing task | NLTK modules | Functionality |
|---|---|---|
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format |
NLTK was designed with four primary goals in mind:
- Simplicity
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data
- Consistency
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names
- Extensibility
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
- Modularity
To provide components that can be used independently without needing to understand the rest of the toolkit
Contrasting with these goals are three non-requirements—potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.