You are previewing Natural Language Processing with Python.

Natural Language Processing with Python

Cover of Natural Language Processing with Python by Ewan Klein... Published by O'Reilly Media, Inc.
  1. Natural Language Processing with Python
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Audience
    2. Emphasis
    3. What You Will Learn
    4. Organization
    5. Why Python?
    6. Software Requirements
    7. Natural Language Toolkit (NLTK)
    8. For Instructors
    9. Conventions Used in This Book
    10. Using Code Examples
    11. Safari® Books Online
    12. How to Contact Us
    13. Acknowledgments
    14. Royalties
  4. 1. Language Processing and Python
    1. Computing with Language: Texts and Words
      1. Getting Started with Python
      2. Getting Started with NLTK
      3. Searching Text
      4. Counting Vocabulary
    2. A Closer Look at Python: Texts as Lists of Words
      1. Lists
      2. Indexing Lists
      3. Variables
      4. Strings
    3. Computing with Language: Simple Statistics
      1. Frequency Distributions
      2. Fine-Grained Selection of Words
      3. Collocations and Bigrams
      4. Counting Other Things
    4. Back to Python: Making Decisions and Taking Control
      1. Conditionals
      2. Operating on Every Element
      3. Nested Code Blocks
      4. Looping with Conditions
    5. Automatic Natural Language Understanding
      1. Word Sense Disambiguation
      2. Pronoun Resolution
      3. Generating Language Output
      4. Machine Translation
      5. Spoken Dialogue Systems
      6. Textual Entailment
      7. Limitations of NLP
    6. Summary
    7. Further Reading
    8. Exercises
  5. 2. Accessing Text Corpora and Lexical Resources
    1. Accessing Text Corpora
      1. Gutenberg Corpus
      2. Web and Chat Text
      3. Brown Corpus
      4. Reuters Corpus
      5. Inaugural Address Corpus
      6. Annotated Text Corpora
      7. Corpora in Other Languages
      8. Text Corpus Structure
      9. Loading Your Own Corpus
    2. Conditional Frequency Distributions
      1. Conditions and Events
      2. Counting Words by Genre
      3. Plotting and Tabulating Distributions
      4. Generating Random Text with Bigrams
    3. More Python: Reusing Code
      1. Creating Programs with a Text Editor
      2. Functions
      3. Modules
    4. Lexical Resources
      1. Wordlist Corpora
      2. A Pronouncing Dictionary
      3. Comparative Wordlists
      4. Shoebox and Toolbox Lexicons
    5. WordNet
      1. Senses and Synonyms
      2. The WordNet Hierarchy
      3. More Lexical Relations
      4. Semantic Similarity
    6. Summary
    7. Further Reading
    8. Exercises
  6. 3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
      1. Electronic Books
      2. Dealing with HTML
      3. Processing Search Engine Results
      4. Processing RSS Feeds
      5. Reading Local Files
      6. Extracting Text from PDF, MSWord, and Other Binary Formats
      7. Capturing User Input
      8. The NLP Pipeline
    2. Strings: Text Processing at the Lowest Level
      1. Basic Operations with Strings
      2. Printing Strings
      3. Accessing Individual Characters
      4. Accessing Substrings
      5. More Operations on Strings
      6. The Difference Between Lists and Strings
    3. Text Processing with Unicode
      1. What Is Unicode?
      2. Extracting Encoded Text from Files
      3. Using Your Local Encoding in Python
    4. Regular Expressions for Detecting Word Patterns
      1. Using Basic Metacharacters
      2. Ranges and Closures
    5. Useful Applications of Regular Expressions
      1. Extracting Word Pieces
      2. Doing More with Word Pieces
      3. Finding Word Stems
      4. Searching Tokenized Text
    6. Normalizing Text
      1. Stemmers
      2. Lemmatization
    7. Regular Expressions for Tokenizing Text
      1. Simple Approaches to Tokenization
      2. NLTK’s Regular Expression Tokenizer
      3. Further Issues with Tokenization
    8. Segmentation
      1. Sentence Segmentation
      2. Word Segmentation
    9. Formatting: From Lists to Strings
      1. From Lists to Strings
      2. Strings and Formats
      3. Lining Things Up
      4. Writing Results to a File
      5. Text Wrapping
    10. Summary
    11. Further Reading
    12. Exercises
  7. 4. Writing Structured Programs
    1. Back to the Basics
      1. Assignment
      2. Equality
      3. Conditionals
    2. Sequences
      1. Operating on Sequence Types
      2. Combining Different Sequence Types
      3. Generator Expressions
    3. Questions of Style
      1. Python Coding Style
      2. Procedural Versus Declarative Style
      3. Some Legitimate Uses for Counters
    4. Functions: The Foundation of Structured Programming
      1. Function Inputs and Outputs
      2. Parameter Passing
      3. Variable Scope
      4. Checking Parameter Types
      5. Functional Decomposition
      6. Documenting Functions
    5. Doing More with Functions
      1. Functions As Arguments
      2. Accumulative Functions
      3. Higher-Order Functions
      4. Named Arguments
    6. Program Development
      1. Structure of a Python Module
      2. Multimodule Programs
      3. Sources of Error
      4. Debugging Techniques
      5. Defensive Programming
    7. Algorithm Design
      1. Recursion
      2. Space-Time Trade-offs
      3. Dynamic Programming
    8. A Sample of Python Libraries
      1. Matplotlib
      2. NetworkX
      3. csv
      4. NumPy
      5. Other Python Libraries
    9. Summary
    10. Further Reading
    11. Exercises
  8. 5. Categorizing and Tagging Words
    1. Using a Tagger
    2. Tagged Corpora
      1. Representing Tagged Tokens
      2. Reading Tagged Corpora
      3. A Simplified Part-of-Speech Tagset
      4. Nouns
      5. Verbs
      6. Adjectives and Adverbs
      7. Unsimplified Tags
      8. Exploring Tagged Corpora
    3. Mapping Words to Properties Using Python Dictionaries
      1. Indexing Lists Versus Dictionaries
      2. Dictionaries in Python
      3. Defining Dictionaries
      4. Default Dictionaries
      5. Incrementally Updating a Dictionary
      6. Complex Keys and Values
      7. Inverting a Dictionary
    4. Automatic Tagging
      1. The Default Tagger
      2. The Regular Expression Tagger
      3. The Lookup Tagger
      4. Evaluation
    5. N-Gram Tagging
      1. Unigram Tagging
      2. Separating the Training and Testing Data
      3. General N-Gram Tagging
      4. Combining Taggers
      5. Tagging Unknown Words
      6. Storing Taggers
      7. Performance Limitations
      8. Tagging Across Sentence Boundaries
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
      1. Morphological Clues
      2. Syntactic Clues
      3. Semantic Clues
      4. New Words
      5. Morphology in Part-of-Speech Tagsets
    8. Summary
    9. Further Reading
    10. Exercises
  9. 6. Learning to Classify Text
    1. Supervised Classification
      1. Gender Identification
      2. Choosing the Right Features
      3. Document Classification
      4. Part-of-Speech Tagging
      5. Exploiting Context
      6. Sequence Classification
      7. Other Methods for Sequence Classification
    2. Further Examples of Supervised Classification
      1. Sentence Segmentation
      2. Identifying Dialogue Act Types
      3. Recognizing Textual Entailment
      4. Scaling Up to Large Datasets
    3. Evaluation
      1. The Test Set
      2. Accuracy
      3. Precision and Recall
      4. Confusion Matrices
      5. Cross-Validation
    4. Decision Trees
      1. Entropy and Information Gain
    5. Naive Bayes Classifiers
      1. Underlying Probabilistic Model
      2. Zero Counts and Smoothing
      3. Non-Binary Features
      4. The Naivete of Independence
      5. The Cause of Double-Counting
    6. Maximum Entropy Classifiers
      1. The Maximum Entropy Model
      2. Maximizing Entropy
      3. Generative Versus Conditional Classifiers
    7. Modeling Linguistic Patterns
      1. What Do Models Tell Us?
    8. Summary
    9. Further Reading
    10. Exercises
  10. 7. Extracting Information from Text
    1. Information Extraction
      1. Information Extraction Architecture
    2. Chunking
      1. Noun Phrase Chunking
      2. Tag Patterns
      3. Chunking with Regular Expressions
      4. Exploring Text Corpora
      5. Chinking
      6. Representing Chunks: Tags Versus Trees
    3. Developing and Evaluating Chunkers
      1. Reading IOB Format and the CoNLL-2000 Chunking Corpus
      2. Simple Evaluation and Baselines
      3. Training Classifier-Based Chunkers
    4. Recursion in Linguistic Structure
      1. Building Nested Structure with Cascaded Chunkers
      2. Trees
      3. Tree Traversal
    5. Named Entity Recognition
    6. Relation Extraction
    7. Summary
    8. Further Reading
    9. Exercises
  11. 8. Analyzing Sentence Structure
    1. Some Grammatical Dilemmas
      1. Linguistic Data and Unlimited Possibilities
      2. Ubiquitous Ambiguity
    2. What’s the Use of Syntax?
      1. Beyond n-grams
    3. Context-Free Grammar
      1. A Simple Grammar
      2. Writing Your Own Grammars
      3. Recursion in Syntactic Structure
    4. Parsing with Context-Free Grammar
      1. Recursive Descent Parsing
      2. Shift-Reduce Parsing
      3. The Left-Corner Parser
      4. Well-Formed Substring Tables
    5. Dependencies and Dependency Grammar
      1. Valency and the Lexicon
      2. Scaling Up
    6. Grammar Development
      1. Treebanks and Grammars
      2. Pernicious Ambiguity
      3. Weighted Grammar
    7. Summary
    8. Further Reading
    9. Exercises
  12. 9. Building Feature-Based Grammars
    1. Grammatical Features
      1. Syntactic Agreement
      2. Using Attributes and Constraints
      3. Terminology
    2. Processing Feature Structures
      1. Subsumption and Unification
    3. Extending a Feature-Based Grammar
      1. Subcategorization
      2. Heads Revisited
      3. Auxiliary Verbs and Inversion
      4. Unbounded Dependency Constructions
      5. Case and Gender in German
    4. Summary
    5. Further Reading
    6. Exercises
  13. 10. Analyzing the Meaning of Sentences
    1. Natural Language Understanding
      1. Querying a Database
      2. Natural Language, Semantics, and Logic
    2. Propositional Logic
    3. First-Order Logic
      1. Syntax
      2. First-Order Theorem Proving
      3. Summarizing the Language of First-Order Logic
      4. Truth in Model
      5. Individual Variables and Assignments
      6. Quantification
      7. Quantifier Scope Ambiguity
      8. Model Building
    4. The Semantics of English Sentences
      1. Compositional Semantics in Feature-Based Grammar
      2. The λ-Calculus
      3. Quantified NPs
      4. Transitive Verbs
      5. Quantifier Ambiguity Revisited
    5. Discourse Semantics
      1. Discourse Representation Theory
      2. Discourse Processing
    6. Summary
    7. Further Reading
    8. Exercises
  14. 11. Managing Linguistic Data
    1. Corpus Structure: A Case Study
      1. The Structure of TIMIT
      2. Notable Design Features
      3. Fundamental Data Types
    2. The Life Cycle of a Corpus
      1. Three Corpus Creation Scenarios
      2. Quality Control
      3. Curation Versus Evolution
    3. Acquiring Data
      1. Obtaining Data from the Web
      2. Obtaining Data from Word Processor Files
      3. Obtaining Data from Spreadsheets and Databases
      4. Converting Data Formats
      5. Deciding Which Layers of Annotation to Include
      6. Standards and Tools
      7. Special Considerations When Working with Endangered Languages
    4. Working with XML
      1. Using XML for Linguistic Structures
      2. The Role of XML
      3. The ElementTree Interface
      4. Using ElementTree for Accessing Toolbox Data
      5. Formatting Entries
    5. Working with Toolbox Data
      1. Adding a Field to Each Entry
      2. Validating a Toolbox Lexicon
    6. Describing Language Resources Using OLAC Metadata
      1. What Is Metadata?
      2. OLAC: Open Language Archives Community
    7. Summary
    8. Further Reading
    9. Exercises
  15. A. Afterword: The Language Challenge
    1. Language Processing Versus Symbol Processing
    2. Contemporary Philosophical Divides
    3. NLTK Roadmap
    4. Envoi...
  16. B. Bibliography
  17. NLTK Index
  18. General Index
  19. About the Authors
  20. Colophon
  21. SPECIAL OFFER: Upgrade this ebook with O’Reilly

Automatic Natural Language Understanding

We have been exploring language bottom-up, with the help of texts and the Python programming language. However, we’re also interested in exploiting our knowledge of language and computation by building useful language technologies. We’ll take the opportunity now to step back from the nitty-gritty of code in order to paint a bigger picture of natural language processing.

At a purely practical level, we all need help to navigate the universe of information locked up in text on the Web. Search engines have been crucial to the growth and popularity of the Web, but have some shortcomings. It takes skill, knowledge, and some luck, to extract answers to such questions as: What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget? What do experts say about digital SLR cameras? What predictions about the steel market were made by credible commentators in the past week? Getting a computer to answer them automatically involves a range of language processing tasks, including information extraction, inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is still beyond our current capabilities.

On a more philosophical level, a long-standing challenge within artificial intelligence has been to build intelligent machines, and a major part of intelligent behavior is understanding language. For many years this goal has been seen as too difficult. However, as NLP technologies become more mature, and robust methods for analyzing unrestricted text become more widespread, the prospect of natural language understanding has re-emerged as a plausible goal.

In this section we describe some language understanding technologies, to give you a sense of the interesting challenges that are waiting for you.

Word Sense Disambiguation

In word sense disambiguation we want to work out which sense of a word was intended in a given context. Consider the ambiguous words serve and dish:

Example 1-2. 

  1. serve: help with food or drink; hold an office; put ball into play

  2. dish: plate; course of a meal; communications device

In a sentence containing the phrase: he served the dish, you can detect that both serve and dish are being used with their food meanings. It’s unlikely that the topic of discussion shifted from sports to crockery in the space of three words. This would force you to invent bizarre images, like a tennis pro taking out his frustrations on a china tea-set laid out beside the court. In other words, we automatically disambiguate words using context, exploiting the simple fact that nearby words have closely related meanings. As another example of this contextual effect, consider the word by, which has several meanings, for example, the book by Chesterton (agentive—Chesterton was the author of the book); the cup by the stove (locative—the stove is where the cup is); and submit by Friday (temporal—Friday is the time of the submitting). Observe in Example 1-3 that the meaning of the italicized word helps us interpret the meaning of by.

Example 1-3. 

  1. The lost children were found by the searchers (agentive)

  2. The lost children were found by the mountain (locative)

  3. The lost children were found by the afternoon (temporal)

Pronoun Resolution

A deeper kind of language understanding is to work out “who did what to whom,” i.e., to detect the subjects and objects of verbs. You learned to do this in elementary school, but it’s harder than you might think. In the sentence the thieves stole the paintings, it is easy to tell who performed the stealing action. Consider three possible following sentences in Example 1-4, and try to determine what was sold, caught, and found (one case is ambiguous).

Example 1-4. 

  1. The thieves stole the paintings. They were subsequently sold.

  2. The thieves stole the paintings. They were subsequently caught.

  3. The thieves stole the paintings. They were subsequently found.

Answering this question involves finding the antecedent of the pronoun they, either thieves or paintings. Computational techniques for tackling this problem include anaphora resolution—identifying what a pronoun or noun phrase refers to—and semantic role labeling—identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on).

Generating Language Output

If we can automatically solve such problems of language understanding, we will be able to move on to tasks that involve generating language output, such as question answering and machine translation. In the first case, a machine should be able to answer a user’s questions relating to collection of texts:

Example 1-5. 

  1. Text: ... The thieves stole the paintings. They were subsequently sold. ...

  2. Human: Who or what was sold?

  3. Machine: The paintings.

The machine’s answer demonstrates that it has correctly worked out that they refers to paintings and not to thieves. In the second case, the machine should be able to translate the text into another language, accurately conveying the meaning of the original text. In translating the example text into French, we are forced to choose the gender of the pronoun in the second sentence: ils (masculine) if the thieves are sold, and elles (feminine) if the paintings are sold. Correct translation actually depends on correct understanding of the pronoun.

Example 1-6. 

  1. The thieves stole the paintings. They were subsequently found.

  2. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)

  3. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

In all of these examples, working out the sense of a word, the subject of a verb, and the antecedent of a pronoun are steps in establishing the meaning of a sentence, things we would expect a language understanding system to be able to do.

Machine Translation

For a long time now, machine translation (MT) has been the holy grail of language understanding, ultimately seeking to provide high-quality, idiomatic translation between any pair of languages. Its roots go back to the early days of the Cold War, when the promise of automatic translation led to substantial government sponsorship, and with it, the genesis of NLP itself.

Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings. We can explore them with the help of NLTK’s “babelizer” (which is automatically loaded when you import this chapter’s materials using from import *). This program submits a sentence for translation into a specified language, then submits the resulting sentence for translation back into English. It stops after 12 iterations, or if it receives a translation that was produced already (indicating a loop):

>>> babelize_shell()
NLTK Babelizer: type 'help' for a list of commands.
Babel> how long before the next flight to Alice Springs?
Babel> german
Babel> run
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?

Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become non-sensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure. The grammatical problems are more obvious in the following example. Did John find the pig, or did the pig find John?

>>> babelize_shell()
Babel> The pig that John found looked happy
Babel> german
Babel> run
0> The pig that John found looked happy
1> Das Schwein, das John fand, schaute gl?cklich
2> The pig, which found John, looked happy

Machine translation is difficult because a given word could have several possible translations (depending on its meaning), and because word order must be changed in keeping with the grammatical structure of the target language. Today these difficulties are being faced by collecting massive quantities of parallel texts from news and government websites that publish documents in two or more languages. Given a document in German and English, and possibly a bilingual dictionary, we can automatically pair up the sentences, a process called text alignment. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used for translating new text.

Spoken Dialogue Systems

In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user’s text input, perform so naturally that we cannot distinguish it from a human-generated response? In contrast, today’s commercial dialogue systems are very limited, but still perform useful functions in narrowly defined domains, as we see here:

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but
it’s playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.

You could not ask this system to provide driving instructions or details of nearby restaurants unless the required information had already been stored and suitable question-answer pairs had been incorporated into the language processing system.

Observe that this system seems to understand the user’s goals: the user asks when a movie is showing and the system correctly determines from this that the user wants to see the movie. This inference seems so obvious that you probably didn’t notice it was made, yet a natural language system needs to be endowed with this capability in order to interact naturally. Without it, when asked, Do you know when Saving Private Ryan is playing?, a system might unhelpfully respond with a cold Yes. However, the developers of commercial dialogue systems use contextual assumptions and business logic to ensure that the different ways in which a user might express requests or provide information are handled in a way that makes sense for the particular application. So, if you type When is ..., or I want to know when ..., or Can you tell me when ..., simple rules will always yield screening times. This is enough for the system to provide a useful service.

Dialogue systems give us an opportunity to mention the commonly assumed pipeline for NLP. Figure 1-5 shows the architecture of a simple dialogue system. Along the top of the diagram, moving from left to right, is a “pipeline” of some language understanding components. These map from speech input via syntactic parsing to some kind of meaning representation. Along the middle, moving from right to left, is the reverse pipeline of components for converting concepts to speech. These components make up the dynamic aspects of the system. At the bottom of the diagram are some representative bodies of static information: the repositories of language-related data that the processing components draw on to do their work.

Simple pipeline architecture for a spoken dialogue system: Spoken input (top left) is analyzed, words are recognized, sentences are parsed and interpreted in context, application-specific actions take place (top right); a response is planned, realized as a syntactic structure, then to suitably inflected words, and finally to spoken output; different types of linguistic knowledge inform each stage of the process.

Figure 1-5. Simple pipeline architecture for a spoken dialogue system: Spoken input (top left) is analyzed, words are recognized, sentences are parsed and interpreted in context, application-specific actions take place (top right); a response is planned, realized as a syntactic structure, then to suitably inflected words, and finally to spoken output; different types of linguistic knowledge inform each stage of the process.


Your Turn: For an example of a primitive dialogue system, try having a conversation with an NLTK chatbot. To see the available chatbots, run (Remember to import nltk first.)

Textual Entailment

The challenge of language understanding has been brought into focus in recent years by a public “shared task” called Recognizing Textual Entailment (RTE). The basic scenario is simple. Suppose you want to find evidence to support the hypothesis: Sandra Goudie was defeated by Max Purnell, and that you have another short text that seems to be relevant, for example, Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place. Does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer will be “No.” You can draw this conclusion easily, but it is very hard to come up with automated methods for making the right decision. The RTE Challenges provide data that allow competitors to develop their systems, but not enough data for “brute force” machine learning techniques (a topic we will cover in Chapter 6). Consequently, some linguistic analysis is crucial. In the previous example, it is important for the system to note that Sandra Goudie names the person being defeated in the hypothesis, not the person doing the defeating in the text. As another illustration of the difficulty of the task, consider the following text-hypothesis pair:

Example 1-7. 

  1. Text: David Golinkin is the editor or author of 18 books, and over 150 responsa, articles, sermons and books

  2. Hypothesis: Golinkin has written 18 books

In order to determine whether the hypothesis is supported by the text, the system needs the following background knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of 18 books, then one cannot conclude that he/she is author of 18 books.

Limitations of NLP

Despite the research-led advances in tasks such as RTE, natural language systems that have been deployed for real-world applications still cannot perform common-sense reasoning or draw on world knowledge in a general and robust manner. We can wait for these difficult artificial intelligence problems to be solved, but in the meantime it is necessary to live with some severe limitations on the reasoning and knowledge capabilities of natural language systems. Accordingly, right from the beginning, an important goal of NLP research has been to make progress on the difficult task of building technologies that “understand language,” using superficial yet powerful techniques instead of unrestricted knowledge and reasoning capabilities. Indeed, this is one of the goals of this book, and we hope to equip you with the knowledge and skills to build useful NLP systems, and to contribute to the long-term aspiration of building intelligent machines.

The best content for your career. Discover unlimited learning on demand for around $1/day.