You are previewing Natural Language Processing with Python.

Natural Language Processing with Python

Cover of Natural Language Processing with Python by Ewan Klein... Published by O'Reilly Media, Inc.
  1. Natural Language Processing with Python
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Audience
    2. Emphasis
    3. What You Will Learn
    4. Organization
    5. Why Python?
    6. Software Requirements
    7. Natural Language Toolkit (NLTK)
    8. For Instructors
    9. Conventions Used in This Book
    10. Using Code Examples
    11. Safari® Books Online
    12. How to Contact Us
    13. Acknowledgments
    14. Royalties
  4. 1. Language Processing and Python
    1. Computing with Language: Texts and Words
      1. Getting Started with Python
      2. Getting Started with NLTK
      3. Searching Text
      4. Counting Vocabulary
    2. A Closer Look at Python: Texts as Lists of Words
      1. Lists
      2. Indexing Lists
      3. Variables
      4. Strings
    3. Computing with Language: Simple Statistics
      1. Frequency Distributions
      2. Fine-Grained Selection of Words
      3. Collocations and Bigrams
      4. Counting Other Things
    4. Back to Python: Making Decisions and Taking Control
      1. Conditionals
      2. Operating on Every Element
      3. Nested Code Blocks
      4. Looping with Conditions
    5. Automatic Natural Language Understanding
      1. Word Sense Disambiguation
      2. Pronoun Resolution
      3. Generating Language Output
      4. Machine Translation
      5. Spoken Dialogue Systems
      6. Textual Entailment
      7. Limitations of NLP
    6. Summary
    7. Further Reading
    8. Exercises
  5. 2. Accessing Text Corpora and Lexical Resources
    1. Accessing Text Corpora
      1. Gutenberg Corpus
      2. Web and Chat Text
      3. Brown Corpus
      4. Reuters Corpus
      5. Inaugural Address Corpus
      6. Annotated Text Corpora
      7. Corpora in Other Languages
      8. Text Corpus Structure
      9. Loading Your Own Corpus
    2. Conditional Frequency Distributions
      1. Conditions and Events
      2. Counting Words by Genre
      3. Plotting and Tabulating Distributions
      4. Generating Random Text with Bigrams
    3. More Python: Reusing Code
      1. Creating Programs with a Text Editor
      2. Functions
      3. Modules
    4. Lexical Resources
      1. Wordlist Corpora
      2. A Pronouncing Dictionary
      3. Comparative Wordlists
      4. Shoebox and Toolbox Lexicons
    5. WordNet
      1. Senses and Synonyms
      2. The WordNet Hierarchy
      3. More Lexical Relations
      4. Semantic Similarity
    6. Summary
    7. Further Reading
    8. Exercises
  6. 3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
      1. Electronic Books
      2. Dealing with HTML
      3. Processing Search Engine Results
      4. Processing RSS Feeds
      5. Reading Local Files
      6. Extracting Text from PDF, MSWord, and Other Binary Formats
      7. Capturing User Input
      8. The NLP Pipeline
    2. Strings: Text Processing at the Lowest Level
      1. Basic Operations with Strings
      2. Printing Strings
      3. Accessing Individual Characters
      4. Accessing Substrings
      5. More Operations on Strings
      6. The Difference Between Lists and Strings
    3. Text Processing with Unicode
      1. What Is Unicode?
      2. Extracting Encoded Text from Files
      3. Using Your Local Encoding in Python
    4. Regular Expressions for Detecting Word Patterns
      1. Using Basic Metacharacters
      2. Ranges and Closures
    5. Useful Applications of Regular Expressions
      1. Extracting Word Pieces
      2. Doing More with Word Pieces
      3. Finding Word Stems
      4. Searching Tokenized Text
    6. Normalizing Text
      1. Stemmers
      2. Lemmatization
    7. Regular Expressions for Tokenizing Text
      1. Simple Approaches to Tokenization
      2. NLTK’s Regular Expression Tokenizer
      3. Further Issues with Tokenization
    8. Segmentation
      1. Sentence Segmentation
      2. Word Segmentation
    9. Formatting: From Lists to Strings
      1. From Lists to Strings
      2. Strings and Formats
      3. Lining Things Up
      4. Writing Results to a File
      5. Text Wrapping
    10. Summary
    11. Further Reading
    12. Exercises
  7. 4. Writing Structured Programs
    1. Back to the Basics
      1. Assignment
      2. Equality
      3. Conditionals
    2. Sequences
      1. Operating on Sequence Types
      2. Combining Different Sequence Types
      3. Generator Expressions
    3. Questions of Style
      1. Python Coding Style
      2. Procedural Versus Declarative Style
      3. Some Legitimate Uses for Counters
    4. Functions: The Foundation of Structured Programming
      1. Function Inputs and Outputs
      2. Parameter Passing
      3. Variable Scope
      4. Checking Parameter Types
      5. Functional Decomposition
      6. Documenting Functions
    5. Doing More with Functions
      1. Functions As Arguments
      2. Accumulative Functions
      3. Higher-Order Functions
      4. Named Arguments
    6. Program Development
      1. Structure of a Python Module
      2. Multimodule Programs
      3. Sources of Error
      4. Debugging Techniques
      5. Defensive Programming
    7. Algorithm Design
      1. Recursion
      2. Space-Time Trade-offs
      3. Dynamic Programming
    8. A Sample of Python Libraries
      1. Matplotlib
      2. NetworkX
      3. csv
      4. NumPy
      5. Other Python Libraries
    9. Summary
    10. Further Reading
    11. Exercises
  8. 5. Categorizing and Tagging Words
    1. Using a Tagger
    2. Tagged Corpora
      1. Representing Tagged Tokens
      2. Reading Tagged Corpora
      3. A Simplified Part-of-Speech Tagset
      4. Nouns
      5. Verbs
      6. Adjectives and Adverbs
      7. Unsimplified Tags
      8. Exploring Tagged Corpora
    3. Mapping Words to Properties Using Python Dictionaries
      1. Indexing Lists Versus Dictionaries
      2. Dictionaries in Python
      3. Defining Dictionaries
      4. Default Dictionaries
      5. Incrementally Updating a Dictionary
      6. Complex Keys and Values
      7. Inverting a Dictionary
    4. Automatic Tagging
      1. The Default Tagger
      2. The Regular Expression Tagger
      3. The Lookup Tagger
      4. Evaluation
    5. N-Gram Tagging
      1. Unigram Tagging
      2. Separating the Training and Testing Data
      3. General N-Gram Tagging
      4. Combining Taggers
      5. Tagging Unknown Words
      6. Storing Taggers
      7. Performance Limitations
      8. Tagging Across Sentence Boundaries
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
      1. Morphological Clues
      2. Syntactic Clues
      3. Semantic Clues
      4. New Words
      5. Morphology in Part-of-Speech Tagsets
    8. Summary
    9. Further Reading
    10. Exercises
  9. 6. Learning to Classify Text
    1. Supervised Classification
      1. Gender Identification
      2. Choosing the Right Features
      3. Document Classification
      4. Part-of-Speech Tagging
      5. Exploiting Context
      6. Sequence Classification
      7. Other Methods for Sequence Classification
    2. Further Examples of Supervised Classification
      1. Sentence Segmentation
      2. Identifying Dialogue Act Types
      3. Recognizing Textual Entailment
      4. Scaling Up to Large Datasets
    3. Evaluation
      1. The Test Set
      2. Accuracy
      3. Precision and Recall
      4. Confusion Matrices
      5. Cross-Validation
    4. Decision Trees
      1. Entropy and Information Gain
    5. Naive Bayes Classifiers
      1. Underlying Probabilistic Model
      2. Zero Counts and Smoothing
      3. Non-Binary Features
      4. The Naivete of Independence
      5. The Cause of Double-Counting
    6. Maximum Entropy Classifiers
      1. The Maximum Entropy Model
      2. Maximizing Entropy
      3. Generative Versus Conditional Classifiers
    7. Modeling Linguistic Patterns
      1. What Do Models Tell Us?
    8. Summary
    9. Further Reading
    10. Exercises
  10. 7. Extracting Information from Text
    1. Information Extraction
      1. Information Extraction Architecture
    2. Chunking
      1. Noun Phrase Chunking
      2. Tag Patterns
      3. Chunking with Regular Expressions
      4. Exploring Text Corpora
      5. Chinking
      6. Representing Chunks: Tags Versus Trees
    3. Developing and Evaluating Chunkers
      1. Reading IOB Format and the CoNLL-2000 Chunking Corpus
      2. Simple Evaluation and Baselines
      3. Training Classifier-Based Chunkers
    4. Recursion in Linguistic Structure
      1. Building Nested Structure with Cascaded Chunkers
      2. Trees
      3. Tree Traversal
    5. Named Entity Recognition
    6. Relation Extraction
    7. Summary
    8. Further Reading
    9. Exercises
  11. 8. Analyzing Sentence Structure
    1. Some Grammatical Dilemmas
      1. Linguistic Data and Unlimited Possibilities
      2. Ubiquitous Ambiguity
    2. What’s the Use of Syntax?
      1. Beyond n-grams
    3. Context-Free Grammar
      1. A Simple Grammar
      2. Writing Your Own Grammars
      3. Recursion in Syntactic Structure
    4. Parsing with Context-Free Grammar
      1. Recursive Descent Parsing
      2. Shift-Reduce Parsing
      3. The Left-Corner Parser
      4. Well-Formed Substring Tables
    5. Dependencies and Dependency Grammar
      1. Valency and the Lexicon
      2. Scaling Up
    6. Grammar Development
      1. Treebanks and Grammars
      2. Pernicious Ambiguity
      3. Weighted Grammar
    7. Summary
    8. Further Reading
    9. Exercises
  12. 9. Building Feature-Based Grammars
    1. Grammatical Features
      1. Syntactic Agreement
      2. Using Attributes and Constraints
      3. Terminology
    2. Processing Feature Structures
      1. Subsumption and Unification
    3. Extending a Feature-Based Grammar
      1. Subcategorization
      2. Heads Revisited
      3. Auxiliary Verbs and Inversion
      4. Unbounded Dependency Constructions
      5. Case and Gender in German
    4. Summary
    5. Further Reading
    6. Exercises
  13. 10. Analyzing the Meaning of Sentences
    1. Natural Language Understanding
      1. Querying a Database
      2. Natural Language, Semantics, and Logic
    2. Propositional Logic
    3. First-Order Logic
      1. Syntax
      2. First-Order Theorem Proving
      3. Summarizing the Language of First-Order Logic
      4. Truth in Model
      5. Individual Variables and Assignments
      6. Quantification
      7. Quantifier Scope Ambiguity
      8. Model Building
    4. The Semantics of English Sentences
      1. Compositional Semantics in Feature-Based Grammar
      2. The λ-Calculus
      3. Quantified NPs
      4. Transitive Verbs
      5. Quantifier Ambiguity Revisited
    5. Discourse Semantics
      1. Discourse Representation Theory
      2. Discourse Processing
    6. Summary
    7. Further Reading
    8. Exercises
  14. 11. Managing Linguistic Data
    1. Corpus Structure: A Case Study
      1. The Structure of TIMIT
      2. Notable Design Features
      3. Fundamental Data Types
    2. The Life Cycle of a Corpus
      1. Three Corpus Creation Scenarios
      2. Quality Control
      3. Curation Versus Evolution
    3. Acquiring Data
      1. Obtaining Data from the Web
      2. Obtaining Data from Word Processor Files
      3. Obtaining Data from Spreadsheets and Databases
      4. Converting Data Formats
      5. Deciding Which Layers of Annotation to Include
      6. Standards and Tools
      7. Special Considerations When Working with Endangered Languages
    4. Working with XML
      1. Using XML for Linguistic Structures
      2. The Role of XML
      3. The ElementTree Interface
      4. Using ElementTree for Accessing Toolbox Data
      5. Formatting Entries
    5. Working with Toolbox Data
      1. Adding a Field to Each Entry
      2. Validating a Toolbox Lexicon
    6. Describing Language Resources Using OLAC Metadata
      1. What Is Metadata?
      2. OLAC: Open Language Archives Community
    7. Summary
    8. Further Reading
    9. Exercises
  15. A. Afterword: The Language Challenge
    1. Language Processing Versus Symbol Processing
    2. Contemporary Philosophical Divides
    3. NLTK Roadmap
    4. Envoi...
  16. B. Bibliography
  17. NLTK Index
  18. General Index
  19. About the Authors
  20. Colophon
  21. SPECIAL OFFER: Upgrade this ebook with O’Reilly

Computing with Language: Simple Statistics

Let’s return to our exploration of the ways we can bring our computational resources to bear on large quantities of text. We began this discussion in Computing with Language: Texts and Words, and saw how to search for words in context, how to compile the vocabulary of a text, how to generate random text in the same style, and so on.

In this section, we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and expressions of a text. As in Computing with Language: Texts and Words, you can try new features of the Python language by copying them into the interpreter, and you’ll learn about these features systematically in the following section.

Before continuing further, you might like to check your understanding of the last section by predicting the output of the following code. You can use the interpreter to check whether you got it right. If you’re not sure how to do this task, it would be a good idea to review the previous section before continuing further.

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done',
...           'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
what output do you expect here?

Frequency Distributions

How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in Figure 1-3. The tally would need thousands of rows, and it would be an exceedingly laborious process—so laborious that we would rather assign the task to a machine.

Counting words appearing in a text (a frequency distribution).

Figure 1-3. Counting words appearing in a text (a frequency distribution).

The table in Figure 1-3 is known as a frequency distribution , and it tells us the frequency of each vocabulary item in the text. (In general, it could count any kind of observable event.) It is a “distribution” since it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let’s use a FreqDist to find the 50 most frequent words of Moby Dick. Try to work out what is going on here, then read the explanation that follows.

>>> fdist1 = FreqDist(text1) 1
>>> fdist1 2
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys() 3
>>> vocabulary1[:50] 4
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']

When we first invoke FreqDist, we pass the name of the text as an argument 1. We can inspect the total number of words (“outcomes”) that have been counted up 2—260,819 in the case of Moby Dick. The expression keys() gives us a list of all the distinct types in the text 3, and we can look at the first 50 of these by slicing the list 4.


Your Turn: Try the preceding frequency distribution example for yourself, for text2. Be careful to use the correct parentheses and uppercase letters. If you get an error message NameError: name 'FreqDist' is not defined, you need to start your work with from import *.

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they’re just English “plumbing.” What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(50, cumulative=True), to produce the graph in Figure 1-4. These 50 words account for nearly half the book!

Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which account for nearly half of the tokens.

Figure 1-4. Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which account for nearly half of the tokens.

If the frequent words don’t help us, how about the words that occur once only, the so-called hapaxes? View them by typing fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can’t guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else.

Fine-Grained Selection of Words

Next, let’s look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let’s call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation as shown in a. This means “the set of all w such that w is an element of V (the vocabulary) and w has property P.”

Example 1-1. 

  1. {w | wV & P(w)}

  2. [w for w in V if p(w)]

The corresponding Python expression is given in b. (Note that it produces a list, not a set, which means that duplicates are possible.) Observe how similar the two notations are. Let’s go one more step and write executable Python code:

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

For each word w in the vocabulary V, we check whether len(w) is greater than 15; all other words will be ignored. We will discuss this syntax more carefully later.


Your Turn: Try out the previous statements in the Python interpreter, and experiment with changing the text and changing the length condition. Does it make an difference to your results if you change the variable names, e.g., using [word for word in vocab if ...]?

Let’s return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its national focus—constitutionally, transcontinental—whereas those in text5 reflect its informal content: boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically extracting words that typify a text? Well, these very long words are often hapaxes (i.e., unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g., the) and infrequent long words (e.g., antiphilosophists). Here are all words from the chat corpus that are longer than seven characters, that occur more than seven times:

>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven letters, and fdist5[w] > 7 ensures that these words occur more than seven times. At last we have managed to automatically identify the frequently occurring content-bearing words of the text. It is a modest but important milestone: a tiny piece of code, processing tens of thousands of words, produces some informative output.

Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds very odd.

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

Here we see that the pair of words than-done is a bigram, and we write it in Python as ('than', 'done'). Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of individual words. The collocations() function does this for us (we will see how it works later):

>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General
Government; American people; Vice President; Almighty God; Fellow
citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes;
public debt; foreign nations; political parties; State governments;
National Government; United Nations; public money
>>> text8.collocations()
Building collocations list
medium build; social drinker; quiet nights; long term; age open;
financially secure; fun times; similar interests; Age open; poss
rship; single mum; permanent relationship; slim build; seeks lady;
Late 30s; Photo pls; Vibrant personality; European background; ASIAN
LADY; country drives

The collocations that emerge are very specific to the genre of the texts. In order to find red wine as a collocation, we would need to process a much larger body of text.

Counting Other Things

Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:

>>> [len(w) for w in text1] 1
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist([len(w) for w in text1])  2
>>> fdist  3
<FreqDist with 260819 outcomes>
>>> fdist.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]

We start by deriving a list of the lengths of words in text1 1, and the FreqDist then counts the number of times each of these occurs 2. The result 3 is a distribution containing a quarter of a million items, each of which is a number corresponding to a word token in the text. But there are only 20 distinct items being counted, the numbers 1 through 20, because there are only 20 different word lengths. I.e., there are words consisting of just 1 character, 2 characters, ..., 20 characters, but none with 21 or more characters. One might wonder how frequent the different lengths of words are (e.g., how many words of length 4 appear in the text, are there more words of length 5 than length 4, etc.). We can do this as follows:

>>> fdist.items()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
>>> fdist[3]
>>> fdist.freq(3)

From this we see that the most frequent word length is 3, and that words of length 3 account for roughly 50,000 (or 20%) of the words making up the book. Although we will not pursue it here, further analysis of word length might help us understand differences between authors, genres, or languages. Table 1-2 summarizes the functions defined in frequency distributions.

Table 1-2. Functions defined for NLTK’s frequency distributions



fdist = FreqDist(samples)

Create a frequency distribution containing the given samples

Increment the count for this sample


Count of the number of times a given sample occurred


Frequency of a given sample


Total number of samples


The samples sorted in order of decreasing frequency

for sample in fdist:

Iterate over the samples, in order of decreasing frequency


Sample with the greatest count


Tabulate the frequency distribution


Graphical plot of the frequency distribution


Cumulative plot of the frequency distribution

fdist1 < fdist2

Test if samples in fdist1 occur less frequently than in fdist2

Our discussion of frequency distributions has introduced some important Python concepts, and we will look at them systematically in Back to Python: Making Decisions and Taking Control.

The best content for your career. Discover unlimited learning on demand for around $1/day.