Natural Language Processing with Python

By Steven Bird, Ewan Klein, and Edward Loper. Published by O'Reilly Media, Inc.
  1. Natural Language Processing with Python
  3. Preface
    1. Audience
    2. Emphasis
    3. What You Will Learn
    4. Organization
    5. Why Python?
    6. Software Requirements
    7. Natural Language Toolkit (NLTK)
    8. For Instructors
    9. Conventions Used in This Book
    10. Using Code Examples
    11. Safari® Books Online
    12. How to Contact Us
    13. Acknowledgments
    14. Royalties
  4. 1. Language Processing and Python
    1. Computing with Language: Texts and Words
      1. Getting Started with Python
      2. Getting Started with NLTK
      3. Searching Text
      4. Counting Vocabulary
    2. A Closer Look at Python: Texts as Lists of Words
      1. Lists
      2. Indexing Lists
      3. Variables
      4. Strings
    3. Computing with Language: Simple Statistics
      1. Frequency Distributions
      2. Fine-Grained Selection of Words
      3. Collocations and Bigrams
      4. Counting Other Things
    4. Back to Python: Making Decisions and Taking Control
      1. Conditionals
      2. Operating on Every Element
      3. Nested Code Blocks
      4. Looping with Conditions
    5. Automatic Natural Language Understanding
      1. Word Sense Disambiguation
      2. Pronoun Resolution
      3. Generating Language Output
      4. Machine Translation
      5. Spoken Dialogue Systems
      6. Textual Entailment
      7. Limitations of NLP
    6. Summary
    7. Further Reading
    8. Exercises
  5. 2. Accessing Text Corpora and Lexical Resources
    1. Accessing Text Corpora
      1. Gutenberg Corpus
      2. Web and Chat Text
      3. Brown Corpus
      4. Reuters Corpus
      5. Inaugural Address Corpus
      6. Annotated Text Corpora
      7. Corpora in Other Languages
      8. Text Corpus Structure
      9. Loading Your Own Corpus
    2. Conditional Frequency Distributions
      1. Conditions and Events
      2. Counting Words by Genre
      3. Plotting and Tabulating Distributions
      4. Generating Random Text with Bigrams
    3. More Python: Reusing Code
      1. Creating Programs with a Text Editor
      2. Functions
      3. Modules
    4. Lexical Resources
      1. Wordlist Corpora
      2. A Pronouncing Dictionary
      3. Comparative Wordlists
      4. Shoebox and Toolbox Lexicons
    5. WordNet
      1. Senses and Synonyms
      2. The WordNet Hierarchy
      3. More Lexical Relations
      4. Semantic Similarity
    6. Summary
    7. Further Reading
    8. Exercises
  6. 3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
      1. Electronic Books
      2. Dealing with HTML
      3. Processing Search Engine Results
      4. Processing RSS Feeds
      5. Reading Local Files
      6. Extracting Text from PDF, MSWord, and Other Binary Formats
      7. Capturing User Input
      8. The NLP Pipeline
    2. Strings: Text Processing at the Lowest Level
      1. Basic Operations with Strings
      2. Printing Strings
      3. Accessing Individual Characters
      4. Accessing Substrings
      5. More Operations on Strings
      6. The Difference Between Lists and Strings
    3. Text Processing with Unicode
      1. What Is Unicode?
      2. Extracting Encoded Text from Files
      3. Using Your Local Encoding in Python
    4. Regular Expressions for Detecting Word Patterns
      1. Using Basic Metacharacters
      2. Ranges and Closures
    5. Useful Applications of Regular Expressions
      1. Extracting Word Pieces
      2. Doing More with Word Pieces
      3. Finding Word Stems
      4. Searching Tokenized Text
    6. Normalizing Text
      1. Stemmers
      2. Lemmatization
    7. Regular Expressions for Tokenizing Text
      1. Simple Approaches to Tokenization
      2. NLTK’s Regular Expression Tokenizer
      3. Further Issues with Tokenization
    8. Segmentation
      1. Sentence Segmentation
      2. Word Segmentation
    9. Formatting: From Lists to Strings
      1. From Lists to Strings
      2. Strings and Formats
      3. Lining Things Up
      4. Writing Results to a File
      5. Text Wrapping
    10. Summary
    11. Further Reading
    12. Exercises
  7. 4. Writing Structured Programs
    1. Back to the Basics
      1. Assignment
      2. Equality
      3. Conditionals
    2. Sequences
      1. Operating on Sequence Types
      2. Combining Different Sequence Types
      3. Generator Expressions
    3. Questions of Style
      1. Python Coding Style
      2. Procedural Versus Declarative Style
      3. Some Legitimate Uses for Counters
    4. Functions: The Foundation of Structured Programming
      1. Function Inputs and Outputs
      2. Parameter Passing
      3. Variable Scope
      4. Checking Parameter Types
      5. Functional Decomposition
      6. Documenting Functions
    5. Doing More with Functions
      1. Functions As Arguments
      2. Accumulative Functions
      3. Higher-Order Functions
      4. Named Arguments
    6. Program Development
      1. Structure of a Python Module
      2. Multimodule Programs
      3. Sources of Error
      4. Debugging Techniques
      5. Defensive Programming
    7. Algorithm Design
      1. Recursion
      2. Space-Time Trade-offs
      3. Dynamic Programming
    8. A Sample of Python Libraries
      1. Matplotlib
      2. NetworkX
      3. csv
      4. NumPy
      5. Other Python Libraries
    9. Summary
    10. Further Reading
    11. Exercises
  8. 5. Categorizing and Tagging Words
    1. Using a Tagger
    2. Tagged Corpora
      1. Representing Tagged Tokens
      2. Reading Tagged Corpora
      3. A Simplified Part-of-Speech Tagset
      4. Nouns
      5. Verbs
      6. Adjectives and Adverbs
      7. Unsimplified Tags
      8. Exploring Tagged Corpora
    3. Mapping Words to Properties Using Python Dictionaries
      1. Indexing Lists Versus Dictionaries
      2. Dictionaries in Python
      3. Defining Dictionaries
      4. Default Dictionaries
      5. Incrementally Updating a Dictionary
      6. Complex Keys and Values
      7. Inverting a Dictionary
    4. Automatic Tagging
      1. The Default Tagger
      2. The Regular Expression Tagger
      3. The Lookup Tagger
      4. Evaluation
    5. N-Gram Tagging
      1. Unigram Tagging
      2. Separating the Training and Testing Data
      3. General N-Gram Tagging
      4. Combining Taggers
      5. Tagging Unknown Words
      6. Storing Taggers
      7. Performance Limitations
      8. Tagging Across Sentence Boundaries
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
      1. Morphological Clues
      2. Syntactic Clues
      3. Semantic Clues
      4. New Words
      5. Morphology in Part-of-Speech Tagsets
    8. Summary
    9. Further Reading
    10. Exercises
  9. 6. Learning to Classify Text
    1. Supervised Classification
      1. Gender Identification
      2. Choosing the Right Features
      3. Document Classification
      4. Part-of-Speech Tagging
      5. Exploiting Context
      6. Sequence Classification
      7. Other Methods for Sequence Classification
    2. Further Examples of Supervised Classification
      1. Sentence Segmentation
      2. Identifying Dialogue Act Types
      3. Recognizing Textual Entailment
      4. Scaling Up to Large Datasets
    3. Evaluation
      1. The Test Set
      2. Accuracy
      3. Precision and Recall
      4. Confusion Matrices
      5. Cross-Validation
    4. Decision Trees
      1. Entropy and Information Gain
    5. Naive Bayes Classifiers
      1. Underlying Probabilistic Model
      2. Zero Counts and Smoothing
      3. Non-Binary Features
      4. The Naivete of Independence
      5. The Cause of Double-Counting
    6. Maximum Entropy Classifiers
      1. The Maximum Entropy Model
      2. Maximizing Entropy
      3. Generative Versus Conditional Classifiers
    7. Modeling Linguistic Patterns
      1. What Do Models Tell Us?
    8. Summary
    9. Further Reading
    10. Exercises
  10. 7. Extracting Information from Text
    1. Information Extraction
      1. Information Extraction Architecture
    2. Chunking
      1. Noun Phrase Chunking
      2. Tag Patterns
      3. Chunking with Regular Expressions
      4. Exploring Text Corpora
      5. Chinking
      6. Representing Chunks: Tags Versus Trees
    3. Developing and Evaluating Chunkers
      1. Reading IOB Format and the CoNLL-2000 Chunking Corpus
      2. Simple Evaluation and Baselines
      3. Training Classifier-Based Chunkers
    4. Recursion in Linguistic Structure
      1. Building Nested Structure with Cascaded Chunkers
      2. Trees
      3. Tree Traversal
    5. Named Entity Recognition
    6. Relation Extraction
    7. Summary
    8. Further Reading
    9. Exercises
  11. 8. Analyzing Sentence Structure
    1. Some Grammatical Dilemmas
      1. Linguistic Data and Unlimited Possibilities
      2. Ubiquitous Ambiguity
    2. What’s the Use of Syntax?
      1. Beyond n-grams
    3. Context-Free Grammar
      1. A Simple Grammar
      2. Writing Your Own Grammars
      3. Recursion in Syntactic Structure
    4. Parsing with Context-Free Grammar
      1. Recursive Descent Parsing
      2. Shift-Reduce Parsing
      3. The Left-Corner Parser
      4. Well-Formed Substring Tables
    5. Dependencies and Dependency Grammar
      1. Valency and the Lexicon
      2. Scaling Up
    6. Grammar Development
      1. Treebanks and Grammars
      2. Pernicious Ambiguity
      3. Weighted Grammar
    7. Summary
    8. Further Reading
    9. Exercises
  12. 9. Building Feature-Based Grammars
    1. Grammatical Features
      1. Syntactic Agreement
      2. Using Attributes and Constraints
      3. Terminology
    2. Processing Feature Structures
      1. Subsumption and Unification
    3. Extending a Feature-Based Grammar
      1. Subcategorization
      2. Heads Revisited
      3. Auxiliary Verbs and Inversion
      4. Unbounded Dependency Constructions
      5. Case and Gender in German
    4. Summary
    5. Further Reading
    6. Exercises
  13. 10. Analyzing the Meaning of Sentences
    1. Natural Language Understanding
      1. Querying a Database
      2. Natural Language, Semantics, and Logic
    2. Propositional Logic
    3. First-Order Logic
      1. Syntax
      2. First-Order Theorem Proving
      3. Summarizing the Language of First-Order Logic
      4. Truth in Model
      5. Individual Variables and Assignments
      6. Quantification
      7. Quantifier Scope Ambiguity
      8. Model Building
    4. The Semantics of English Sentences
      1. Compositional Semantics in Feature-Based Grammar
      2. The λ-Calculus
      3. Quantified NPs
      4. Transitive Verbs
      5. Quantifier Ambiguity Revisited
    5. Discourse Semantics
      1. Discourse Representation Theory
      2. Discourse Processing
    6. Summary
    7. Further Reading
    8. Exercises
  14. 11. Managing Linguistic Data
    1. Corpus Structure: A Case Study
      1. The Structure of TIMIT
      2. Notable Design Features
      3. Fundamental Data Types
    2. The Life Cycle of a Corpus
      1. Three Corpus Creation Scenarios
      2. Quality Control
      3. Curation Versus Evolution
    3. Acquiring Data
      1. Obtaining Data from the Web
      2. Obtaining Data from Word Processor Files
      3. Obtaining Data from Spreadsheets and Databases
      4. Converting Data Formats
      5. Deciding Which Layers of Annotation to Include
      6. Standards and Tools
      7. Special Considerations When Working with Endangered Languages
    4. Working with XML
      1. Using XML for Linguistic Structures
      2. The Role of XML
      3. The ElementTree Interface
      4. Using ElementTree for Accessing Toolbox Data
      5. Formatting Entries
    5. Working with Toolbox Data
      1. Adding a Field to Each Entry
      2. Validating a Toolbox Lexicon
    6. Describing Language Resources Using OLAC Metadata
      1. What Is Metadata?
      2. OLAC: Open Language Archives Community
    7. Summary
    8. Further Reading
    9. Exercises
  15. A. Afterword: The Language Challenge
    1. Language Processing Versus Symbol Processing
    2. Contemporary Philosophical Divides
    3. NLTK Roadmap
    4. Envoi...
  16. B. Bibliography
  17. NLTK Index
  18. General Index
  19. About the Authors
  20. Colophon


  1. ○ Find out more about sequence objects using Python’s help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example, x.__getitem__(y) is just a long-winded way of saying x[y].
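
     For instance, the following two expressions are equivalent:

      >>> 'Monty'.__getitem__(0)
      'M'
      >>> 'Monty'[0]
      'M'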

  2. ○ Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.

  3. ○ Find out how to create a tuple consisting of a single item. There are at least two ways to do this.

  4. ○ Create a list words = ['is', 'NLP', 'fun', '?']. Use a series of assignment statements (e.g., words[1] = words[2]) and a temporary variable tmp to transform this list into the list ['NLP', 'is', 'fun', '!']. Now do the same transformation using tuple assignment.
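
     One possible solution, first with a temporary variable and then with tuple assignment (note that the final '?' also has to be replaced by '!'):

      >>> words = ['is', 'NLP', 'fun', '?']
      >>> tmp = words[0]
      >>> words[0] = words[1]
      >>> words[1] = tmp
      >>> words[3] = '!'
      >>> words
      ['NLP', 'is', 'fun', '!']
      >>> words = ['is', 'NLP', 'fun', '?']
      >>> words[0], words[1], words[3] = words[1], words[0], '!'
      >>> words
      ['NLP', 'is', 'fun', '!']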

  5. ○ Read about the built-in comparison function cmp, by typing help(cmp). How does it differ in behavior from the comparison operators?

  6. ○ Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: n = 1 and n = len(sent)?
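
     Assuming the sliding window is built with the usual slice idiom from the discussion of sequence types, its behavior at the two extremes is:

      >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
      >>> n = 1
      >>> [sent[i:i+n] for i in range(len(sent)-n+1)]
      [['The'], ['dog'], ['gave'], ['John'], ['the'], ['newspaper']]
      >>> n = len(sent)
      >>> [sent[i:i+n] for i in range(len(sent)-n+1)]
      [['The', 'dog', 'gave', 'John', 'the', 'newspaper']]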

  7. ○ We pointed out that when empty strings and empty lists occur in the condition part of an if clause, they evaluate to False. In this case, they are said to be occurring in a Boolean context. Experiment with different kinds of non-Boolean expressions in Boolean contexts, and see whether they evaluate as True or False.

  8. ○ Use the inequality operators to compare strings, e.g., 'Monty' < 'Python'. What happens when you do 'Z' < 'a'? Try pairs of strings that have a common prefix, e.g., 'Monty' < 'Montague'. Read up on “lexicographical sort” in order to understand what is going on here. Try comparing structured objects, e.g., ('Monty', 1) < ('Monty', 2). Does this behave as expected?

  9. ○ Write code that removes whitespace at the beginning and end of a string, and normalizes whitespace between words to a single space character. (A sketch of both approaches follows this exercise.)

    1. Do this task using split() and join().

    2. Do this task using regular expression substitutions.
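
     A minimal sketch of both parts:

      >>> import re
      >>> s = '  many   spaces and  tabs '
      >>> ' '.join(s.split())
      'many spaces and tabs'
      >>> re.sub(r'\s+', ' ', s.strip())
      'many spaces and tabs'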

  10. ○ Write a program to sort words by length. Define a helper function cmp_len which uses the cmp comparison function on word lengths.
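
     A sketch for Python 2, where cmp() and the cmp argument of sorted() still exist (in Python 3 you would pass key=len instead):

      >>> def cmp_len(word1, word2):
      ...     return cmp(len(word1), len(word2))
      >>> sorted(['dog', 'a', 'elephant', 'to'], cmp_len)
      ['a', 'to', 'dog', 'elephant']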

  11. Create a list of words and store it in a variable sent1. Now assign sent2 = sent1. Modify one of the items in sent1 and verify that sent2 has changed. (A sketch of all three parts follows this exercise.)

    1. Now try the same exercise, but instead assign sent2 = sent1[:]. Modify sent1 again and see what happens to sent2. Explain.

    2. Now define text1 to be a list of lists of strings (e.g., to represent a text consisting of multiple sentences). Assign text2 = text1[:], then assign a new value to one of the words, e.g., text1[1][1] = 'Monty'. Check what this did to text2. Explain.

    3. Load Python’s deepcopy() function (i.e., from copy import deepcopy), consult its documentation, and test that it makes a fresh copy of any object.
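
     The behavior the three parts are probing, in miniature:

      >>> from copy import deepcopy
      >>> sent1 = ['The', 'dog', 'barked']
      >>> sent2 = sent1              # both names refer to the same list
      >>> sent1[0] = 'A'
      >>> sent2
      ['A', 'dog', 'barked']
      >>> text1 = [['The', 'dog'], ['ran', 'home']]
      >>> text2 = text1[:]           # shallow copy: inner lists still shared
      >>> text1[1][0] = 'walked'
      >>> text2
      [['The', 'dog'], ['walked', 'home']]
      >>> text3 = deepcopy(text1)    # copies the inner lists as well
      >>> text1[1][0] = 'crawled'
      >>> text3
      [['The', 'dog'], ['walked', 'home']]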

  12. Initialize an n-by-m list of lists of empty strings using list multiplication, e.g., word_table = [[''] * n] * m. What happens when you set one of its values, e.g., word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.
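
     The aliasing problem, and the range()-based fix, in brief:

      >>> n, m = 3, 2
      >>> word_table = [[''] * n] * m                # m references to one row
      >>> word_table[1][2] = "hello"
      >>> word_table
      [['', '', 'hello'], ['', '', 'hello']]
      >>> word_table = [[''] * n for i in range(m)]  # m distinct rows
      >>> word_table[1][2] = "hello"
      >>> word_table
      [['', '', ''], ['', '', 'hello']]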

  13. Write code to initialize a two-dimensional array of sets called word_vowels and process a list of words, adding each word to word_vowels[l][v] where l is the length of the word and v is the number of vowels it contains.
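
     One possible sketch, sizing the table from the data (the word list here is just for illustration):

      >>> words = ['Alice', 'saw', 'the', 'rabbit']
      >>> size = max(len(w) for w in words) + 1
      >>> word_vowels = [[set() for v in range(size)] for l in range(size)]
      >>> for word in words:
      ...     v = len([c for c in word if c.lower() in 'aeiou'])
      ...     word_vowels[len(word)][v].add(word)
      >>> sorted(word_vowels[3][1])
      ['saw', 'the']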

  14. Write a function novel10(text) that prints any word that appeared in the last 10% of a text that had not been encountered earlier.

  15. Write a program that takes a sentence expressed as a single string, splits it, and counts up the words. Get it to print out each word and the word’s frequency, one per line, in alphabetical order.
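
     A short version using a plain dictionary (nltk.FreqDist would also do the counting):

      >>> def count_words(sentence):
      ...     counts = {}
      ...     for word in sentence.split():
      ...         counts[word] = counts.get(word, 0) + 1
      ...     for word in sorted(counts):
      ...         print word, counts[word]
      >>> count_words('the cat sat on the mat')
      cat 1
      mat 1
      on 1
      sat 1
      the 2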

  16. Read up on Gematria, a method for assigning numbers to words and for mapping between words having the same number, in order to discover the hidden meaning of texts.

    1. Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals (a sketch appears after this exercise):

      >>> letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
      ... 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
      ... 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}
    2. Process a corpus (e.g., nltk.corpus.state_union) and for each document, count how many of its words have the number 666.

    3. Write a function decode() to process a text, randomly replacing words with their Gematria equivalents, in order to discover the “hidden meaning” of the text.
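
     A minimal sketch of part 1, using the letter_vals table above and ignoring any character that is not listed in it:

      >>> def gematria(word):
      ...     return sum(letter_vals.get(c, 0) for c in word.lower())
      >>> gematria('fun')
      136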

  17. Write a function shorten(text, n) to process a text, omitting the n most frequently occurring words of the text. How readable is it?

  18. Write code to print out an index for a lexicon, allowing someone to look up words according to their meanings (or their pronunciations; whatever properties are contained in the lexical entries).

  19. Write a list comprehension that sorts a list of WordNet synsets for proximity to a given synset. For example, given the synsets minke_whale.n.01, orca.n.01, novel.n.01, and tortoise.n.01, sort them according to their shortest_path_distance() from right_whale.n.01.

  20. Write a function that takes a list of words (containing duplicates) and returns a list of words (with no duplicates) sorted by decreasing frequency. E.g., if the input list contained 10 instances of the word table and 9 instances of the word chair, then table would appear before chair in the output list.
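
     One way, using an ordinary dictionary (nltk.FreqDist offers this ready-made):

      >>> def freq_sorted(words):
      ...     counts = {}
      ...     for w in words:
      ...         counts[w] = counts.get(w, 0) + 1
      ...     return sorted(counts, key=counts.get, reverse=True)
      >>> freq_sorted(['chair', 'table', 'table', 'chair', 'table'])
      ['table', 'chair']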

  21. Write a function that takes a text and a vocabulary as its arguments and returns the set of words that appear in the text but not in the vocabulary. Both arguments can be represented as lists of strings. Can you do this in a single line, using set.difference()?
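
     The single-line version might look like this:

      >>> text = ['the', 'porcupine', 'sat', 'on', 'the', 'hedge']
      >>> vocab = ['the', 'sat', 'on']
      >>> sorted(set(text).difference(vocab))
      ['hedge', 'porcupine']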

  22. Import the itemgetter() function from the operator module in Python’s standard library (i.e., from operator import itemgetter). Create a list words containing several words. Now try calling: sorted(words, key=itemgetter(1)), and sorted(words, key=itemgetter(-1)). Explain what itemgetter() is doing.

  23. Write a recursive function lookup(trie, key) that looks up a key in a trie, and returns the value it finds. Extend the function to return a word when it is uniquely determined by its prefix (e.g., vanguard is the only word that starts with vang-, so lookup(trie, 'vang') should return the same thing as lookup(trie, 'vanguard')).
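
     A sketch, assuming the nested-dictionary trie built earlier in the chapter, where a complete word stores its entry under the key 'value':

      >>> def lookup(trie, key):
      ...     if key:
      ...         return lookup(trie[key[0]], key[1:])       # consume the prefix
      ...     elif 'value' in trie:
      ...         return trie['value']                       # a word ends here
      ...     elif len(trie) == 1:
      ...         return lookup(list(trie.values())[0], '')  # unique continuation
      ...     else:
      ...         raise KeyError('prefix is not unique')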

  24. Read up on “keyword linkage” (Chapter 5 of (Scott & Tribble, 2006)). Extract keywords from NLTK’s Shakespeare Corpus and, using the NetworkX package, plot keyword linkage networks.

  25. Read about string edit distance and the Levenshtein Algorithm. Try the implementation provided in nltk.edit_distance(). In what way is this using dynamic programming? Does it use the bottom-up or top-down approach?

  26. The Catalan numbers arise in many applications of combinatorial mathematics, including the counting of parse trees (Grammar Development). The series can be defined as follows: C(0) = 1, and C(n+1) = Σ_{i=0..n} C(i)C(n-i). (A sketch of the first two parts follows this exercise.)

    1. Write a recursive function to compute the nth Catalan number C(n).

    2. Now write another function that does this computation using dynamic programming.

    3. Use the timeit module to compare the performance of these functions as n increases.
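
     A sketch of the recursive and dynamic-programming versions (for part 3, wrap each call in timeit.Timer and compare timings as n grows):

      >>> def catalan_recursive(n):
      ...     if n == 0:
      ...         return 1
      ...     return sum(catalan_recursive(i) * catalan_recursive(n-1-i) for i in range(n))
      >>> def catalan_dp(n):
      ...     C = [1]                # C[0]
      ...     for m in range(1, n+1):
      ...         C.append(sum(C[i] * C[m-1-i] for i in range(m)))
      ...     return C[n]
      >>> [catalan_dp(n) for n in range(6)]
      [1, 1, 2, 5, 14, 42]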

  27. ● Reproduce some of the results of (Zhao & Zobel, 2007) concerning authorship identification.

  28. ● Study gender-specific lexical choice, and see if you can reproduce some of the published results on this topic.

  29. ● Write a recursive function that pretty prints a trie in alphabetically sorted order, for example:

    chair: 'flesh'
    ---t: 'cat'
    --ic: 'stylish'
    ---en: 'dog'
  30. ● With the help of the trie data structure, write a recursive function that processes text, locating the uniqueness point in each word, and discarding the remainder of each word. How much compression does this give? How readable is the resulting text?

  31. ● Obtain some raw text, in the form of a single, long string. Use Python’s textwrap module to break it up into multiple lines. Now write code to add extra spaces between words, in order to justify the output. Each line must have the same width, and spaces must be approximately evenly distributed across each line. No line can begin or end with a space.

  32. ● Develop a simple extractive summarization tool that prints the sentences of a document which contain the highest total word frequency. Use FreqDist() to count word frequencies, and use sum to sum the frequencies of the words in each sentence. Rank the sentences according to their score. Finally, print the n highest-scoring sentences in document order. Carefully review the design of your program, especially your approach to this double sorting. Make sure the program is written as clearly as possible. (One possible shape is sketched below.)
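
     One shape the program might take, assuming the document arrives as a list of tokenized sentences:

      >>> import nltk
      >>> def summarize(sentences, n):
      ...     words = [w.lower() for sent in sentences for w in sent]
      ...     fd = nltk.FreqDist(words)
      ...     scores = [(sum(fd[w.lower()] for w in sent), i)
      ...               for i, sent in enumerate(sentences)]
      ...     best = sorted(scores, reverse=True)[:n]            # first sort: by score
      ...     for score, i in sorted(best, key=lambda p: p[1]):  # second: by position
      ...         print ' '.join(sentences[i])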

  33. ● Read an article on the semantic orientation of adjectives. Use the NetworkX package to visualize a network of adjectives with edges to indicate same versus different semantic orientation.

  34. ● Design an algorithm to find the “statistically improbable phrases” of a document collection.

  35. ● Write a program to implement a brute-force algorithm for discovering word squares, a kind of n × n crossword in which the entry in the nth row is the same as the entry in the nth column.
