You are previewing Natural Language Processing with Python.

Natural Language Processing with Python

Cover of Natural Language Processing with Python by Ewan Klein... Published by O'Reilly Media, Inc.
  1. Natural Language Processing with Python
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Audience
    2. Emphasis
    3. What You Will Learn
    4. Organization
    5. Why Python?
    6. Software Requirements
    7. Natural Language Toolkit (NLTK)
    8. For Instructors
    9. Conventions Used in This Book
    10. Using Code Examples
    11. Safari® Books Online
    12. How to Contact Us
    13. Acknowledgments
    14. Royalties
  4. 1. Language Processing and Python
    1. Computing with Language: Texts and Words
      1. Getting Started with Python
      2. Getting Started with NLTK
      3. Searching Text
      4. Counting Vocabulary
    2. A Closer Look at Python: Texts as Lists of Words
      1. Lists
      2. Indexing Lists
      3. Variables
      4. Strings
    3. Computing with Language: Simple Statistics
      1. Frequency Distributions
      2. Fine-Grained Selection of Words
      3. Collocations and Bigrams
      4. Counting Other Things
    4. Back to Python: Making Decisions and Taking Control
      1. Conditionals
      2. Operating on Every Element
      3. Nested Code Blocks
      4. Looping with Conditions
    5. Automatic Natural Language Understanding
      1. Word Sense Disambiguation
      2. Pronoun Resolution
      3. Generating Language Output
      4. Machine Translation
      5. Spoken Dialogue Systems
      6. Textual Entailment
      7. Limitations of NLP
    6. Summary
    7. Further Reading
    8. Exercises
  5. 2. Accessing Text Corpora and Lexical Resources
    1. Accessing Text Corpora
      1. Gutenberg Corpus
      2. Web and Chat Text
      3. Brown Corpus
      4. Reuters Corpus
      5. Inaugural Address Corpus
      6. Annotated Text Corpora
      7. Corpora in Other Languages
      8. Text Corpus Structure
      9. Loading Your Own Corpus
    2. Conditional Frequency Distributions
      1. Conditions and Events
      2. Counting Words by Genre
      3. Plotting and Tabulating Distributions
      4. Generating Random Text with Bigrams
    3. More Python: Reusing Code
      1. Creating Programs with a Text Editor
      2. Functions
      3. Modules
    4. Lexical Resources
      1. Wordlist Corpora
      2. A Pronouncing Dictionary
      3. Comparative Wordlists
      4. Shoebox and Toolbox Lexicons
    5. WordNet
      1. Senses and Synonyms
      2. The WordNet Hierarchy
      3. More Lexical Relations
      4. Semantic Similarity
    6. Summary
    7. Further Reading
    8. Exercises
  6. 3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
      1. Electronic Books
      2. Dealing with HTML
      3. Processing Search Engine Results
      4. Processing RSS Feeds
      5. Reading Local Files
      6. Extracting Text from PDF, MSWord, and Other Binary Formats
      7. Capturing User Input
      8. The NLP Pipeline
    2. Strings: Text Processing at the Lowest Level
      1. Basic Operations with Strings
      2. Printing Strings
      3. Accessing Individual Characters
      4. Accessing Substrings
      5. More Operations on Strings
      6. The Difference Between Lists and Strings
    3. Text Processing with Unicode
      1. What Is Unicode?
      2. Extracting Encoded Text from Files
      3. Using Your Local Encoding in Python
    4. Regular Expressions for Detecting Word Patterns
      1. Using Basic Metacharacters
      2. Ranges and Closures
    5. Useful Applications of Regular Expressions
      1. Extracting Word Pieces
      2. Doing More with Word Pieces
      3. Finding Word Stems
      4. Searching Tokenized Text
    6. Normalizing Text
      1. Stemmers
      2. Lemmatization
    7. Regular Expressions for Tokenizing Text
      1. Simple Approaches to Tokenization
      2. NLTK’s Regular Expression Tokenizer
      3. Further Issues with Tokenization
    8. Segmentation
      1. Sentence Segmentation
      2. Word Segmentation
    9. Formatting: From Lists to Strings
      1. From Lists to Strings
      2. Strings and Formats
      3. Lining Things Up
      4. Writing Results to a File
      5. Text Wrapping
    10. Summary
    11. Further Reading
    12. Exercises
  7. 4. Writing Structured Programs
    1. Back to the Basics
      1. Assignment
      2. Equality
      3. Conditionals
    2. Sequences
      1. Operating on Sequence Types
      2. Combining Different Sequence Types
      3. Generator Expressions
    3. Questions of Style
      1. Python Coding Style
      2. Procedural Versus Declarative Style
      3. Some Legitimate Uses for Counters
    4. Functions: The Foundation of Structured Programming
      1. Function Inputs and Outputs
      2. Parameter Passing
      3. Variable Scope
      4. Checking Parameter Types
      5. Functional Decomposition
      6. Documenting Functions
    5. Doing More with Functions
      1. Functions As Arguments
      2. Accumulative Functions
      3. Higher-Order Functions
      4. Named Arguments
    6. Program Development
      1. Structure of a Python Module
      2. Multimodule Programs
      3. Sources of Error
      4. Debugging Techniques
      5. Defensive Programming
    7. Algorithm Design
      1. Recursion
      2. Space-Time Trade-offs
      3. Dynamic Programming
    8. A Sample of Python Libraries
      1. Matplotlib
      2. NetworkX
      3. csv
      4. NumPy
      5. Other Python Libraries
    9. Summary
    10. Further Reading
    11. Exercises
  8. 5. Categorizing and Tagging Words
    1. Using a Tagger
    2. Tagged Corpora
      1. Representing Tagged Tokens
      2. Reading Tagged Corpora
      3. A Simplified Part-of-Speech Tagset
      4. Nouns
      5. Verbs
      6. Adjectives and Adverbs
      7. Unsimplified Tags
      8. Exploring Tagged Corpora
    3. Mapping Words to Properties Using Python Dictionaries
      1. Indexing Lists Versus Dictionaries
      2. Dictionaries in Python
      3. Defining Dictionaries
      4. Default Dictionaries
      5. Incrementally Updating a Dictionary
      6. Complex Keys and Values
      7. Inverting a Dictionary
    4. Automatic Tagging
      1. The Default Tagger
      2. The Regular Expression Tagger
      3. The Lookup Tagger
      4. Evaluation
    5. N-Gram Tagging
      1. Unigram Tagging
      2. Separating the Training and Testing Data
      3. General N-Gram Tagging
      4. Combining Taggers
      5. Tagging Unknown Words
      6. Storing Taggers
      7. Performance Limitations
      8. Tagging Across Sentence Boundaries
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
      1. Morphological Clues
      2. Syntactic Clues
      3. Semantic Clues
      4. New Words
      5. Morphology in Part-of-Speech Tagsets
    8. Summary
    9. Further Reading
    10. Exercises
  9. 6. Learning to Classify Text
    1. Supervised Classification
      1. Gender Identification
      2. Choosing the Right Features
      3. Document Classification
      4. Part-of-Speech Tagging
      5. Exploiting Context
      6. Sequence Classification
      7. Other Methods for Sequence Classification
    2. Further Examples of Supervised Classification
      1. Sentence Segmentation
      2. Identifying Dialogue Act Types
      3. Recognizing Textual Entailment
      4. Scaling Up to Large Datasets
    3. Evaluation
      1. The Test Set
      2. Accuracy
      3. Precision and Recall
      4. Confusion Matrices
      5. Cross-Validation
    4. Decision Trees
      1. Entropy and Information Gain
    5. Naive Bayes Classifiers
      1. Underlying Probabilistic Model
      2. Zero Counts and Smoothing
      3. Non-Binary Features
      4. The Naivete of Independence
      5. The Cause of Double-Counting
    6. Maximum Entropy Classifiers
      1. The Maximum Entropy Model
      2. Maximizing Entropy
      3. Generative Versus Conditional Classifiers
    7. Modeling Linguistic Patterns
      1. What Do Models Tell Us?
    8. Summary
    9. Further Reading
    10. Exercises
  10. 7. Extracting Information from Text
    1. Information Extraction
      1. Information Extraction Architecture
    2. Chunking
      1. Noun Phrase Chunking
      2. Tag Patterns
      3. Chunking with Regular Expressions
      4. Exploring Text Corpora
      5. Chinking
      6. Representing Chunks: Tags Versus Trees
    3. Developing and Evaluating Chunkers
      1. Reading IOB Format and the CoNLL-2000 Chunking Corpus
      2. Simple Evaluation and Baselines
      3. Training Classifier-Based Chunkers
    4. Recursion in Linguistic Structure
      1. Building Nested Structure with Cascaded Chunkers
      2. Trees
      3. Tree Traversal
    5. Named Entity Recognition
    6. Relation Extraction
    7. Summary
    8. Further Reading
    9. Exercises
  11. 8. Analyzing Sentence Structure
    1. Some Grammatical Dilemmas
      1. Linguistic Data and Unlimited Possibilities
      2. Ubiquitous Ambiguity
    2. What’s the Use of Syntax?
      1. Beyond n-grams
    3. Context-Free Grammar
      1. A Simple Grammar
      2. Writing Your Own Grammars
      3. Recursion in Syntactic Structure
    4. Parsing with Context-Free Grammar
      1. Recursive Descent Parsing
      2. Shift-Reduce Parsing
      3. The Left-Corner Parser
      4. Well-Formed Substring Tables
    5. Dependencies and Dependency Grammar
      1. Valency and the Lexicon
      2. Scaling Up
    6. Grammar Development
      1. Treebanks and Grammars
      2. Pernicious Ambiguity
      3. Weighted Grammar
    7. Summary
    8. Further Reading
    9. Exercises
  12. 9. Building Feature-Based Grammars
    1. Grammatical Features
      1. Syntactic Agreement
      2. Using Attributes and Constraints
      3. Terminology
    2. Processing Feature Structures
      1. Subsumption and Unification
    3. Extending a Feature-Based Grammar
      1. Subcategorization
      2. Heads Revisited
      3. Auxiliary Verbs and Inversion
      4. Unbounded Dependency Constructions
      5. Case and Gender in German
    4. Summary
    5. Further Reading
    6. Exercises
  13. 10. Analyzing the Meaning of Sentences
    1. Natural Language Understanding
      1. Querying a Database
      2. Natural Language, Semantics, and Logic
    2. Propositional Logic
    3. First-Order Logic
      1. Syntax
      2. First-Order Theorem Proving
      3. Summarizing the Language of First-Order Logic
      4. Truth in Model
      5. Individual Variables and Assignments
      6. Quantification
      7. Quantifier Scope Ambiguity
      8. Model Building
    4. The Semantics of English Sentences
      1. Compositional Semantics in Feature-Based Grammar
      2. The λ-Calculus
      3. Quantified NPs
      4. Transitive Verbs
      5. Quantifier Ambiguity Revisited
    5. Discourse Semantics
      1. Discourse Representation Theory
      2. Discourse Processing
    6. Summary
    7. Further Reading
    8. Exercises
  14. 11. Managing Linguistic Data
    1. Corpus Structure: A Case Study
      1. The Structure of TIMIT
      2. Notable Design Features
      3. Fundamental Data Types
    2. The Life Cycle of a Corpus
      1. Three Corpus Creation Scenarios
      2. Quality Control
      3. Curation Versus Evolution
    3. Acquiring Data
      1. Obtaining Data from the Web
      2. Obtaining Data from Word Processor Files
      3. Obtaining Data from Spreadsheets and Databases
      4. Converting Data Formats
      5. Deciding Which Layers of Annotation to Include
      6. Standards and Tools
      7. Special Considerations When Working with Endangered Languages
    4. Working with XML
      1. Using XML for Linguistic Structures
      2. The Role of XML
      3. The ElementTree Interface
      4. Using ElementTree for Accessing Toolbox Data
      5. Formatting Entries
    5. Working with Toolbox Data
      1. Adding a Field to Each Entry
      2. Validating a Toolbox Lexicon
    6. Describing Language Resources Using OLAC Metadata
      1. What Is Metadata?
      2. OLAC: Open Language Archives Community
    7. Summary
    8. Further Reading
    9. Exercises
  15. A. Afterword: The Language Challenge
    1. Language Processing Versus Symbol Processing
    2. Contemporary Philosophical Divides
    3. NLTK Roadmap
    4. Envoi...
  16. B. Bibliography
  17. NLTK Index
  18. General Index
  19. About the Authors
  20. Colophon
  21. SPECIAL OFFER: Upgrade this ebook with O’Reilly


  1. ○ Try using the Python interpreter as a calculator, and typing expressions like 12 / (4 + 1).

  2. ○ Given an alphabet of 26 letters, there are 26 to the power 10, or 26 ** 10, 10-letter strings we can form. That works out to 141167095653376L (the L at the end just indicates that this is Python’s long-number format). How many hundred-letter strings are possible?

  3. ○ The Python multiplication operation can be applied to lists. What happens when you type ['Monty', 'Python'] * 20, or 3 * sent1?

  4. ○ Review Computing with Language: Texts and Words on computing with language. How many words are there in text2? How many distinct words are there?

  5. ○ Compare the lexical diversity scores for humor and romance fiction in Table 1-1. Which genre is more lexically diverse?

  6. ○ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, and Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?

  7. ○ Find the collocations in text5.

  8. ○ Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.

  9. ○ Review A Closer Look at Python: Texts as Lists of Words on lists and strings.

    1. Define a string and assign it to a variable, e.g., my_string = 'My String' (but put something more interesting in the string). Print the contents of this variable in two ways, first by simply typing the variable name and pressing Enter, then by using the print statement.

    2. Try adding the string to itself using my_string + my_string, or multiplying it by a number, e.g., my_string * 3. Notice that the strings are joined together without any spaces. How could you fix this?

  10. ○ Define a variable my_sent to be a list of words, using the syntax my_sent = ["My", "sent"] (but with your own words, or a favorite saying).

    1. Use ' '.join(my_sent) to convert this into a string.

    2. Use split() to split the string back into the list form you had to start with.

  11. ○ Define several variables containing lists of words, e.g., phrase1, phrase2, and so on. Join them together in various combinations (using the plus operator) to form whole sentences. What is the relationship between len(phrase1 + phrase2) and len(phrase1) + len(phrase2)?

  12. ○ Consider the following two expressions, which have the same value. Which one will typically be more relevant in NLP? Why?

    1. "Monty Python"[6:12]

    2. ["Monty", "Python"][1]

  13. ○ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does sent1[2][2] do? Why? Experiment with other index values.

  14. ○ The first sentence of text3 is provided to you in the variable sent3. The index of the in sent3 is 1, because sent3[1] gives us 'the'. What are the indexes of the two other occurrences of this word in sent3?

  15. ○ Review the discussion of conditionals in Back to Python: Making Decisions and Taking Control. Find all words in the Chat Corpus (text5) starting with the letter b. Show them in alphabetical order.

  16. ○ Type the expression range(10) at the interpreter prompt. Now try range(10, 20), range(10, 20, 2), and range(20, 10, -2). We will see a variety of uses for this built-in function in later chapters.

  17. Use text9.index() to find the index of the word sunset. You’ll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.

  18. Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ... sent8.

  19. What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?

    >>> sorted(set([w.lower() for w in text1]))
    >>> sorted([w.lower() for w in set(text1)])
  20. What is the difference between the following two tests: w.isupper() and not w.islower()?

  21. Write the slice expression that extracts the last two words of text2.

  22. Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.

  23. Review the discussion of looping with conditions in Back to Python: Making Decisions and Taking Control. Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.

  24. Write expressions for finding all words in text6 that meet the following conditions. The result should be in the form of a list of words: ['word1', 'word2', ...].

    1. Ending in ize

    2. Containing the letter z

    3. Containing the sequence of letters pt

    4. All lowercase letters except for an initial capital (i.e., titlecase)

  25. Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Now write code to perform the following tasks:

    1. Print all words beginning with sh.

    2. Print all words longer than four characters

  26. What does the following Python code do? sum([len(w) for w in text1]) Can you use it to work out the average word length of a text?

  27. Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.

  28. Define a function percent(word, text) that calculates how often a given word occurs in a text and expresses the result as a percentage.

  29. We have been using sets to store vocabularies. Try the following Python expression: set(sent3) < set(text1). Experiment with this using different arguments to set(). What does it do? Can you think of a practical application for this?

The best content for your career. Discover unlimited learning on demand for around $1/day.