You are previewing Natural Language Processing with Python.

Natural Language Processing with Python

Cover of Natural Language Processing with Python by Ewan Klein... Published by O'Reilly Media, Inc.
  1. Natural Language Processing with Python
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. Preface
    1. Audience
    2. Emphasis
    3. What You Will Learn
    4. Organization
    5. Why Python?
    6. Software Requirements
    7. Natural Language Toolkit (NLTK)
    8. For Instructors
    9. Conventions Used in This Book
    10. Using Code Examples
    11. Safari® Books Online
    12. How to Contact Us
    13. Acknowledgments
    14. Royalties
  4. 1. Language Processing and Python
    1. Computing with Language: Texts and Words
      1. Getting Started with Python
      2. Getting Started with NLTK
      3. Searching Text
      4. Counting Vocabulary
    2. A Closer Look at Python: Texts as Lists of Words
      1. Lists
      2. Indexing Lists
      3. Variables
      4. Strings
    3. Computing with Language: Simple Statistics
      1. Frequency Distributions
      2. Fine-Grained Selection of Words
      3. Collocations and Bigrams
      4. Counting Other Things
    4. Back to Python: Making Decisions and Taking Control
      1. Conditionals
      2. Operating on Every Element
      3. Nested Code Blocks
      4. Looping with Conditions
    5. Automatic Natural Language Understanding
      1. Word Sense Disambiguation
      2. Pronoun Resolution
      3. Generating Language Output
      4. Machine Translation
      5. Spoken Dialogue Systems
      6. Textual Entailment
      7. Limitations of NLP
    6. Summary
    7. Further Reading
    8. Exercises
  5. 2. Accessing Text Corpora and Lexical Resources
    1. Accessing Text Corpora
      1. Gutenberg Corpus
      2. Web and Chat Text
      3. Brown Corpus
      4. Reuters Corpus
      5. Inaugural Address Corpus
      6. Annotated Text Corpora
      7. Corpora in Other Languages
      8. Text Corpus Structure
      9. Loading Your Own Corpus
    2. Conditional Frequency Distributions
      1. Conditions and Events
      2. Counting Words by Genre
      3. Plotting and Tabulating Distributions
      4. Generating Random Text with Bigrams
    3. More Python: Reusing Code
      1. Creating Programs with a Text Editor
      2. Functions
      3. Modules
    4. Lexical Resources
      1. Wordlist Corpora
      2. A Pronouncing Dictionary
      3. Comparative Wordlists
      4. Shoebox and Toolbox Lexicons
    5. WordNet
      1. Senses and Synonyms
      2. The WordNet Hierarchy
      3. More Lexical Relations
      4. Semantic Similarity
    6. Summary
    7. Further Reading
    8. Exercises
  6. 3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
      1. Electronic Books
      2. Dealing with HTML
      3. Processing Search Engine Results
      4. Processing RSS Feeds
      5. Reading Local Files
      6. Extracting Text from PDF, MSWord, and Other Binary Formats
      7. Capturing User Input
      8. The NLP Pipeline
    2. Strings: Text Processing at the Lowest Level
      1. Basic Operations with Strings
      2. Printing Strings
      3. Accessing Individual Characters
      4. Accessing Substrings
      5. More Operations on Strings
      6. The Difference Between Lists and Strings
    3. Text Processing with Unicode
      1. What Is Unicode?
      2. Extracting Encoded Text from Files
      3. Using Your Local Encoding in Python
    4. Regular Expressions for Detecting Word Patterns
      1. Using Basic Metacharacters
      2. Ranges and Closures
    5. Useful Applications of Regular Expressions
      1. Extracting Word Pieces
      2. Doing More with Word Pieces
      3. Finding Word Stems
      4. Searching Tokenized Text
    6. Normalizing Text
      1. Stemmers
      2. Lemmatization
    7. Regular Expressions for Tokenizing Text
      1. Simple Approaches to Tokenization
      2. NLTK’s Regular Expression Tokenizer
      3. Further Issues with Tokenization
    8. Segmentation
      1. Sentence Segmentation
      2. Word Segmentation
    9. Formatting: From Lists to Strings
      1. From Lists to Strings
      2. Strings and Formats
      3. Lining Things Up
      4. Writing Results to a File
      5. Text Wrapping
    10. Summary
    11. Further Reading
    12. Exercises
  7. 4. Writing Structured Programs
    1. Back to the Basics
      1. Assignment
      2. Equality
      3. Conditionals
    2. Sequences
      1. Operating on Sequence Types
      2. Combining Different Sequence Types
      3. Generator Expressions
    3. Questions of Style
      1. Python Coding Style
      2. Procedural Versus Declarative Style
      3. Some Legitimate Uses for Counters
    4. Functions: The Foundation of Structured Programming
      1. Function Inputs and Outputs
      2. Parameter Passing
      3. Variable Scope
      4. Checking Parameter Types
      5. Functional Decomposition
      6. Documenting Functions
    5. Doing More with Functions
      1. Functions As Arguments
      2. Accumulative Functions
      3. Higher-Order Functions
      4. Named Arguments
    6. Program Development
      1. Structure of a Python Module
      2. Multimodule Programs
      3. Sources of Error
      4. Debugging Techniques
      5. Defensive Programming
    7. Algorithm Design
      1. Recursion
      2. Space-Time Trade-offs
      3. Dynamic Programming
    8. A Sample of Python Libraries
      1. Matplotlib
      2. NetworkX
      3. csv
      4. NumPy
      5. Other Python Libraries
    9. Summary
    10. Further Reading
    11. Exercises
  8. 5. Categorizing and Tagging Words
    1. Using a Tagger
    2. Tagged Corpora
      1. Representing Tagged Tokens
      2. Reading Tagged Corpora
      3. A Simplified Part-of-Speech Tagset
      4. Nouns
      5. Verbs
      6. Adjectives and Adverbs
      7. Unsimplified Tags
      8. Exploring Tagged Corpora
    3. Mapping Words to Properties Using Python Dictionaries
      1. Indexing Lists Versus Dictionaries
      2. Dictionaries in Python
      3. Defining Dictionaries
      4. Default Dictionaries
      5. Incrementally Updating a Dictionary
      6. Complex Keys and Values
      7. Inverting a Dictionary
    4. Automatic Tagging
      1. The Default Tagger
      2. The Regular Expression Tagger
      3. The Lookup Tagger
      4. Evaluation
    5. N-Gram Tagging
      1. Unigram Tagging
      2. Separating the Training and Testing Data
      3. General N-Gram Tagging
      4. Combining Taggers
      5. Tagging Unknown Words
      6. Storing Taggers
      7. Performance Limitations
      8. Tagging Across Sentence Boundaries
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
      1. Morphological Clues
      2. Syntactic Clues
      3. Semantic Clues
      4. New Words
      5. Morphology in Part-of-Speech Tagsets
    8. Summary
    9. Further Reading
    10. Exercises
  9. 6. Learning to Classify Text
    1. Supervised Classification
      1. Gender Identification
      2. Choosing the Right Features
      3. Document Classification
      4. Part-of-Speech Tagging
      5. Exploiting Context
      6. Sequence Classification
      7. Other Methods for Sequence Classification
    2. Further Examples of Supervised Classification
      1. Sentence Segmentation
      2. Identifying Dialogue Act Types
      3. Recognizing Textual Entailment
      4. Scaling Up to Large Datasets
    3. Evaluation
      1. The Test Set
      2. Accuracy
      3. Precision and Recall
      4. Confusion Matrices
      5. Cross-Validation
    4. Decision Trees
      1. Entropy and Information Gain
    5. Naive Bayes Classifiers
      1. Underlying Probabilistic Model
      2. Zero Counts and Smoothing
      3. Non-Binary Features
      4. The Naivete of Independence
      5. The Cause of Double-Counting
    6. Maximum Entropy Classifiers
      1. The Maximum Entropy Model
      2. Maximizing Entropy
      3. Generative Versus Conditional Classifiers
    7. Modeling Linguistic Patterns
      1. What Do Models Tell Us?
    8. Summary
    9. Further Reading
    10. Exercises
  10. 7. Extracting Information from Text
    1. Information Extraction
      1. Information Extraction Architecture
    2. Chunking
      1. Noun Phrase Chunking
      2. Tag Patterns
      3. Chunking with Regular Expressions
      4. Exploring Text Corpora
      5. Chinking
      6. Representing Chunks: Tags Versus Trees
    3. Developing and Evaluating Chunkers
      1. Reading IOB Format and the CoNLL-2000 Chunking Corpus
      2. Simple Evaluation and Baselines
      3. Training Classifier-Based Chunkers
    4. Recursion in Linguistic Structure
      1. Building Nested Structure with Cascaded Chunkers
      2. Trees
      3. Tree Traversal
    5. Named Entity Recognition
    6. Relation Extraction
    7. Summary
    8. Further Reading
    9. Exercises
  11. 8. Analyzing Sentence Structure
    1. Some Grammatical Dilemmas
      1. Linguistic Data and Unlimited Possibilities
      2. Ubiquitous Ambiguity
    2. What’s the Use of Syntax?
      1. Beyond n-grams
    3. Context-Free Grammar
      1. A Simple Grammar
      2. Writing Your Own Grammars
      3. Recursion in Syntactic Structure
    4. Parsing with Context-Free Grammar
      1. Recursive Descent Parsing
      2. Shift-Reduce Parsing
      3. The Left-Corner Parser
      4. Well-Formed Substring Tables
    5. Dependencies and Dependency Grammar
      1. Valency and the Lexicon
      2. Scaling Up
    6. Grammar Development
      1. Treebanks and Grammars
      2. Pernicious Ambiguity
      3. Weighted Grammar
    7. Summary
    8. Further Reading
    9. Exercises
  12. 9. Building Feature-Based Grammars
    1. Grammatical Features
      1. Syntactic Agreement
      2. Using Attributes and Constraints
      3. Terminology
    2. Processing Feature Structures
      1. Subsumption and Unification
    3. Extending a Feature-Based Grammar
      1. Subcategorization
      2. Heads Revisited
      3. Auxiliary Verbs and Inversion
      4. Unbounded Dependency Constructions
      5. Case and Gender in German
    4. Summary
    5. Further Reading
    6. Exercises
  13. 10. Analyzing the Meaning of Sentences
    1. Natural Language Understanding
      1. Querying a Database
      2. Natural Language, Semantics, and Logic
    2. Propositional Logic
    3. First-Order Logic
      1. Syntax
      2. First-Order Theorem Proving
      3. Summarizing the Language of First-Order Logic
      4. Truth in Model
      5. Individual Variables and Assignments
      6. Quantification
      7. Quantifier Scope Ambiguity
      8. Model Building
    4. The Semantics of English Sentences
      1. Compositional Semantics in Feature-Based Grammar
      2. The λ-Calculus
      3. Quantified NPs
      4. Transitive Verbs
      5. Quantifier Ambiguity Revisited
    5. Discourse Semantics
      1. Discourse Representation Theory
      2. Discourse Processing
    6. Summary
    7. Further Reading
    8. Exercises
  14. 11. Managing Linguistic Data
    1. Corpus Structure: A Case Study
      1. The Structure of TIMIT
      2. Notable Design Features
      3. Fundamental Data Types
    2. The Life Cycle of a Corpus
      1. Three Corpus Creation Scenarios
      2. Quality Control
      3. Curation Versus Evolution
    3. Acquiring Data
      1. Obtaining Data from the Web
      2. Obtaining Data from Word Processor Files
      3. Obtaining Data from Spreadsheets and Databases
      4. Converting Data Formats
      5. Deciding Which Layers of Annotation to Include
      6. Standards and Tools
      7. Special Considerations When Working with Endangered Languages
    4. Working with XML
      1. Using XML for Linguistic Structures
      2. The Role of XML
      3. The ElementTree Interface
      4. Using ElementTree for Accessing Toolbox Data
      5. Formatting Entries
    5. Working with Toolbox Data
      1. Adding a Field to Each Entry
      2. Validating a Toolbox Lexicon
    6. Describing Language Resources Using OLAC Metadata
      1. What Is Metadata?
      2. OLAC: Open Language Archives Community
    7. Summary
    8. Further Reading
    9. Exercises
  15. A. Afterword: The Language Challenge
    1. Language Processing Versus Symbol Processing
    2. Contemporary Philosophical Divides
    3. NLTK Roadmap
    4. Envoi...
  16. B. Bibliography
  17. NLTK Index
  18. General Index
  19. About the Authors
  20. Colophon
  21. SPECIAL OFFER: Upgrade this ebook with O’Reilly
O'Reilly logo

Questions of Style

Programming is as much an art as a science. The undisputed “bible” of programming, a 2,500 page multivolume work by Donald Knuth, is called The Art of Computer Programming. Many books have been written on Literate Programming, recognizing that humans, not just computers, must read and understand programs. Here we pick up on some issues of programming style that have important ramifications for the readability of your code, including code layout, procedural versus declarative style, and the use of loop variables.

Python Coding Style

When writing programs you make many subtle choices about names, spacing, comments, and so on. When you look at code written by other people, needless differences in style make it harder to interpret the code. Therefore, the designers of the Python language have published a style guide for Python code, available at http://www.python.org/dev/peps/pep-0008/. The underlying value presented in the style guide is consistency, for the purpose of maximizing the readability of code. We briefly review some of its key recommendations here, and refer readers to the full guide for detailed discussion with examples.

Code layout should use four spaces per indentation level. You should make sure that when you write Python code in a file, you avoid tabs for indentation, since these can be misinterpreted by different text editors and the indentation can be messed up. Lines should be less than 80 characters long; if necessary, you can break a line inside parentheses, brackets, or braces, because Python is able to detect that the line continues over to the next line, as in the following examples:

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall('[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> ha_words = ['aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha',
...             'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'ha',
...             'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha']

If you need to break a line outside parentheses, brackets, or braces, you can often add extra parentheses, and you can always add a backslash at the end of the line that is broken:

>>> if (len(syllables) > 4 and len(syllables[2]) == 3 and
...    syllables[2][2] in [aeiou] and syllables[2][3] == syllables[1][3]):
...     process(syllables)
>>> if len(syllables) > 4 and len(syllables[2]) == 3 and \
...    syllables[2][2] in [aeiou] and syllables[2][3] == syllables[1][3]:
...     process(syllables)

Note

Typing spaces instead of tabs soon becomes a chore. Many programming editors have built-in support for Python, and can automatically indent code and highlight any syntax errors (including indentation errors). For a list of Python-aware editors, please see http://wiki.python.org/moin/PythonEditors.

Procedural Versus Declarative Style

We have just seen how the same task can be performed in different ways, with implications for efficiency. Another factor influencing program development is programming style. Consider the following program to compute the average length of words in the Brown Corpus:

>>> tokens = nltk.corpus.brown.words(categories='news')
>>> count = 0
>>> total = 0
>>> for token in tokens:
...     count += 1
...     total += len(token)
>>> print total / count
4.2765382469

In this program we use the variable count to keep track of the number of tokens seen, and total to store the combined length of all words. This is a low-level style, not far removed from machine code, the primitive operations performed by the computer’s CPU. The two variables are just like a CPU’s registers, accumulating values at many intermediate stages, values that are meaningless until the end. We say that this program is written in a procedural style, dictating the machine operations step by step. Now consider the following program that computes the same thing:

>>> total = sum(len(t) for t in tokens)
>>> print total / len(tokens)
4.2765382469

The first line uses a generator expression to sum the token lengths, while the second line computes the average as before. Each line of code performs a complete, meaningful task, which can be understood in terms of high-level properties like: “total is the sum of the lengths of the tokens.” Implementation details are left to the Python interpreter. The second program uses a built-in function, and constitutes programming at a more abstract level; the resulting code is more declarative. Let’s look at an extreme example:

>>> word_list = []
>>> len_word_list = len(word_list)
>>> i = 0
>>> while i < len(tokens):
...     j = 0
...     while j < len_word_list and word_list[j] < tokens[i]:
...         j += 1
...     if j == 0 or tokens[i] != word_list[j]:
...         word_list.insert(j, tokens[i])
...         len_word_list += 1
...     i += 1

The equivalent declarative version uses familiar built-in functions, and its purpose is instantly recognizable:

>>> word_list = sorted(set(tokens))

Another case where a loop counter seems to be necessary is for printing a counter with each line of output. Instead, we can use enumerate(), which processes a sequence s and produces a tuple of the form (i, s[i]) for each item in s, starting with (0, s[0]). Here we enumerate the keys of the frequency distribution, and capture the integer-string pair in the variables rank and word. We print rank+1 so that the counting appears to start from 1, as required when producing a list of ranked items.

>>> fd = nltk.FreqDist(nltk.corpus.brown.words())
>>> cumulative = 0.0
>>> for rank, word in enumerate(fd):
...     cumulative += fd[word] * 100 / fd.N()
...     print "%3d %6.2f%% %s" % (rank+1, cumulative, word)
...     if cumulative > 25:
...         break
...
  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in

It’s sometimes tempting to use loop variables to store a maximum or minimum value seen so far. Let’s use this method to find the longest word in a text.

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> longest = ''
>>> for word in text:
...     if len(word) > len(longest):
...         longest = word
>>> longest
'unextinguishable'

However, a more transparent solution uses two list comprehensions, both having forms that should be familiar by now:

>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']

Note that our first solution found the first word having the longest length, while the second solution found all of the longest words (which is usually what we would want). Although there’s a theoretical efficiency difference between the two solutions, the main overhead is reading the data into main memory; once it’s there, a second pass through the data is effectively instantaneous. We also need to balance our concerns about program efficiency with programmer efficiency. A fast but cryptic solution will be harder to understand and maintain.

Some Legitimate Uses for Counters

There are cases where we still want to use loop variables in a list comprehension. For example, we need to use a loop variable to extract successive overlapping n-grams from a list:

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> n = 3
>>> [sent[i:i+n] for i in range(len(sent)-n+1)]
[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]

It is quite tricky to get the range of the loop variable right. Since this is a common operation in NLP, NLTK supports it with functions bigrams(text) and trigrams(text), and a general-purpose ngrams(text, n).

Here’s an example of how we can use loop variables in building multidimensional structures. For example, to build an array with m rows and n columns, where each cell is a set, we could use a nested list comprehension:

>>> m, n = 3, 7
>>> array = [[set() for i in range(n)] for j in range(m)]
>>> array[2][5].add('Alice')
>>> pprint.pprint(array)
[[set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set(['Alice']), set([])]]

Observe that the loop variables i and j are not used anywhere in the resulting object; they are just needed for a syntactically correct for statement. As another example of this usage, observe that the expression ['very' for i in range(3)] produces a list containing three instances of 'very', with no integers in sight.

Note that it would be incorrect to do this work using multiplication, for reasons concerning object copying that were discussed earlier in this section.

>>> array = [[set()] * n] * m
>>> array[2][5].add(7)
>>> pprint.pprint(array)
[[set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])]]

Iteration is an important programming device. It is tempting to adopt idioms from other languages. However, Python offers some elegant and highly readable alternatives, as we have seen.

The best content for your career. Discover unlimited learning on demand for around $1/day.