Python 3 Text Processing with NLTK 3 Cookbook

Book Description

Over 80 practical recipes on natural language processing techniques using Python's NLTK 3.0

In Detail

This book will show you the essential techniques of text and language processing. Starting with tokenization, stemming, and the WordNet dictionary, you'll progress to part-of-speech tagging, phrase chunking, and named entity recognition. You'll learn how various text corpora are organized, as well as how to create your own custom corpus. Then, you'll move on to text classification with a focus on sentiment analysis. Because NLP can be computationally expensive on large bodies of text, you'll also try a few methods for distributed text processing. Finally, you'll be introduced to a number of other small but complementary Python libraries for text analysis, cleaning, and parsing.

This cookbook provides simple, straightforward examples so you can quickly learn text processing with Python and NLTK.

What You Will Learn

  • Tokenize text into sentences, and sentences into words
  • Look up words in the WordNet dictionary
  • Apply spelling correction and word replacement
  • Access the built-in text corpora and create your own custom corpus
  • Tag words with parts of speech
  • Chunk phrases and recognize named entities
  • Grammatically transform phrases and chunks
  • Classify text and perform sentiment analysis

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Python 3 Text Processing with NLTK 3 Cookbook
      1. Table of Contents
      2. Python 3 Text Processing with NLTK 3 Cookbook
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Tokenizing Text and WordNet Basics
        1. Introduction
        2. Tokenizing text into sentences
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Tokenizing sentences in other languages
          5. See also
        3. Tokenizing sentences into words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Separating contractions
            2. PunktWordTokenizer
            3. WordPunctTokenizer
          4. See also
        4. Tokenizing sentences using regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Simple whitespace tokenizer
          5. See also
        5. Training a sentence tokenizer
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Filtering stopwords in a tokenized sentence
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        7. Looking up Synsets for a word in WordNet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Working with hypernyms
            2. Part of speech (POS)
          5. See also
        8. Looking up lemmas and synonyms in WordNet
          1. How to do it...
          2. How it works...
          3. There's more...
            1. All possible synonyms
            2. Antonyms
          4. See also
        9. Calculating WordNet Synset similarity
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Comparing verbs
            2. Path and Leacock Chodorow (LCH) similarity
          4. See also
        10. Discovering word collocations
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Scoring functions
            2. Scoring ngrams
          5. See also
      9. 2. Replacing and Correcting Words
        1. Introduction
        2. Stemming words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. The LancasterStemmer class
            2. The RegexpStemmer class
            3. The SnowballStemmer class
          4. See also
        3. Lemmatizing words with WordNet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Combining stemming with lemmatization
          5. See also
        4. Replacing words matching regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Replacement before tokenization
          5. See also
        5. Removing repeating characters
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Spelling correction with Enchant
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. The en_GB dictionary
            2. Personal word lists
          5. See also
        7. Replacing synonyms
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. CSV synonym replacement
            2. YAML synonym replacement
          5. See also
        8. Replacing negations with antonyms
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
      10. 3. Creating Custom Corpora
        1. Introduction
        2. Setting up a custom corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Loading a YAML file
          5. See also
        3. Creating a wordlist corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Names wordlist corpus
            2. English words corpus
          5. See also
        4. Creating a part-of-speech tagged word corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Customizing the word tokenizer
            2. Customizing the sentence tokenizer
            3. Customizing the paragraph block reader
            4. Customizing the tag separator
            5. Converting tags to a universal tagset
          5. See also
        5. Creating a chunked phrase corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Tree leaves
            2. Treebank chunk corpus
            3. CoNLL2000 corpus
          5. See also
        6. Creating a categorized text corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Category file
            2. Categorized tagged corpus reader
            3. Categorized corpora
          5. See also
        7. Creating a categorized chunk corpus reader
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Categorized CoNLL chunk corpus reader
          5. See also
        8. Lazy corpus loading
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Creating a custom corpus view
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Block reader functions
            2. Pickle corpus view
            3. Concatenated corpus view
          4. See also
        10. Creating a MongoDB-backed corpus reader
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        11. Corpus editing with file locking
          1. Getting ready
          2. How to do it...
          3. How it works...
      11. 4. Part-of-speech Tagging
        1. Introduction
        2. Default tagging
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Evaluating accuracy
            2. Tagging sentences
            3. Untagging a tagged sentence
          5. See also
        3. Training a unigram part-of-speech tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Overriding the context model
            2. Minimum frequency cutoff
          4. See also
        4. Combining taggers with backoff tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Saving and loading a trained tagger with pickle
          4. See also
        5. Training and combining ngram taggers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Quadgram tagger
          5. See also
        6. Creating a model of likely word tags
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        7. Tagging with regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. Affix tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Working with min_stem_length
          4. See also
        9. Training a Brill tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Tracing
          4. See also
        10. Training the TnT tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Controlling the beam search
            2. Significance of capitalization
          4. See also
        11. Using WordNet for tagging
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        12. Tagging proper names
          1. How to do it...
          2. How it works...
          3. See also
        13. Classifier-based tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Detecting features with a custom feature detector
            2. Setting a cutoff probability
            3. Using a pre-trained classifier
          4. See also
        14. Training a tagger with NLTK-Trainer
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Saving a pickled tagger
            2. Training on a custom corpus
            3. Training with universal tags
            4. Analyzing a tagger against a tagged corpus
            5. Analyzing a tagged corpus
          4. See also
      12. 5. Extracting Chunks
        1. Introduction
        2. Chunking and chinking with regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Parsing different chunk types
            2. Parsing alternative patterns
            3. Chunk rule with context
          5. See also
        3. Merging and splitting chunks with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Specifying rule descriptions
          4. See also
        4. Expanding and removing chunks with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Partial parsing with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
            1. The ChunkScore metrics
            2. Looping and tracing chunk rules
          4. See also
        6. Training a tagger-based chunker
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Using different taggers
          4. See also
        7. Classification-based chunking
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Using a different classifier builder
          4. See also
        8. Extracting named entities
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Binary named entity extraction
          4. See also
        9. Extracting proper noun chunks
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        10. Extracting location chunks
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        11. Training a named entity chunker
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        12. Training a chunker with NLTK-Trainer
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Saving a pickled chunker
            2. Training a named entity chunker
            3. Training on a custom corpus
            4. Training on parse trees
            5. Analyzing a chunker against a chunked corpus
            6. Analyzing a chunked corpus
          4. See also
      13. 6. Transforming Chunks and Trees
        1. Introduction
        2. Filtering insignificant words from a sentence
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Correcting verb forms
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        4. Swapping verb phrases
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Swapping noun cardinals
          1. How to do it...
          2. How it works...
          3. See also
        6. Swapping infinitive phrases
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        7. Singularizing plural nouns
          1. How to do it...
          2. How it works...
          3. See also
        8. Chaining chunk transformations
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        9. Converting a chunk tree to text
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        10. Flattening a deep tree
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. The cess_esp and cess_cat treebank
          5. See also
        11. Creating a shallow tree
          1. How to do it...
          2. How it works...
          3. See also
        12. Converting tree labels
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      14. 7. Text Classification
        1. Introduction
        2. Bag of words feature extraction
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Filtering stopwords
            2. Including significant bigrams
          4. See also
        3. Training a Naive Bayes classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Classification probability
            2. Most informative features
            3. Training estimator
            4. Manual training
          5. See also
        4. Training a decision tree classifier
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Controlling uncertainty with entropy_cutoff
            2. Controlling tree depth with depth_cutoff
            3. Controlling decisions with support_cutoff
          4. See also
        5. Training a maximum entropy classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Megam algorithm
          5. See also
        6. Training scikit-learn classifiers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Comparing Naive Bayes algorithms
            2. Training with logistic regression
            3. Training with LinearSVC
          5. See also
        7. Measuring precision and recall of a classifier
          1. How to do it...
          2. How it works...
          3. There's more...
            1. F-measure
          4. See also
        8. Calculating high information words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. The MaxentClassifier class with high information words
            2. The DecisionTreeClassifier class with high information words
            3. The SklearnClassifier class with high information words
          4. See also
        9. Combining classifiers with voting
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        10. Classifying with multiple binary classifiers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        11. Training a classifier with NLTK-Trainer
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Saving a pickled classifier
            2. Using different training instances
            3. The most informative features
            4. The Maxent and LogisticRegression classifiers
            5. SVMs
            6. Combining classifiers
            7. High information words and bigrams
            8. Cross-fold validation
            9. Analyzing a classifier
          4. See also
      15. 8. Distributed Processing and Handling Large Datasets
        1. Introduction
        2. Distributed tagging with execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Creating multiple channels
            2. Local versus remote gateways
          5. See also
        3. Distributed chunking with execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Python subprocesses
          5. See also
        4. Parallel list processing with execnet
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Storing a frequency distribution in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Storing a conditional frequency distribution in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        7. Storing an ordered dictionary in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. Distributed word scoring with Redis and execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
      16. 9. Parsing Specific Data Types
        1. Introduction
        2. Parsing dates and times with dateutil
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Timezone lookup and conversion
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Local timezone
            2. Custom offsets
          5. See also
        4. Extracting URLs from HTML with lxml
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Extracting links directly
            2. Parsing HTML from URLs or files
            3. Extracting links with XPaths
          5. See also
        5. Cleaning and stripping HTML
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Converting HTML entities with BeautifulSoup
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Extracting URLs with BeautifulSoup
          5. See also
        7. Detecting and converting character encodings
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Converting to ASCII
            2. UnicodeDammit conversion
          5. See also
      17. A. Penn Treebank Part-of-speech Tags
      18. Index