Python Text Processing with NLTK 2.0 Cookbook

Book Description

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

  • Quickly get to grips with Natural Language Processing - with Text Analysis, Text Mining, and beyond

  • Learn how machines and crawlers interpret and process natural languages

  • Easily work with huge amounts of data and learn how to handle distributed processing

  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

In Detail

    Natural Language Processing is used everywhere - in search engines, spell checkers, mobile phones, computer games - even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing - and this book is your answer.

    Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

    This book cuts the preamble short and dives right into the science of text processing with a practical, hands-on approach.

    Get started by learning how to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as the advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

    This book will teach you all that and beyond, in a hands-on, learn-by-doing manner. Make yourself an expert in using NLTK for Natural Language Processing with this handy companion.

    Table of Contents

    1. Python Text Processing with NLTK 2.0 Cookbook
      1. Table of Contents
      2. Python Text Processing with NLTK 2.0 Cookbook
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Errata
          2. Piracy
          3. Questions
      7. 1. Tokenizing Text and WordNet Basics
        1. Introduction
        2. Tokenizing text into sentences
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Other languages
          5. See also
        3. Tokenizing sentences into words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Contractions
            2. PunktWordTokenizer
            3. WordPunctTokenizer
          4. See also
        4. Tokenizing sentences using regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Simple whitespace tokenizer
          5. See also
        5. Filtering stopwords in a tokenized sentence
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Looking up synsets for a word in WordNet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Hypernyms
            2. Part-of-speech (POS)
          5. See also
        7. Looking up lemmas and synonyms in WordNet
          1. How to do it...
          2. How it works...
          3. There's more...
            1. All possible synonyms
            2. Antonyms
          4. See also
        8. Calculating WordNet synset similarity
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Comparing verbs
            2. Path and LCH similarity
          4. See also
        9. Discovering word collocations
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Scoring functions
            2. Scoring ngrams
          5. See also
      8. 2. Replacing and Correcting Words
        1. Introduction
        2. Stemming words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. LancasterStemmer
            2. RegexpStemmer
            3. SnowballStemmer
          4. See also
        3. Lemmatizing words with WordNet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Combining stemming with lemmatization
          5. See also
        4. Translating text with Babelfish
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Available languages
        5. Replacing words matching regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Replacement before tokenization
          5. See also
        6. Removing repeating characters
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        7. Spelling correction with Enchant
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. en_GB dictionary
            2. Personal word lists
          5. See also
        8. Replacing synonyms
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. CSV synonym replacement
            2. YAML synonym replacement
          5. See also
        9. Replacing negations with antonyms
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
      9. 3. Creating Custom Corpora
        1. Introduction
        2. Setting up a custom corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Loading a YAML file
          5. See also
        3. Creating a word list corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Names corpus
            2. English words
          5. See also
        4. Creating a part-of-speech tagged word corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Customizing the word tokenizer
            2. Customizing the sentence tokenizer
            3. Customizing the paragraph block reader
            4. Customizing the tag separator
            5. Simplifying tags with a tag mapping function
          5. See also
        5. Creating a chunked phrase corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Tree leaves
            2. Treebank chunk corpus
            3. CoNLL2000 corpus
          5. See also
        6. Creating a categorized text corpus
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Category file
            2. Categorized tagged corpus reader
            3. Categorized corpora
          5. See also
        7. Creating a categorized chunk corpus reader
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Categorized Conll chunk corpus reader
          5. See also
        8. Lazy corpus loading
          1. How to do it...
          2. How it works...
          3. There's more...
        9. Creating a custom corpus view
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Block reader functions
            2. Pickle corpus view
            3. Concatenated corpus view
          4. See also
        10. Creating a MongoDB backed corpus reader
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        11. Corpus editing with file locking
          1. Getting ready
          2. How to do it...
          3. How it works...
      10. 4. Part-of-Speech Tagging
        1. Introduction
        2. Default tagging
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Evaluating accuracy
            2. Batch tagging sentences
            3. Untagging a tagged sentence
          5. See also
        3. Training a unigram part-of-speech tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Overriding the context model
            2. Minimum frequency cutoff
          4. See also
        4. Combining taggers with backoff tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Pickling and unpickling a trained tagger
          4. See also
        5. Training and combining Ngram taggers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Quadgram Tagger
          5. See also
        6. Creating a model of likely word tags
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        7. Tagging with regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. Affix tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Min stem length
          4. See also
        9. Training a Brill tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Tracing
          4. See also
        10. Training the TnT tagger
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Controlling the beam search
            2. Capitalization significance
          4. See also
        11. Using WordNet for tagging
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        12. Tagging proper names
          1. How to do it...
          2. How it works...
          3. See also
        13. Classifier based tagging
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Custom feature detector
            2. Cutoff probability
            3. Pre-trained classifier
          4. See also
      11. 5. Extracting Chunks
        1. Introduction
        2. Chunking and chinking with regular expressions
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Different chunk types
            2. Alternative patterns
            3. Chunk rule with context
          5. See also
        3. Merging and splitting chunks with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Rule descriptions
          4. See also
        4. Expanding and removing chunks with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Partial parsing with regular expressions
          1. How to do it...
          2. How it works...
          3. There's more...
            1. ChunkScore metrics
            2. Looping and tracing
          4. See also
        6. Training a tagger-based chunker
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Using different taggers
          4. See also
        7. Classification-based chunking
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Using a different classifier builder
          4. See also
        8. Extracting named entities
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Binary named entity extraction
          4. See also
        9. Extracting proper noun chunks
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        10. Extracting location chunks
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        11. Training a named entity chunker
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
      12. 6. Transforming Chunks and Trees
        1. Introduction
        2. Filtering insignificant words
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Correcting verb forms
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        4. Swapping verb phrases
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Swapping noun cardinals
          1. How to do it...
          2. How it works...
          3. See also
        6. Swapping infinitive phrases
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        7. Singularizing plural nouns
          1. How to do it...
          2. How it works...
          3. See also
        8. Chaining chunk transformations
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        9. Converting a chunk tree to text
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        10. Flattening a deep tree
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. CESS-ESP and CESS-CAT treebank
          5. See also
        11. Creating a shallow tree
          1. How to do it...
          2. How it works...
          3. See also
        12. Converting tree nodes
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
      13. 7. Text Classification
        1. Introduction
        2. Bag of Words feature extraction
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Filtering stopwords
            2. Including significant bigrams
          4. See also
        3. Training a naive Bayes classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Classification probability
            2. Most informative features
            3. Training estimator
            4. Manual training
          5. See also
        4. Training a decision tree classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Entropy cutoff
            2. Depth cutoff
            3. Support cutoff
          5. See also
        5. Training a maximum entropy classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Scipy algorithms
            2. Megam algorithm
          5. See also
        6. Measuring precision and recall of a classifier
          1. How to do it...
          2. How it works...
          3. There's more...
            1. F-measure
          4. See also
        7. Calculating high information words
          1. How to do it...
          2. How it works...
          3. There's more...
            1. MaxentClassifier with high information words
            2. DecisionTreeClassifier with high information words
          4. See also
        8. Combining classifiers with voting
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        9. Classifying with multiple binary classifiers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
      14. 8. Distributed Processing and Handling Large Datasets
        1. Introduction
        2. Distributed tagging with execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Multiple channels
            2. Local versus remote gateways
          5. See also
        3. Distributed chunking with execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Python subprocesses
          5. See also
        4. Parallel list processing with execnet
          1. How to do it...
          2. How it works...
          3. There's more...
          4. See also
        5. Storing a frequency distribution in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        6. Storing a conditional frequency distribution in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        7. Storing an ordered dictionary in Redis
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. Distributed word scoring with Redis and execnet
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
      15. 9. Parsing Specific Data
        1. Introduction
        2. Parsing dates and times with Dateutil
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        3. Time zone lookup and conversion
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Local time zone
            2. Custom offsets
          5. See also
        4. Tagging temporal expressions with Timex
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        5. Extracting URLs from HTML with lxml
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Extracting links directly
            2. Parsing HTML from URLs or files
            3. Extracting links with XPaths
          5. See also
        6. Cleaning and stripping HTML
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        7. Converting HTML entities with BeautifulSoup
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Extracting URLs with BeautifulSoup
          5. See also
        8. Detecting and converting character encodings
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Converting to ASCII
          5. See also
      16. A. Penn Treebank Part-of-Speech Tags
      17. Index