Natural Language Processing with Java and LingPipe Cookbook

Book Description

Over 60 effective recipes for developing your Natural Language Processing (NLP) skills quickly and efficiently

In Detail

NLP is at the core of web search, intelligent personal assistants, marketing, and much more, and LingPipe is a toolkit for processing text using computational linguistics.

This book starts with the foundational but powerful techniques of language identification, sentiment classifiers, and evaluation frameworks. It goes on to detail how to build a robust framework to solve common NLP problems, before ending with advanced techniques for complex heterogeneous NLP systems.

This is a recipe and tutorial book for experienced Java developers with NLP needs. A basic knowledge of NLP terminology will be beneficial. This book will guide you through the process of building NLP apps with minimal fuss and maximal impact.

What You Will Learn

  • Master a broad range of classification techniques for text data
  • Track people, concepts, and things in your data, both within and across documents
  • Understand the importance of evaluation in creating NLP applications and how to carry it out
  • Apply best practices for common text-analytics problems
  • Tune systems for high performance and trade off various aspects of the performance curve
  • Become a master of customizing NLP systems at all levels
  • Build systems for non-tokenized languages such as Chinese and Japanese

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.
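As a small taste of the recipes covered (for example, "Eliminate near duplicates with the Jaccard distance" in Chapter 1), here is a minimal, self-contained sketch of the Jaccard distance over word tokens. This is plain Java with illustrative names, not LingPipe's own API; the book's recipes use LingPipe classes for tokenization and distance computation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    // Jaccard proximity = |A ∩ B| / |A ∪ B| over token sets;
    // distance = 1 - proximity, so identical texts score 0.0
    // and texts with no shared tokens score 1.0.
    static double jaccardDistance(String a, String b) {
        Set<String> tokensA =
            new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB =
            new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Near duplicates score close to 0.0; unrelated text scores close to 1.0.
        System.out.println(
            jaccardDistance("the quick brown fox", "the quick brown fox jumps"));
        System.out.println(
            jaccardDistance("hello world", "completely different text"));
    }
}
```

Thresholding this distance is enough to filter near-duplicate tweets or boilerplate before training a classifier, which is how the recipe applies it.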

Table of Contents

    1. Natural Language Processing with Java and LingPipe Cookbook
      1. Table of Contents
      2. Natural Language Processing with Java and LingPipe Cookbook
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Simple Classifiers
        1. Introduction
          1. LingPipe and its installation
            1. Projects similar to LingPipe
            2. So, why use LingPipe?
            3. Downloading the book code and data
            4. Downloading LingPipe
        2. Deserializing and running a classifier
          1. How to do it...
          2. How it works...
        3. Getting confidence estimates from a classifier
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. See also
        4. Getting data from the Twitter API
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        5. Applying a classifier to a .csv file
          1. How to do it...
          2. How it works…
        6. Evaluation of classifiers – the confusion matrix
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        7. Training your own language model classifier
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
        8. How to train and evaluate with cross validation
          1. Getting ready
          2. How to do it...
          3. How it works…
          4. There's more…
        9. Viewing error categories – false positives
          1. How to do it...
          2. How it works…
        10. Understanding precision and recall
        11. How to serialize a LingPipe object – classifier example
          1. Getting ready
          2. How to do it...
          3. How it works…
          4. There's more…
        12. Eliminate near duplicates with the Jaccard distance
          1. How to do it…
          2. How it works…
        13. How to classify sentiment – simple version
          1. How to do it…
          2. How it works...
          3. There's more…
            1. Common problems as a classification problem
              1. Topic detection
              2. Question answering
              3. Degree of sentiment
              4. Non-exclusive category classification
              5. Person/company/location detection
      9. 2. Finding and Working with Words
        1. Introduction
        2. Introduction to tokenizer factories – finding words in a character stream
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more…
        3. Combining tokenizers – lowercase tokenizer
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        4. Combining tokenizers – stop word tokenizers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        5. Using Lucene/Solr tokenizers
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. See also
        6. Using Lucene/Solr tokenizers with LingPipe
          1. How to do it...
          2. How it works...
        7. Evaluating tokenizers with unit tests
          1. How to do it...
        8. Modifying tokenizer factories
          1. How to do it...
          2. How it works...
        9. Finding words for languages without white spaces
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
          5. See also
      10. 3. Advanced Classifiers
        1. Introduction
        2. A simple classifier
          1. How to do it...
          2. How it works...
          3. There's more…
        3. Language model classifier with tokens
          1. How to do it...
          2. There's more...
        4. Naïve Bayes
          1. Getting ready
          2. How to do it...
          3. See also
        5. Feature extractors
          1. How to do it...
          2. How it works…
        6. Logistic regression
          1. How logistic regression works
          2. Getting ready
          3. How to do it...
        7. Multithreaded cross validation
          1. How to do it...
          2. How it works…
        8. Tuning parameters in logistic regression
          1. How to do it...
          2. How it works…
            1. Tuning feature extraction
            2. Priors
            3. Annealing schedule and epochs
        9. Customizing feature extraction
          1. How to do it…
          2. There's more…
        10. Combining feature extractors
          1. How to do it…
          2. There's more…
        11. Classifier-building life cycle
          1. Getting ready
          2. How to do it…
            1. Sanity check – test on training data
            2. Establishing a baseline with cross validation and metrics
            3. Picking a single metric to optimize against
            4. Implementing the evaluation metric
        12. Linguistic tuning
          1. How to do it…
        13. Thresholding classifiers
          1. How to do it...
          2. How it works…
        14. Train a little, learn a little – active learning
          1. Getting ready
          2. How to do it…
          3. How it works...
        15. Annotation
          1. How to do it...
          2. How it works…
          3. There's more…
      11. 4. Tagging Words and Tokens
        1. Introduction
        2. Interesting phrase detection
          1. How to do it...
          2. How it works...
          3. There's more...
        3. Foreground- or background-driven interesting phrase detection
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Hidden Markov Models (HMM) – part-of-speech
          1. How to do it...
          2. How it works...
        5. N-best word tagging
          1. How to do it...
          2. How it works...
        6. Confidence-based tagging
          1. How to do it...
          2. How it works…
        7. Training word tagging
          1. How to do it...
          2. How it works…
          3. There's more…
        8. Word-tagging evaluation
          1. Getting ready
          2. How to do it…
          3. There's more…
        9. Conditional random fields (CRF) for word/token tagging
          1. How to do it...
          2. How it works…
            1. SimpleCrfFeatureExtractor
          3. There's more…
        10. Modifying CRFs
          1. How to do it...
          2. How it works…
            1. Candidate-edge features
            2. Node features
          3. There's more…
      12. 5. Finding Spans in Text – Chunking
        1. Introduction
        2. Sentence detection
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Nested sentences
        3. Evaluation of sentence detection
          1. How to do it...
          2. How it works...
            1. Parsing annotated data
        4. Tuning sentence detection
          1. How to do it...
          2. There's more...
        5. Marking embedded chunks in a string – sentence chunk example
          1. How to do it...
        6. Paragraph detection
          1. How to do it...
        7. Simple noun phrases and verb phrases
          1. How to do it…
          2. How it works…
        8. Regular expression-based chunking for NER
          1. How to do it…
          2. How it works…
          3. See also
        9. Dictionary-based chunking for NER
          1. How to do it…
          2. How it works…
        10. Translating between word tagging and chunks – BIO codec
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        11. HMM-based NER
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
          5. See also
        12. Mixing the NER sources
          1. How to do it…
          2. How it works…
        13. CRFs for chunking
          1. Getting ready
          2. How to do it...
          3. How it works…
        14. NER using CRFs with better features
          1. How to do it…
          2. How it works…
      13. 6. String Comparison and Clustering
        1. Introduction
        2. Distance and proximity – simple edit distance
          1. How to do it...
          2. How it works...
          3. See also
        3. Weighted edit distance
          1. How to do it...
          2. How it works...
          3. See also
        4. The Jaccard distance
          1. How to do it...
          2. How it works...
        5. The Tf-Idf distance
          1. How to do it...
          2. How it works...
          3. There's more...
            1. Difference between supervised and unsupervised trainings
            2. Training on test data is OK
        6. Using edit distance and language models for spelling correction
          1. How to do it...
          2. How it works...
          3. See also
        7. The case restoring corrector
          1. How to do it...
          2. How it works...
          3. See also
        8. Automatic phrase completion
          1. How to do it...
          2. How it works...
          3. See also
        9. Single-link and complete-link clustering using edit distance
          1. How to do it…
          2. There's more…
          3. See also
        10. Latent Dirichlet allocation (LDA) for multitopic clustering
          1. Getting ready
          2. How to do it…
      14. 7. Finding Coreference Between Concepts/People
        1. Introduction
        2. Named entity coreference with a document
          1. Getting ready
          2. How to do it…
          3. How it works…
        3. Adding pronouns to coreference
          1. How to do it…
          2. How it works…
          3. See also
        4. Cross-document coreference
          1. How to do it...
          2. How it works…
            1. The batch process life cycle
              1. Setting up the entity universe
              2. ProcessDocuments() and ProcessDocument()
              3. Computing XDoc
              4. The promote() method
              5. The createEntitySpeculative() method
              6. The XDocCoref.addMentionChainToEntity() method
              7. The XDocCoref.resolveMentionChain() method
              8. The resolveCandidates() method
        5. The John Smith problem
          1. Getting ready
          2. How to do it...
          3. See also
      15. Index