Natural Language Processing with Java

Book Description

Explore various approaches to organize and extract useful text from unstructured data using Java

In Detail

Natural Language Processing (NLP) is an important area of application development, and its relevance in addressing contemporary problems will only increase. Demand has grown significantly for applications that are accessible through natural language and supported by NLP tasks.

Natural Language Processing with Java explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. It covers NLP concepts in a way that readers without a background in statistics or natural language processing can understand.

What You Will Learn

  • Develop a deep understanding of the basic NLP tasks and how they relate to each other

  • Discover and use the available tokenization engines

  • Implement techniques for end of sentence detection

  • Apply search techniques to find people and things within a document

  • Construct solutions to identify parts of speech within sentences

  • Use parsers to extract relationships between elements of a document

  • Integrate basic tasks to tackle more complex NLP problems

Downloading the Example Code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Natural Language Processing with Java
      1. Table of Contents
      2. Natural Language Processing with Java
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introduction to NLP
        1. What is NLP?
        2. Why use NLP?
        3. Why is NLP so hard?
        4. Survey of NLP tools
          1. Apache OpenNLP
          2. Stanford NLP
          3. LingPipe
          4. GATE
          5. UIMA
        5. Overview of text processing tasks
          1. Finding parts of text
          2. Finding sentences
          3. Finding people and things
          4. Detecting Parts of Speech
          5. Classifying text and documents
          6. Extracting relationships
          7. Using combined approaches
        6. Understanding NLP models
          1. Identifying the task
          2. Selecting a model
          3. Building and training the model
          4. Verifying the model
          5. Using the model
        7. Preparing data
        8. Summary
      9. 2. Finding Parts of Text
        1. Understanding the parts of text
        2. What is tokenization?
          1. Uses of tokenizers
        3. Simple Java tokenizers
          1. Using the Scanner class
            1. Specifying the delimiter
          2. Using the split method
          3. Using the BreakIterator class
          4. Using the StreamTokenizer class
          5. Using the StringTokenizer class
          6. Performance considerations with Java core tokenization
        4. NLP tokenizer APIs
          1. Using the OpenNLPTokenizer class
            1. Using the SimpleTokenizer class
            2. Using the WhitespaceTokenizer class
            3. Using the TokenizerME class
          2. Using the Stanford tokenizer
            1. Using the PTBTokenizer class
            2. Using the DocumentPreprocessor class
            3. Using a pipeline
            4. Using LingPipe tokenizers
          3. Training a tokenizer to find parts of text
          4. Comparing tokenizers
        5. Understanding normalization
          1. Converting to lowercase
          2. Removing stopwords
            1. Creating a StopWords class
            2. Using LingPipe to remove stopwords
          3. Using stemming
            1. Using the Porter Stemmer
            2. Stemming with LingPipe
          4. Using lemmatization
            1. Using the StanfordLemmatizer class
            2. Using lemmatization in OpenNLP
          5. Normalizing using a pipeline
        6. Summary
      10. 3. Finding Sentences
        1. The SBD process
        2. What makes SBD difficult?
        3. Understanding SBD rules of LingPipe's HeuristicSentenceModel class
        4. Simple Java SBDs
          1. Using regular expressions
          2. Using the BreakIterator class
        5. Using NLP APIs
          1. Using OpenNLP
            1. Using the SentenceDetectorME class
            2. Using the sentPosDetect method
          2. Using the Stanford API
            1. Using the PTBTokenizer class
            2. Using the DocumentPreprocessor class
            3. Using the StanfordCoreNLP class
          3. Using LingPipe
            1. Using the IndoEuropeanSentenceModel class
            2. Using the SentenceChunker class
            3. Using the MedlineSentenceModel class
        6. Training a Sentence Detector model
          1. Using the Trained model
          2. Evaluating the model using the SentenceDetectorEvaluator class
        7. Summary
      11. 4. Finding People and Things
        1. Why is NER difficult?
        2. Techniques for name recognition
          1. Lists and regular expressions
          2. Statistical classifiers
        3. Using regular expressions for NER
          1. Using Java's regular expressions to find entities
          2. Using LingPipe's RegExChunker class
        4. Using NLP APIs
          1. Using OpenNLP for NER
            1. Determining the accuracy of the entity
            2. Using other entity types
            3. Processing multiple entity types
          2. Using the Stanford API for NER
          3. Using LingPipe for NER
            1. Using LingPipe's name entity models
            2. Using the ExactDictionaryChunker class
        5. Training a model
          1. Evaluating a model
        6. Summary
      12. 5. Detecting Part of Speech
        1. The tagging process
          1. Importance of POS taggers
          2. What makes POS difficult?
        2. Using the NLP APIs
          1. Using OpenNLP POS taggers
            1. Using the OpenNLP POSTaggerME class for POS taggers
            2. Using OpenNLP chunking
            3. Using the POSDictionary class
              1. Obtaining the tag dictionary for a tagger
              2. Determining a word's tags
              3. Changing a word's tags
              4. Adding a new tag dictionary
              5. Creating a dictionary from a file
          2. Using Stanford POS taggers
            1. Using Stanford MaxentTagger
            2. Using the MaxentTagger class to tag textese
            3. Using Stanford pipeline to perform tagging
          3. Using LingPipe POS taggers
            1. Using the HmmDecoder class with Best_First tags
            2. Using the HmmDecoder class with NBest tags
            3. Determining tag confidence with the HmmDecoder class
          4. Training the OpenNLP POSModel
        3. Summary
      13. 6. Classifying Texts and Documents
        1. How classification is used
        2. Understanding sentiment analysis
        3. Text classifying techniques
        4. Using APIs to classify text
          1. Using OpenNLP
            1. Training an OpenNLP classification model
            2. Using DocumentCategorizerME to classify text
          2. Using Stanford API
            1. Using the ColumnDataClassifier class for classification
            2. Using the Stanford pipeline to perform sentiment analysis
          3. Using LingPipe to classify text
            1. Training text using the Classified class
            2. Using other training categories
            3. Classifying text using LingPipe
            4. Sentiment analysis using LingPipe
            5. Language identification using LingPipe
        5. Summary
      14. 7. Using Parser to Extract Relationships
        1. Relationship types
        2. Understanding parse trees
        3. Using extracted relationships
        4. Extracting relationships
        5. Using NLP APIs
          1. Using OpenNLP
          2. Using the Stanford API
            1. Using the LexicalizedParser class
            2. Using the TreePrint class
            3. Finding word dependencies using the GrammaticalStructure class
          3. Finding coreference resolution entities
        6. Extracting relationships for a question-answer system
          1. Finding the word dependencies
          2. Determining the question type
          3. Searching for the answer
        7. Summary
      15. 8. Combined Approaches
        1. Preparing data
          1. Using Boilerpipe to extract text from HTML
          2. Using POI to extract text from Word documents
          3. Using PDFBox to extract text from PDF documents
        2. Pipelines
          1. Using the Stanford pipeline
          2. Using multiple cores with the Stanford pipeline
        3. Creating a pipeline to search text
        4. Summary
      16. Index