Applied Text Analysis with Python

Book Description

With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

The programming landscape of natural language processing has changed dramatically in the past few years. Machine learning approaches now require mature tools like Python’s scikit-learn to apply models to text at scale. This practical guide shows programmers and data scientists who have an intermediate-level understanding of Python and a basic understanding of machine learning and natural language processing how to become more proficient in these two exciting areas of data science.

This book presents a concise, focused, and applied approach to text analysis with Python, and covers topics including text ingestion and wrangling, basic machine learning on text, classification for text analysis, entity resolution, and text visualization. Applied Text Analysis with Python will enable you to design and develop language-aware data products.

You’ll learn how and why machine learning algorithms make decisions about language to analyze text; how to ingest, wrangle, and preprocess language data; and how the three primary text analysis libraries in Python work in concert.

Table of Contents

  1. Preface
    1. Computational Challenges of Natural Language
      1. Linguistic Data: Tokens and Words
      2. Enter Machine Learning
    2. Tools for Text Analysis
    3. What to Expect from This Book
    4. Who This Book Is For
    5. Code Examples and GitHub Repository
    6. Conventions Used in This Book
    7. Using Code Examples
    8. Safari® Books Online
    9. How to Contact Us
    10. Acknowledgments
  2. 1. Language and Computation
    1. The Data Science Paradigm
    2. Language-Aware Data Products
      1. The Data Product Pipeline
    3. Language as Data
      1. A Computational Model of Language
      2. Language Features
      3. Contextual Features
      4. Structural Features
    4. Conclusion
  3. 2. Building a Custom Corpus
    1. What Is a Corpus?
      1. Domain-Specific Corpora
      2. The Baleen Ingestion Engine
    2. Corpus Data Management
      1. Corpus Disk Structure
    3. Corpus Readers
      1. Streaming Data Access with NLTK
      2. Reading an HTML Corpus
      3. Reading a Corpus from a Database
    4. Conclusion
  4. 3. Corpus Preprocessing and Wrangling
    1. Breaking Down Documents
      1. Identifying and Extracting Core Content
      2. Deconstructing Documents into Paragraphs
      3. Segmentation: Breaking Out Sentences
      4. Tokenization: Identifying Individual Tokens
      5. Part-of-Speech Tagging
      6. Intermediate Corpus Analytics
    2. Corpus Transformation
      1. Intermediate Preprocessing and Storage
      2. Reading the Processed Corpus
    3. Conclusion
  5. 4. Text Vectorization and Transformation Pipelines
    1. Words in Space
      1. Frequency Vectors
      2. One-Hot Encoding
      3. Term Frequency–Inverse Document Frequency
      4. Distributed Representation
    2. The Scikit-Learn API
      1. The BaseEstimator Interface
      2. Extending TransformerMixin
    3. Pipelines
      1. Pipeline Basics
      2. Grid Search for Hyperparameter Optimization
      3. Enriching Feature Extraction with Feature Unions
    4. Conclusion
  6. 5. Classification for Text Analysis
    1. Text Classification
      1. Identifying Classification Problems
      2. Classifier Models
    2. Building a Text Classification Application
      1. Cross-Validation
      2. Model Construction
      3. Model Evaluation
      4. Model Operationalization
    3. Conclusion
  7. 6. Clustering for Text Similarity
    1. Unsupervised Learning on Text
    2. Clustering by Document Similarity
      1. Distance Metrics
      2. Partitive Clustering
      3. Hierarchical Clustering
    3. Modeling Document Topics
      1. Latent Dirichlet Allocation (LDA)
      2. Latent Semantic Analysis (LSA)
      3. Non-Negative Matrix Factorization
    4. Conclusion
  8. 7. Context-Aware Text Analysis
    1. Grammar-Based Feature Extraction
      1. Extracting Keyphrases
      2. Extracting Entities
    2. n-Gram Feature Extraction
      1. An n-Gram-Aware CorpusReader
      2. Choosing the Right n-Gram Window
      3. Significant Collocations
    3. n-Gram Language Models
      1. Frequency and Conditional Frequency
      2. Estimating Maximum Likelihood
      3. Unknown Words: Back-off and Smoothing
      4. Language Generation
    4. Conclusion
  9. 8. Text Visualization
    1. Visualizing Feature Space
      1. Visual Feature Analysis
      2. Guided Feature Engineering
    2. Model Diagnostics
      1. Visualizing Clusters
      2. Visualizing Classes
      3. Diagnosing Classification Error
    3. Visual Steering
      1. Silhouette Scores and Elbow Curves
    4. Conclusion
  10. 9. Graph Analysis of Text
    1. Graph Computation and Analysis
      1. Creating a Graph-based Thesaurus
      2. Analyzing Graph Structure
      3. Visual Analysis of Graphs
    2. Extracting Graphs from Text
      1. Creating a Social Graph
      2. Insights from the Social Graph
    3. Entity Resolution
      1. Entity Resolution on a Graph
      2. Blocking with Structure
      3. Fuzzy Blocking
    4. Conclusion
  11. 10. Chatbots
    1. Fundamentals of Conversation
      1. Dialog: A Brief Exchange
      2. Maintaining a Conversation
    2. Rules for Polite Conversation
      1. Greetings and Salutations
      2. Handling Miscommunication
    3. Entertaining Questions
      1. Dependency Parsing
      2. Constituency Parsing
      3. Question Detection
      4. From Tablespoons to Grams
    4. Learning to Help
      1. Being Neighborly
      2. Offering Recommendations
    5. Conclusion
  12. 11. Scaling Text Analytics with Multiprocessing and Spark
    1. Python Multiprocessing
      1. Running Tasks in Parallel
      2. Process Pools and Queues
      3. Parallel Corpus Preprocessing
    2. Cluster Computing with Spark
      1. Anatomy of a Spark Job
      2. Distributing the Corpus
      3. RDD Operations
      4. NLP with Spark
    3. Conclusion
  13. 12. Deep Learning and Beyond
    1. Applied Neural Networks
    2. Neural Language Models
      1. Artificial Neural Networks
      2. Deep Learning Architectures
    3. Sentiment Analysis
      1. Deep Structure Analysis
    4. The Future Is (Almost) Here
  14. A. Installing Libraries and Downloading Corpora
  15. Glossary
  16. Index