Learning Path: Get Started with Natural Language Processing Using Python, Spark, and Scala

Video Description

Whether you’re a programmer with little to no knowledge of Python, or an experienced data scientist or engineer, this Learning Path will walk you through natural language processing in both Python and Scala, and show you how to use a range of popular text mining tools, including Spark, scikit-learn, spaCy, NLTK, and gensim.

You’ll learn the most common techniques for processing text, how to use machine learning to generate annotators and apply them within a data pipeline, and the differences between NLP pipelines and other approaches to semantic text mining. You’ll learn about standard UIMA annotators, custom annotators, and machine-learned annotators, and understand how architectures for text processing pipelines can incorporate some of the most popular big data tools such as Kafka, Spark, SparkSQL, Cassandra, and ElasticSearch.

By the end of the Learning Path, you will be able to build a natural language processing and entity extraction pipeline, and will understand the capabilities and limitations of natural language text processing.

Materials or downloads needed in advance: Example files

Table of Contents

  1. Introduction
    1. Course Introduction 00:02:25
    2. About The Author 00:00:36
  2. Getting Started: Basic String Processing In Python
    1. String Operations 00:04:49
    2. Working With Unicode 00:05:16
  3. Converting Text To Symbols: Tokenization In NLTK and spaCy
    1. Splitting Documents 00:04:41
    2. Splitting Sentences 00:03:20
    3. Filtering Stop Words 00:02:07
  4. Going Subsymbolic: Vector Representations
    1. tf-idf Gensim 00:09:24
    2. Word Vectors 00:03:35
    3. Google Word Vectors 00:04:03
    4. Learn Word Vectors 00:08:07
  5. Finding The Structure Of Text: Parsing In spaCy
    1. Dependency Parsing 00:03:39
    2. Sentence Head 00:02:23
    3. Named Entities 00:03:21
  6. Determining How The Writer Feels: Sentiment Analysis In VADER
    1. Sentiment Analysis Intro 00:03:18
    2. Sentiment In VADER 00:05:13
  7. Making Decisions: Text Classification
    1. Text Classification Intro 00:02:45
    2. Classification With TextBlob 00:10:25
    3. Classification With scikit-learn 00:07:17
  8. Identifying Discussed Topics: LDA In Gensim
    1. LDA Introduction 00:02:43
    2. LDA Gensim 00:07:13
    3. LDA pyLDAvis 00:03:54
  9. Toward Machine Reading: Entity Extraction And Linking
    1. Entity Linking 00:03:28
    2. pyspotlight 00:03:16
    3. FRED 00:03:16
  10. Conclusion
    1. Conclusion 00:02:24
  11. Part 1: Introduction
    1. Welcome to the Course 00:01:39
    2. Natural Language Understanding in Examples 00:10:09
  12. Part 2: NLP Pipelines
    1. Building an NLP Pipeline 00:15:49
  13. Part 3: Annotators
    1. Commonly Used Annotators 00:08:47
    2. Detecting Positive, Negative & Speculative Polarity 00:12:09
    3. Machine Learned Annotators 00:12:16
  14. Part 4: Custom Annotators
    1. NLP Pipelines are Domain Specific 00:06:55
    2. Unified Medical Language System (UMLS) 00:03:33
    3. Coding Custom Annotators 00:07:17
  15. Part 5: Machine Learned Annotators
    1. Training & Using Machine Learned Annotators 00:09:45
  16. Part 6: Ontology Enrichment
    1. The Need for Learned and Updated Ontologies 00:09:39
    2. Learning New Medical Concepts and Relationships 00:19:37
  17. Part 7: Architecture
    1. An End-to-End Reference Architecture 00:04:19
    2. Spark, SparkSQL, Cassandra Workflow 00:03:16
    3. ElasticSearch & SparkSQL 00:06:52
  18. Part 8: Parting Advice
    1. Language is Source and Domain-Specific 00:09:32
    2. Welcome to the Course 00:01:37
  19. Part 1: Building a natural language processing and entity extraction pipeline on Scala & Spark
    1. Notebook 1: Introduction 00:02:35
    2. Annotation Library 00:04:15
    3. Basic Annotators 00:08:59
    4. Vocabulary Analysis 00:09:30
    5. Exercise: Building a stopword annotator 00:05:06
  20. Part 2: Machine Learning Applications for Statistical Natural Language Understanding at Scale
    1. Notebook 2: Introduction 00:02:14
    2. Model-based Annotators 00:04:18
    3. Creating a Binary Classifier 00:14:38
    4. Exercise: Predicting score or popularity 00:05:30
  21. Part 3: Topic Modeling on Natural Language with Scala, Spark and MLLib
    1. Notebook 3: Introduction 00:02:12
    2. K-Means clustering 00:07:03
    3. LDA topic modeling 00:07:39
    4. Exercise: Using topics for score or popularity prediction 00:02:36
  22. Part 4: Deep Learning Applications for Natural Language Understanding with Scala, Spark and MLLib
    1. Notebook 4: Introduction 00:02:07
    2. Word2Vec 00:05:05
    3. Expanding genre entity lists 00:04:49
    4. Exercise: Using Word2Vec based features for score or popularity prediction 00:02:44