Get Started with Natural Language Processing in Python

Practical techniques for preparing text for custom search, content recommenders, AI applications, and more.

Paco Nathan

Python provides a number of excellent packages for natural language processing (NLP), along with great ways to leverage the results. If you’re new to NLP, this course gives you initial hands-on experience and the confidence to explore much further into deep learning with text, natural language generation, chatbots, and more. First, however, we’ll show you how to prepare text for parsing, extract key phrases, prepare text for indexing in search, and calculate similarity between documents.

Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining. That’s a great starting point for developing custom search, content recommenders, and even AI applications.

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • Why keyword analysis, n-grams, co-occurrence, stemming, and other techniques from a previous generation of NLP tools are no longer the best approaches to use
  • That NLP does not require Big Data tooling or clusters; we’ll show practical applications on a laptop
  • That NLP work leading into AI applications need not be either fully automated or dependent on a huge amount of manual work; we’ll demonstrate “human-in-the-loop” practices that make the best of both people skills and automation
  • Benefits of using Python for NLP applications
  • How statistical parsing works
  • How resources such as WordNet enhance text mining
  • How to extract more than just a list of keywords from a text
  • How to summarize and compare a set of documents
  • How deep learning gets used with natural language

And you’ll be able to:

  • Prepare texts for parsing, e.g., how to handle difficult Unicode
  • Parse sentences into annotated lists, structured as JSON output
  • Perform keyword ranking using TF-IDF, while filtering stop words
  • Calculate a Jaccard similarity measure to compare texts
  • Leverage probabilistic data structures to perform the above more efficiently
  • Use Jupyter notebooks for sharing all of the above within your team
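
As a rough illustration of the keyword-ranking item above, here is a minimal sketch of TF-IDF with stop-word filtering, using only the Python standard library. It is not the course's notebook code: the tiny `STOP_WORDS` set, the `tokenize` and `tf_idf` helpers, and the sample sentences are all assumptions made for this example, and the weighting shown (term frequency times the log of inverse document frequency) is just one common variant.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Made-up sample documents for illustration.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog.",
]

# For each document, list its terms from highest to lowest score.
ranked = [sorted(s, key=s.get, reverse=True) for s in tf_idf(docs)]
for doc, terms in zip(docs, ranked):
    print(doc, "->", terms)
```

Terms that occur in only one document rank highest for that document, while a term found in every document would score zero.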

This training course is for you because...

  • You are a Python programmer and need to learn how to use available NLP packages
  • You are a data scientist with some Python experience and need to leverage NLP and text mining
  • You are interested in chatbots, deep learning, and related AI work, and want to understand the basics for handling text data in those use cases

Prerequisites

  1. Some programming in Python (we’ll use Python 3) – for example, be comfortable with the material in Introduction to Python
  2. Basic understanding of HTML and the DOM structure for web pages – for example, be comfortable with the material in Modern Web Development with HTML5 and CSS
  3. Access to a computer with a browser

System Test:

To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:

Navigate here: https://notebook.oreilly-jupyterhub.com (This is the link to the test site)

  • Sign in with your Safari credentials
  • Click "start my server"
  • Click on "notebook .ipynb"

  • Run each of the code cells: click the cell then either press Shift+Return or click the triangle in the top menu

  • There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, this probably means that your firewall is blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow them.

Recommended Preparation

  • Review the Probabilistic Data Structures in Python tutorial (~30 min)

  • Slides will include many links for further study about particular topics, including a GitHub repo for the code used in the Jupyter notebooks

  • All of the coding exercises in the course will be hosted on JupyterHub, and we'll send the URL out at the start of class. The exercises are purely browser-based, with no installations required.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Section 1. Getting the text… (30 min)

  • Why use Python for NLP?
  • Jump into coding: use BeautifulSoup to extract from HTML
  • Why you need to be concerned about Unicode
  • Applications where NLP matters: search, recommenders, support, etc.
  • Q&A
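
To give a sense of why Unicode deserves attention in this section, here is a minimal sketch of a cleanup step that could run on text after it has been extracted from HTML. It uses only the standard library's `unicodedata` module; the `clean_text` helper, the `QUOTES` table, and the sample string are assumptions made for this example rather than the code used in class.

```python
import unicodedata

# Map typographic quotes to their plain ASCII equivalents.
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def clean_text(raw):
    """Normalize Unicode and tidy whitespace before parsing."""
    # NFKC composes combining marks and folds compatibility characters
    # (ligatures, non-breaking spaces) into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", raw)
    text = text.translate(QUOTES)
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())

# Sample with curly quotes, a combining diaeresis, an "fi" ligature,
# and a non-breaking space, written as escapes so they are visible.
sample = "\u201cnai\u0308ve\u201d \ufb01le\u00a0names"
print(clean_text(sample))
```

Without a step like this, two strings that look identical on screen can compare as different, which quietly breaks tokenizing, counting, and matching later on.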

Section 2. Statistical parsing and annotation (40 min)

  • How statistical parsing works
  • Lemmatization versus stemming
  • Intro to using TextBlob, WordNet, etc.
  • Exercise: Launching a Jupyter notebook
  • Exercise: Splitting sentences and PoS tagging
  • Exercise: Noun-phrase chunking
  • Exercise: Named entity resolution
  • Exercise: Storing annotated text as JSON files
  • Q&A
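
The last exercise in this section stores annotated text as JSON. One possible shape for such a record is sketched below, assuming one JSON object per sentence. The field names (`text`, `tokens`, `word`, `lemma`, `pos`) and the helper functions are hypothetical, and the annotations are written by hand for illustration; in the exercises they would come from a parser.

```python
import json

# Hand-written annotations (Penn Treebank style tags) for illustration only.
sentence = {
    "text": "The cats were sleeping.",
    "tokens": [
        {"word": "The", "lemma": "the", "pos": "DT"},
        {"word": "cats", "lemma": "cat", "pos": "NNS"},
        {"word": "were", "lemma": "be", "pos": "VBD"},
        {"word": "sleeping", "lemma": "sleep", "pos": "VBG"},
        {"word": ".", "lemma": ".", "pos": "."},
    ],
}

def to_json_line(record):
    """Serialize one annotated sentence as a single line of JSON."""
    # ensure_ascii=False keeps non-ASCII text readable in the output.
    return json.dumps(record, ensure_ascii=False)

def from_json_line(line):
    """Load an annotated sentence back from its JSON line."""
    return json.loads(line)

line = to_json_line(sentence)
restored = from_json_line(line)

# Pull out the lemmas, skipping the punctuation token.
lemmas = [t["lemma"] for t in restored["tokens"] if t["pos"] != "."]
print(lemmas)
```

The record also hints at the difference between lemmatization and stemming: the lemma of "were" is "be", a mapping that suffix-stripping alone cannot produce.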

Section 3. Fun things to do with annotated text (30 min)

  • Exercise: How to apply TF-IDF
  • Demo: Semantic similarity and approximated Jaccard measure with MinHash
  • Demo: PyTextRank to extract key phrases
  • Q&A
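
For readers curious how MinHash can approximate a Jaccard measure, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the demo code: the helper names, the use of three-word shingles, the seeded `blake2b` hashing, and the sample sentences are all choices made for this example. A production setup would more likely use a dedicated library.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(item, seed):
    """Deterministic 64-bit hash of an item; each seed acts as a new hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value per seed. Assumes items is non-empty."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Made-up sample sentences for illustration.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

set_a, set_b = shingles(doc_a), shingles(doc_b)
exact = jaccard(set_a, set_b)
approx = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print("exact:", exact, "estimated:", approx)
```

The payoff is that each document is reduced to a fixed-length signature, so comparing two documents costs the same no matter how long they are; more hash functions give a tighter estimate at the cost of a longer signature.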

Section 4. A preview of advanced topics (20 min)

  • Demo: Summarization
  • Demo: Vector Embedding
  • A look at using LSTM to generate film scripts
  • Q&A