Get Started with Natural Language Processing in Python
Practical techniques for preparing text for custom search, content recommenders, AI applications, and more.
Python provides a number of excellent packages for natural language processing (NLP), along with great ways to leverage the results. If you’re new to NLP, this course gives you initial hands-on work and the confidence to explore much further into deep learning with text, natural language generation, chatbots, and more. First, however, we’ll show you how to prepare text for parsing, extract key phrases, prepare text for indexing in search, and calculate similarity between documents.
Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining. That’s a great starting point for developing custom search, content recommenders, and even AI applications.
What you'll learn and how you can apply it
By the end of this live, online course, you’ll understand:
- How keyword analysis, n-grams, co-occurrence, stemming, and other techniques from a previous generation of NLP tools are no longer the best approaches to use
- That NLP does not have to require Big Data tooling and use of clusters; we’ll show practical applications on a laptop
- That NLP work leading into AI applications does not have to be either fully automated or something that requires a huge amount of manual work; we’ll demonstrate “human-in-the-loop” practices that make the best of both people skills and automation
- Benefits of using Python for NLP applications
- How statistical parsing works
- How resources such as WordNet enhance text mining
- How to extract more than just a list of keywords from a text
- How to summarize and compare a set of documents
- How deep learning gets used with natural language
And you’ll be able to:
- Prepare texts for parsing, e.g., how to handle difficult Unicode
- Parse sentences into annotated lists, structured as JSON output
- Perform keyword ranking using TF-IDF, while filtering stop words
- Calculate a Jaccard similarity measure to compare texts
- Leverage probabilistic data structures to perform the above more efficiently
- Use Jupyter notebooks for sharing all of the above within your team
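As a rough illustration of two of the skills listed above, here is a minimal pure-Python sketch of TF-IDF keyword ranking with stop-word filtering and of an exact Jaccard similarity between two texts. The tiny stop-word list, the whitespace tokenizer, and the function names (`tokenize`, `tf_idf`, `top_terms`, `jaccard`) are assumptions made for this example, not the course's own code or any particular library's API.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "are"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores, n=3):
    """Rank terms by descending score, breaking ties alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def jaccard(text_a, text_b):
    """Exact Jaccard similarity of the two token sets (0.0 if both are empty)."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]
scores = tf_idf(docs)
print(top_terms(scores[0]))
print(jaccard(docs[0], docs[1]))  # one shared token out of five distinct: 0.2
```

In this toy corpus, "sat" appears in two of the three documents, so it scores lower than "cat" or "mat" in the first document. That is the effect TF-IDF is meant to capture: terms that are frequent everywhere are less useful as keywords.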
This training course is for you because...
- You are a Python programmer and need to learn how to use available NLP packages
- You are a data scientist with some Python experience and need to leverage NLP and text mining
- You are interested in chatbots, deep learning, and related AI work, and want to understand the basics for handling text data in those use cases
Prerequisites
- Some programming in Python (we’ll use Python 3) – for example, be comfortable with the material in Introduction to Python
- Basic understanding of HTML and the DOM structure for web pages – for example, be comfortable with the material in Modern Web Development with HTML5 and CSS
- Access to a computer with a browser
To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:
- Navigate to https://notebook.oreilly-jupyterhub.com (this is the link to the test site)
- Sign in with your Safari credentials
- Click "start my server"
- Click on "notebook .ipynb"
- Run each of the code cells: click the cell, then either press Shift+Return or click the triangle in the top menu
There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, this probably means that your firewall is blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow websocket connections.
Review the Probabilistic Data Structures in Python tutorial (~30 min)
Slides will include many links for further study about particular topics, including a GitHub repo for the code used in the Jupyter notebooks
All of the coding exercises in the course will be hosted on JupyterHub, and we'll send the URL out at the start of class. The exercises are purely browser-based, with no installations required.
About your instructor
Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in instructional design, machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise. He’s the author of Just Enough Math, Building Data Science Teams, and Enterprise Data Workflows with Cascading.
The timeframes are only estimates and may vary according to how the class is progressing.
Section 1. Getting the text… (30min)
- Why use Python for NLP?
- Jump into coding: use BeautifulSoup to extract from HTML
- Why you need to be concerned about Unicode
- Applications where NLP matters: search, recommenders, support, etc.
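To give a sense of why Unicode needs attention, here is a small sketch that uses only the standard library's `unicodedata` module rather than BeautifulSoup. Text extracted from web pages often contains curly quotes, non-breaking spaces, ligatures, and decomposed accents that can trip up later parsing steps. The helper name `clean_text` and the choice of which punctuation to map to ASCII are assumptions made for illustration.

```python
import unicodedata

# Typographic punctuation that NFKC leaves untouched, mapped to ASCII.
# Which characters to replace is a judgment call for each application.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def clean_text(raw):
    """Normalize Unicode, simplify punctuation, and collapse whitespace."""
    # NFKC folds compatibility characters (ligatures, non-breaking spaces,
    # the single-character ellipsis) and composes decomposed accents.
    text = unicodedata.normalize("NFKC", raw)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


sample = "\u201cNai\u0308ve\u201d  text\u00a0with \ufb01ne print\u2026"
print(clean_text(sample))
```

Accented letters are kept (and composed into single code points) rather than stripped, since removing them would change the words themselves.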
Section 2. Statistical parsing and annotation (40min)
- How statistical parsing works
- Lemmatization versus stemming
- Intro to using TextBlob, WordNet, etc.
- Exercise: Launching a Jupyter notebook
- Exercise: Splitting sentences and PoS tagging
- Exercise: Noun-phrase chunking
- Exercise: Named entity resolution
- Exercise: Storing annotated text as JSON files
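As a rough picture of the last exercise in this section, the sketch below splits text into sentences, tokenizes each one, and stores the result as JSON. It is deliberately naive: the regex splitter and tokenizer stand in for a real statistical parser, and the record layout (`id`, `text`, `tokens`) is a hypothetical format chosen for this example, not the one used in the course notebooks. A real pipeline would also attach part-of-speech tags and other annotations from a library such as TextBlob.

```python
import json
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def annotate(text):
    """Return a list of JSON-serializable records, one per sentence."""
    records = []
    for idx, sentence in enumerate(split_sentences(text)):
        # Words and individual punctuation marks become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        records.append({"id": idx, "text": sentence, "tokens": tokens})
    return records


doc = "Python has great NLP packages. Which one should you try first?"
annotated = annotate(doc)

# ensure_ascii=False keeps any non-ASCII characters readable in the output.
serialized = json.dumps(annotated, indent=2, ensure_ascii=False)
restored = json.loads(serialized)
print(serialized)
```

Keeping each sentence as a plain dictionary makes it easy to add more fields later and to reload the annotations in another notebook without re-parsing the text.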
Section 3. Fun things to do with annotated text (30min)
- Exercise: How to apply TF-IDF
- Demo: Semantic similarity and approximated Jaccard measure with MinHash
- Demo: PyTextRank to extract key phrases
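The idea behind approximating a Jaccard measure with MinHash can be sketched in a few lines of standard-library Python. For each of many hash functions, keep the minimum hash value over a set's items; the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets. The seeded SHA-1 hashing scheme, the signature length, and the function names here are assumptions for illustration, and the course demo may well use a dedicated library instead.

```python
import hashlib


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of a string, varied by an integer seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seed, keep the minimum hash value over all items."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [
        min(seeded_hash(item, seed) for item in items)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


words_a = set("the quick brown fox jumps over the lazy dog".split())
words_b = set("the quick brown cat jumps over the sleepy dog".split())

exact = len(words_a & words_b) / len(words_a | words_b)
approx = estimate_jaccard(minhash_signature(words_a), minhash_signature(words_b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate gets tighter as `num_hashes` grows. The payoff is that signatures have a fixed size, so large documents can be compared without keeping their full token sets in memory.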
Section 4. A preview of advanced topics (20min)
- Demo: Summarization
- Demo: Vector Embedding
- A look at using LSTM to generate film scripts