Chapter 9. Parsing Specific Data Types
In this chapter, we will cover the following recipes:
- Parsing dates and times with dateutil
- Timezone lookup and conversion
- Extracting URLs from HTML with lxml
- Cleaning and stripping HTML
- Converting HTML entities with BeautifulSoup
- Detecting and converting character encodings
Introduction
This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries to accomplish this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to NLTK:
- dateutil provides datetime parsing and timezone conversion
- lxml and BeautifulSoup can parse, clean, and convert HTML
- charade and UnicodeDammit ...
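To give a feel for what these libraries save us, here is a minimal sketch of dateutil's two roles named above: parsing a date string and converting between timezones. The sample date string and the fixed five-hour offset (labeled "UTC-5") are assumptions chosen for illustration, and the sketch assumes the python-dateutil package is installed.

```python
from dateutil import parser, tz

# Parse a date string without writing a format string or a regular expression.
# The input string is an illustrative assumption.
dt = parser.parse("2012-03-04 10:36:28")

# The parsed datetime is naive (no timezone), so attach UTC explicitly.
utc_dt = dt.replace(tzinfo=tz.tzutc())

# Convert to a fixed offset five hours behind UTC.
# tzoffset takes a display name and an offset in seconds.
five_behind = tz.tzoffset("UTC-5", -5 * 3600)
local_dt = utc_dt.astimezone(five_behind)

print(dt)
print(utc_dt)
print(local_dt)
```

A fixed offset is used here instead of a named zone so the sketch does not depend on the timezone database available on a given system.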