Chapter 9. Parsing Specific Data Types

In this chapter, we will cover the following recipes:

Parsing dates and times with dateutil
Timezone lookup and conversion
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings

Introduction

This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Several libraries already handle these tasks well, so we don't have to resort to tricky, overly complicated regular expressions. These libraries can be great complements to NLTK:

dateutil provides datetime parsing and timezone conversion
lxml and BeautifulSoup can parse, clean, and convert HTML
charade and UnicodeDammit ...
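
As a rough illustration of the first item in the list above, the following minimal sketch parses a free-form date string with dateutil and converts it to UTC. The sample string and the fixed UTC-5 offset labelled "EST" are assumptions chosen for the example, and the sketch assumes the python-dateutil package is installed; the recipes in this chapter may approach the task differently.

```python
from datetime import datetime
from dateutil import parser, tz

# Parse a free-form date string; no regular expression is needed.
naive = parser.parse("Thu Sep 25 10:36:28 2003")

# Attach a fixed offset (assumed here: UTC-5, labelled "EST") to make
# the datetime timezone-aware, then convert it to UTC.
est = tz.tzoffset("EST", -5 * 3600)
aware = naive.replace(tzinfo=est)
utc = aware.astimezone(tz.tzutc())

print(naive)
print(utc)
```

A fixed offset is used instead of a named zone so that the sketch does not depend on the timezone database available on the system.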

Python 3 Text Processing with NLTK 3 Cookbook by Jacob Perkins