Chapter 9. Parsing Specific Data Types

In this chapter, we will cover the following recipes:

  • Parsing dates and times with dateutil
  • Timezone lookup and conversion
  • Extracting URLs from HTML with lxml
  • Cleaning and stripping HTML
  • Converting HTML entities with BeautifulSoup
  • Detecting and converting character encodings

Introduction

This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Fortunately, several libraries already handle these tasks well, so we don't have to resort to tricky, overly complicated regular expressions. These libraries can be great complements to NLTK:

  • dateutil provides datetime parsing and timezone conversion
  • lxml and BeautifulSoup can parse, clean, and convert HTML
  • charade and UnicodeDammit ...
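
As a quick taste of the first item in the list above, here is a minimal sketch of datetime parsing and timezone conversion with dateutil. It assumes the python-dateutil package is installed. The sample timestamp and the fixed UTC-5 offset labeled "EST" are illustrative choices, not values taken from the recipes.

```python
from datetime import datetime
from dateutil import parser, tz

# Parse a free-form timestamp without writing a regular expression.
# The input string is a made-up example.
naive = parser.parse("Thu Sep 25 10:36:28 2003")

# The parsed value carries no timezone, so mark it as UTC explicitly,
# then convert it to a fixed UTC-5 offset (labeled "EST" for illustration).
utc_dt = naive.replace(tzinfo=tz.tzutc())
offset_dt = utc_dt.astimezone(tz.tzoffset("EST", -5 * 3600))

print(naive)                  # naive datetime, no tzinfo
print(offset_dt.isoformat())  # timezone-aware, shifted by five hours
```

A fixed offset is used here instead of a named zone so the sketch does not depend on the timezone database available on the system.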
