Clean Data by Megan Squire

Example project – Extracting data from e-mail and web forums

The Django IRC logs project was fairly simple. It was designed to show the differences between three solid techniques commonly used to extract clean data from HTML pages. The data we extracted included the line number, the username, and the IRC chat message, all of which were easy to find and required almost no additional cleaning. In this new example project, we will consider a case that is conceptually similar but requires us to extend the idea of data extraction beyond HTML to two other types of semi-structured text found on the Web: e-mail messages hosted on the Web and web-based discussion forums.

The background of the project

I was recently working on ...
