Clean Data by Megan Squire

Method two – Python and BeautifulSoup

Because regular expressions have limitations, we need more tools in our data cleaning toolkit. Here, we describe how to extract data from HTML pages using BeautifulSoup, a Python library that works on a parse tree of the page rather than on raw text patterns.
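To make the parse-tree idea concrete, here is a minimal sketch. The HTML fragment, its tag layout, and the class names user and msg are invented for illustration; they are assumptions and not the format of the Django IRC file used in this chapter. The sketch also assumes that BeautifulSoup 4 is already installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this sketch
# (it is NOT the Django IRC log format).
html = """
<html><body>
  <ul id="log">
    <li><span class="user">alice</span> <span class="msg">hello</span></li>
    <li><span class="user">bob</span> <span class="msg">hi there</span></li>
  </ul>
</body></html>
"""

# "html.parser" is the parser that ships with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

rows = []
# Walk the tree by element name and class instead of matching raw text.
for li in soup.find_all("li"):
    user = li.find("span", class_="user").get_text()
    msg = li.find("span", class_="msg").get_text()
    rows.append((user, msg))

print(rows)
```

Because the lookups follow the document structure, changes in whitespace or attribute order that could break a regular expression would not affect them.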

Step one – find and save a file for experimenting

For this step, we will use the same file as in Method one: the file from the Django IRC channel. We will search for the same three items, which makes it easy to compare the two methods.

Step two – install BeautifulSoup

BeautifulSoup is currently at version 4, which works with both Python 2.7 and Python 3. Note that releases after 4.9.3 support Python 3 only.

Note

If you are using the Enthought Canopy Python environment, simply run pip install beautifulsoup4 ...
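Whichever environment you use, a quick check can confirm that the installation worked. The following is a minimal sketch; the helper name is_installed is hypothetical and introduced only for this example. Note that the package is installed as beautifulsoup4 but imported as bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    # Hypothetical helper for this sketch; find_spec returns None
    # when no module with the given name is available.
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("BeautifulSoup version:", bs4.__version__)
else:
    print("bs4 not found; install the beautifulsoup4 package first")
```

If the second message is printed, the package is not visible to the Python interpreter you are running, which can happen when pip installs into a different environment.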
