Because regular expressions have limitations, we will need more tools in our data cleaning toolkit. Here, we describe how to extract data from HTML pages using BeautifulSoup, a Python library that works from a parse tree of the document.
For this step, we will use the same file as we did for Method 1: the file from the Django IRC channel. We will search for the same three items, which will make it easy to compare the two methods.
BeautifulSoup is currently in version 4. This version works with both Python 2.7 and Python 3.
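To give a sense of what working from a parse tree looks like, here is a minimal sketch. It assumes BeautifulSoup 4 is already installed, and it uses a small HTML fragment invented for illustration rather than the Django IRC file. The tag names, class names, and variable names are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for illustration; it is not the
# Django IRC log file used in this chapter.
html = """
<html><body>
<h1>Chat log</h1>
<p class="line"><span class="user">alice</span> hello</p>
<p class="line"><span class="user">bob</span> hi there</p>
</body></html>
"""

# Build the parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree by tag name and class instead of matching text patterns.
heading = soup.h1.get_text()
users = [span.get_text() for span in soup.find_all("span", class_="user")]

print(heading)  # Chat log
print(users)    # ['alice', 'bob']
```

Unlike a regular expression, this approach asks for elements by their position and role in the document structure, so it does not depend on the exact spacing or ordering of characters in the raw HTML.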
If you are using the Enthought Canopy Python environment, simply run
pip install beautifulsoup4 ...