Web Scraping with Python by Ryan Mitchell

Chapter 2. Advanced HTML Parsing

When Michelangelo was asked how he could sculpt a work of art as masterful as his David, he is famously reported to have said: “It is easy. You just chip away the stone that doesn’t look like David.”

Although web scraping is unlike marble sculpting in most other respects, we must take a similar attitude when it comes to extracting the information we’re seeking from complicated web pages. There are many techniques to chip away the content that doesn’t look like what we’re searching for, until we arrive at the information we want. In this chapter, we’ll take a look at parsing complicated HTML pages in order to extract only the information we’re looking for.

You Don’t Always Need a Hammer

It can be tempting, when faced with a Gordian Knot of tags, to dive right in and use multiline statements to try to extract your information. However, keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!

Let’s say you have some target content. Maybe it’s a name, statistic, or block of text. Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML attributes to be found. Let’s say you dive right in and write something like the following line to attempt extraction:

bsObj.findAll("table")[4].findAll("tr")[2].find(
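To make the fragility concrete, here is a minimal sketch of how a purely positional lookup can break when a page's layout shifts. It is an illustration under stated assumptions, not the chapter's own code: the two page snippets, the `price` class, and the helper names `by_position` and `by_attribute` are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for BeautifulSoup so the example runs without extra installs. The same reasoning should apply to a chain of indexed `findAll` calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the value we want sits in the second table, third row.
ORIGINAL = """
<html><body>
  <table><tr><td>nav</td></tr></table>
  <table>
    <tr><td>Name</td></tr>
    <tr><td>Units</td></tr>
    <tr><td class="price">$15.00</td></tr>
  </table>
</body></html>
"""

# The same hypothetical page after a redesign adds a banner table at the top.
REDESIGNED = ORIGINAL.replace(
    "<body>", "<body><table><tr><td>banner</td></tr></table>", 1
)


def by_position(markup):
    """Positional lookup: second table, third row, first cell."""
    root = ET.fromstring(markup.strip())
    tables = root.findall(".//table")
    return tables[1].findall("tr")[2].find("td").text


def by_attribute(markup):
    """Attribute lookup: the first cell with class="price", wherever it is."""
    root = ET.fromstring(markup.strip())
    return root.find(".//td[@class='price']").text


if __name__ == "__main__":
    print(by_position(ORIGINAL))
    print(by_attribute(REDESIGNED))
    try:
        by_position(REDESIGNED)
    except IndexError:
        # The indices now point at the one-row nav table.
        print("positional lookup broke after the redesign")
```

In this sketch the attribute-based lookup keeps working after the extra table appears, while the positional one raises an `IndexError`. A positional chain can fail more quietly too: if the table it lands on happens to have enough rows, it returns the wrong value without any error.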
