The BeautifulSoup Extension

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) lets you parse HTML that may be badly formed, using simple heuristics to compensate for likely HTML brokenness (it succeeds at this difficult task surprisingly often). Module BeautifulSoup supplies a class, also named BeautifulSoup, which you instantiate with either a file-like object (which is read to get the HTML text to parse) or a string (which is the text to parse). The module also supplies other classes (BeautifulStoneSoup and ICantBelieveItsBeautifulSoup) that are quite similar, but suited to slightly different parsing tasks. An instance b of class BeautifulSoup supplies many attributes and methods that ease the task of searching for information in the parsed HTML input. These return instances of classes Tag and NavigableText, which in turn let you keep navigating or dig for more information.

Parsing HTML with BeautifulSoup

The following example uses BeautifulSoup to perform the same task as previous examples: fetch a page from the Web with urllib, parse it, and output the hyperlinks:

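import urllib, urlparse, BeautifulSoup

# Fetch the page and parse it; the file-like object is passed directly
f = urllib.urlopen('http://www.python.org/index.html')
b = BeautifulSoup.BeautifulSoup(f)

seen = set()
for anchor in b.fetch('a'):
    url = anchor.get('href')
    # Skip anchors with no href and URLs already seen
    if url is None or url in seen: continue
    seen.add(url)
    pieces = urlparse.urlparse(url)
    # Print only absolute URLs whose scheme is http
    if pieces[0] == 'http':
        print urlparse.urlunparse(pieces)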

The example calls the fetch method of class BeautifulSoup.BeautifulSoup ...
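As a rough sketch of the filtering step in the loop above, the following function applies the same three checks (missing href, duplicate URL, scheme other than http) to a plain list of values, so it can run without a network connection or BeautifulSoup. The function name unique_http_links and the sample data are hypothetical and not part of the original example. The try/except import is an assumption that lets the sketch run on either Python 2 or Python 3.

```python
# Sketch only: isolates the filtering logic of the example so it can be
# tried without fetching a page or installing BeautifulSoup.
try:
    from urlparse import urlparse, urlunparse      # Python 2
except ImportError:
    from urllib.parse import urlparse, urlunparse  # Python 3


def unique_http_links(hrefs):
    """Return each distinct http URL from hrefs, in first-seen order.

    hrefs stands in for the values that anchor.get('href') would return,
    so it may contain None for anchors that have no href attribute.
    """
    seen = set()
    result = []
    for url in hrefs:
        # Same test as the example: skip missing and repeated URLs
        if url is None or url in seen:
            continue
        seen.add(url)
        pieces = urlparse(url)
        if pieces[0] == 'http':   # pieces[0] is the URL scheme
            result.append(urlunparse(pieces))
    return result


# Hypothetical sample data, not taken from any real page
sample_hrefs = [
    'http://www.python.org/about/',
    None,
    '/download/',
    'http://www.python.org/about/',
    'mailto:someone@example.com',
    'https://www.python.org/psf/',
]
links = unique_http_links(sample_hrefs)
```

With this sample input, only the first URL survives: the relative link has an empty scheme, and the mailto and https links have schemes other than 'http', so the scheme test drops all three, just as the original loop would.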
