Name

unknown_starttag

Synopsis

s.unknown_starttag(tag, attributes)

Called to process tags for which X (the SGMLParser subclass that s is an instance of) supplies no specific method. tag is the tag string, lowercased. attributes is a list of pairs (name, value), where name is the attribute's name, lowercased, and value is the attribute's value, processed to resolve entity references and character references and to remove surrounding quotes. SGMLParser's implementation of unknown_starttag does nothing.
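
As a rough illustration of this dispatch rule, the sketch below emulates it on top of Python 3's html.parser module, since sgmllib itself was removed in Python 3. It is a stand-in under stated assumptions, not sgmllib's implementation: the class name DispatchingParser, the unknown and links attributes, and the sample markup are all hypothetical, and only do_ methods are looked up.

```python
from html.parser import HTMLParser

class DispatchingParser(HTMLParser):
    """Hypothetical emulation of per-tag dispatch with an unknown-tag fallback."""

    def __init__(self):
        super().__init__()
        self.unknown = []   # (tag, attributes) pairs that had no do_<tag> method
        self.links = []     # href values collected by do_a

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag lowercased and attrs as (name, value) pairs
        method = getattr(self, 'do_' + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attributes):
        # Fallback for tags without a specific handler
        self.unknown.append((tag, attributes))

    def do_a(self, attributes):
        # Specific handler: <a> tags never reach unknown_starttag
        self.links.append(dict(attributes).get('href'))

p = DispatchingParser()
p.feed('<A HREF="http://example.com/?a=1&amp;b=2">x</A>'
       '<Blink Class=old>y</Blink>')
p.close()
```

Here the <a> tag is routed to do_a, while <blink> has no do_blink method and so falls through to unknown_starttag with its lowercased name and attribute pairs.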

The following example uses sgmllib for a typical HTML-related task: fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks. The example uses urlparse to check the page's links, and outputs only links whose URLs have an explicit scheme of 'http'.

# Python 2 only: sgmllib was removed in Python 3.
import sgmllib, urllib, urlparse

class LinksParser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.seen = {}          # URLs already handled, used as a set
    def do_a(self, attributes):
        # Called by the superclass for each <a> start tag
        for name, value in attributes:
            if name == 'href' and value not in self.seen:
                self.seen[value] = True
                pieces = urlparse.urlparse(value)
                # pieces[0] is the URL scheme; skip anything but http
                if pieces[0] != 'http': return
                print urlparse.urlunparse(pieces)
                return

p = LinksParser()
f = urllib.urlopen('http://www.python.org/index.html')
BUFSIZE = 8192
# Feed the page to the parser in chunks until EOF
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close()

Class LinksParser only needs to define method do_a. The superclass calls back to this method for all <a> tags, and the method loops on the attributes, looking for one named 'href', then works with the corresponding value (i.e., the relevant URL). ...
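
For readers on Python 3, where sgmllib no longer exists and urlparse lives in urllib.parse, the same filtering logic might be sketched as follows. This is an approximate port rather than the book's code: it substitutes html.parser for sgmllib, collects results in a hypothetical links list instead of printing from inside the handler, and parses a made-up SAMPLE string instead of fetching a page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, urlunparse

class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = set()   # href values already handled
        self.links = []     # http URLs, in order of first appearance

    def handle_starttag(self, tag, attrs):
        # html.parser has no per-tag do_a hook, so filter on the tag name here
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value not in self.seen:
                self.seen.add(value)
                pieces = urlparse(value)
                # Keep only URLs with an explicit http scheme
                if pieces.scheme == 'http':
                    self.links.append(urlunparse(pieces))
                return

# Made-up input standing in for a fetched page
SAMPLE = ('<a href="http://www.python.org/doc/">docs</a>'
          '<a href="/about/">about</a>'
          '<a href="mailto:someone@example.com">mail</a>'
          '<a href="http://www.python.org/doc/">docs again</a>')

p = LinksParser()
p.feed(SAMPLE)
p.close()
for link in p.links:
    print(link)
```

The relative link and the mailto link are dropped because their scheme is not 'http', and the repeated URL is reported only once because it is already in seen.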
