Name
unknown_starttag
Synopsis
s.unknown_starttag(tag, attributes)
Called to process tags for which X supplies no specific method. tag is the tag string, lowercased. attributes is a list of pairs (name, value), where name is each attribute’s name, lowercased, and value is the value, processed to resolve entity references and character references and to remove surrounding quotes. SGMLParser’s implementation of unknown_starttag does nothing.
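As a rough illustration of the (tag, attributes) calling convention described above, here is a minimal sketch. It makes one loud assumption: sgmllib exists only in Python 2, so the sketch swaps in Python 3's html.parser.HTMLParser, whose handle_starttag method receives a lowercased tag and a list of (name, value) pairs in much the same way. The class name TagLogger and the sample markup are hypothetical and not part of the original text.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs with
        # lowercased names, quotes stripped, and entity references resolved.
        self.tags.append((tag, attrs))

logger = TagLogger()
logger.feed('<P CLASS="intro">Hi <A HREF=\'x.html?a=1&amp;b=2\'>link</A></P>')
logger.close()

for tag, attrs in logger.tags:
    print(tag, attrs)
```

In Python 2 with sgmllib, the analogous catch-all would be an override of unknown_starttag with the same two arguments.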
The following example uses sgmllib for a typical HTML-related task: fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks. The example uses urlparse to check the page’s links, and outputs only links whose URLs have an explicit scheme of 'http'.
    import sgmllib, urllib, urlparse

    class LinksParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.seen = {}                    # URLs already handled
        def do_a(self, attributes):
            # called by the superclass for each <a> start tag
            for name, value in attributes:
                if name == 'href' and value not in self.seen:
                    self.seen[value] = True
                    pieces = urlparse.urlparse(value)
                    # pieces[0] is the URL's scheme
                    if pieces[0] != 'http': return
                    print urlparse.urlunparse(pieces)
                    return

    p = LinksParser()
    f = urllib.urlopen('http://www.python.org/index.html')
    BUFSIZE = 8192
    # feed the page to the parser one buffer at a time
    while True:
        data = f.read(BUFSIZE)
        if not data: break
        p.feed(data)
    p.close()
Class LinksParser only needs to define method do_a. The superclass calls back to this method for all <a> tags, and the method loops on the attributes, looking for one named 'href', then works with the corresponding value (i.e., the relevant URL). ...
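For readers on Python 3, where sgmllib is no longer available and urlparse lives in urllib.parse, the same link-filtering idea might be sketched as follows. This is an adaptation under stated assumptions, not the book's code: it swaps in html.parser.HTMLParser, collects links in a list instead of printing them, and parses a hypothetical in-memory sample string rather than fetching a page over the network. The names LinkCollector and sample are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, urlunparse

class LinkCollector(HTMLParser):
    """Hypothetical Python 3 counterpart of LinksParser."""

    def __init__(self):
        super().__init__()
        self.seen = set()    # href values already handled
        self.links = []      # URLs with an explicit 'http' scheme

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == 'href' and value and value not in self.seen:
                self.seen.add(value)
                pieces = urlparse(value)
                if pieces.scheme == 'http':
                    self.links.append(urlunparse(pieces))
                return

# Assumed sample markup standing in for a fetched page.
sample = (
    '<a href="http://www.python.org/doc/">Docs</a> '
    '<a href="/about/">About</a> '
    '<a href="http://www.python.org/doc/">Docs again</a> '
    '<a href="https://example.org/">Secure</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()

for link in collector.links:
    print(link)
```

Relative links and links with other schemes are skipped because their scheme is not 'http', and the repeated URL is reported only once because it is already recorded in seen.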