The sgmllib Module

The name of the sgmllib module is misleading: sgmllib parses only a tiny subset of SGML, but it is still a good way to get information from HTML files. sgmllib supplies one class, SGMLParser, which you subclass, overriding methods. The most frequently used methods of an instance s of your subclass X of SGMLParser are as follows.

close

s.close()

Tells the parser that there is no more input data. When X overrides close, s.close must call SGMLParser.close to ensure that buffered data is processed.
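
The delegation rule for close can be sketched as follows. This is only an illustration under stated assumptions: StandInParser is a hypothetical stand-in for SGMLParser, written so that the example runs without importing sgmllib (which is absent from Python 3), and its attributes pending, processed, and total_chars are invented for the sketch. The real class buffers and processes data in its own way.

```python
class StandInParser(object):
    """Hypothetical stand-in for SGMLParser: holds fed text until close()."""

    def __init__(self):
        self.pending = ''     # text fed but not yet processed
        self.processed = []   # text that has been processed

    def feed(self, data):
        self.pending += data

    def close(self):
        # Flush whatever is still buffered.
        if self.pending:
            self.processed.append(self.pending)
            self.pending = ''


class X(StandInParser):
    def close(self):
        # Delegate first, so that buffered data is processed ...
        StandInParser.close(self)
        # ... then do the subclass's own end-of-input work.
        self.total_chars = sum(len(chunk) for chunk in self.processed)


s = X()
s.feed('<p>hel')
s.feed('lo</p>')
s.close()
```

If X.close skipped the call to the base class, the text passed to feed would still be sitting in pending and total_chars would be 0.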

do_tag

s.do_tag(attributes)

X supplies a method with such a name for each tag that has no corresponding end tag and that X wants to process. tag must be lowercase in the method name, but can be in any case in the parsed text (the SGML standard, like HTML, is case-insensitive, in contrast to XML and XHTML, which are case-sensitive). SGMLParser’s handle_starttag method calls do_tag when appropriate. attributes is a list of pairs (name, value), where name is an attribute’s name, lowercased, and value is the value, processed to resolve entity and character references and remove surrounding quotes.

end_tag

s.end_tag()

X supplies a method with such a name for each tag whose end tag X wants to process. tag must be lowercase in the method name, but can be in any case in the parsed text. X must also supply a method named start_tag; otherwise, end_tag is ignored. SGMLParser’s handle_endtag method calls end_tag when appropriate.
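
The naming convention shared by do_tag, start_tag, and end_tag can be modeled with a small, self-contained sketch. TinyDispatcher, start_tag_seen, end_tag_seen, open_tags, and events are all hypothetical names invented for this illustration. The sketch does not import sgmllib and is not its implementation, which also tokenizes the input and handles nesting more carefully. It shows only the rules stated above: method names use the lowercased tag, a do_tag method expects no end tag, and end_tag is ignored unless X also supplies start_tag.

```python
class TinyDispatcher(object):
    """Hypothetical, simplified model of name-based tag dispatch."""

    def __init__(self):
        self.open_tags = []   # tags opened through a start_<tag> method

    def start_tag_seen(self, tag, attributes):
        tag = tag.lower()                     # method names are lowercase
        start = getattr(self, 'start_' + tag, None)
        if start is not None:
            self.open_tags.append(tag)        # an end tag is now expected
            start(attributes)
            return
        do = getattr(self, 'do_' + tag, None)
        if do is not None:
            do(attributes)                    # no end tag is expected

    def end_tag_seen(self, tag):
        tag = tag.lower()
        if tag not in self.open_tags:
            return                            # no start_<tag> ran: ignore
        self.open_tags.remove(tag)
        end = getattr(self, 'end_' + tag, None)
        if end is not None:
            end()


class X(TinyDispatcher):
    def __init__(self):
        TinyDispatcher.__init__(self)
        self.events = []

    def do_br(self, attributes):
        self.events.append(('do_br', attributes))

    def start_a(self, attributes):
        # attributes is a list of (name, value) pairs
        self.events.append(('start_a', dict(attributes).get('href')))

    def end_a(self):
        self.events.append(('end_a',))

    def end_b(self):
        # Never called in this sketch: X has no start_b method.
        self.events.append(('end_b',))


s = X()
s.start_tag_seen('BR', [])
s.start_tag_seen('A', [('href', 'http://example.com/')])
s.end_tag_seen('a')
s.end_tag_seen('B')
```

After these calls, s.events records do_br, start_a, and end_a, in that order, and nothing for the unmatched B end tag.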

feed

s.feed(data)

Passes to the parser some of the text being parsed. The ...
