The HTMLParser Module

Module HTMLParser supplies one class, HTMLParser, that you subclass to override and add methods. HTMLParser.HTMLParser is similar to sgmllib.SGMLParser, but is simpler and able to parse XHTML as well. The main differences between HTMLParser and SGMLParser are the following:

  • HTMLParser does not call back to methods named do_tag, start_tag, and end_tag. To process start tags and end tags, your subclass X of HTMLParser must override methods handle_starttag and/or handle_endtag and check explicitly for the tags it wants to process.

  • HTMLParser does not keep track of, nor check, tag nesting in any way.

  • HTMLParser does nothing, by default, to resolve character and entity references. Your subclass X of HTMLParser must override methods handle_charref and/or handle_entityref if it needs to perform processing of such references.
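
The points above can be illustrated with a minimal sketch. It makes several assumptions that are not part of the text: it targets Python 3, where the module is named html.parser rather than HTMLParser, and where the constructor must be passed convert_charrefs=False for handle_charref and handle_entityref to be called at all. The class name LinkCollector, its attributes, and the sample markup are hypothetical, chosen only for illustration.

```python
# Assumption: Python 3. The module the text calls HTMLParser is html.parser here.
from html.parser import HTMLParser
from html.entities import name2codepoint


class LinkCollector(HTMLParser):
    """Hypothetical subclass: collects <a href> values and resolved text."""

    def __init__(self):
        # convert_charrefs=False leaves references unresolved, so the parser
        # calls handle_charref and handle_entityref instead of handle_data.
        super().__init__(convert_charrefs=False)
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # No start_a method is ever called: check the tag name explicitly.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

    def handle_charref(self, name):
        # name is the reference body, e.g. "169" or "xA9" for &#169; / &#xA9;
        if name[0] in "xX":
            self.text.append(chr(int(name[1:], 16)))
        else:
            self.text.append(chr(int(name)))

    def handle_entityref(self, name):
        # Resolve known named entities; leave unknown ones as written.
        if name in name2codepoint:
            self.text.append(chr(name2codepoint[name]))
        else:
            self.text.append("&%s;" % name)


parser = LinkCollector()
parser.feed('<p>Fish &amp; chips &#169; <a href="http://example.com/">site</a></p>')
parser.close()
print(parser.links)
print("".join(parser.text))
```

Since the parser does not track nesting, a subclass that cares about it would have to keep its own stack of open tags in handle_starttag and handle_endtag; the sketch omits that to stay short.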

The most frequently used methods of an instance h of a subclass X of HTMLParser are as follows.
