The htmllib Module

The htmllib module supplies a class named HTMLParser that subclasses SGMLParser and defines start_tag, do_tag, and end_tag methods for tags defined in HTML 2.0. HTMLParser implements and overrides methods in terms of calls to methods of a formatter object, covered later in this chapter. You can subclass HTMLParser to add or override methods. In addition to the start_tag, do_tag, and end_tag methods, an instance h of HTMLParser supplies the following attributes and methods.

Reference Section

The formatter Module

The formatter module defines formatter and writer classes. You instantiate a formatter by passing a writer instance to the formatter class, and you then pass the formatter instance to class HTMLParser of module htmllib. You can define your own formatters and writers by subclassing formatter’s classes and overriding methods appropriately, but I do not cover this advanced and rarely used possibility in this book. An application with special output requirements would typically define an appropriate writer, subclassing AbstractWriter and overriding all methods, and use class AbstractFormatter without needing to subclass it. Module formatter supplies the following classes.

The htmlentitydefs Module

The htmlentitydefs module supplies just one attribute, a dictionary named entitydefs that maps each entity defined in HTML 2.0 to the corresponding ...
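
As a minimal sketch of how the entitydefs dictionary can be consulted, the following looks up a few entity names. The helper name expand_entity is hypothetical, introduced only for illustration. The fallback import is also an assumption about the reader's environment: on Python 3 the same dictionary is found in html.entities rather than in htmlentitydefs.

```python
try:
    # Python 2: the module described in this section
    from htmlentitydefs import entitydefs
except ImportError:
    # Python 3: the same dictionary, under a different module name
    from html.entities import entitydefs


def expand_entity(name):
    """Return the entitydefs value for entity `name`, or None if unknown.

    Hypothetical helper, not part of the module itself.
    """
    return entitydefs.get(name)


print(expand_entity('amp'))           # entry for &amp;
print(expand_entity('lt'))            # entry for &lt;
print(expand_entity('nosuchentity'))  # unknown names yield None
```

Using dict.get here rather than indexing means that an unknown entity name yields None instead of raising KeyError, which is usually more convenient when processing arbitrary markup.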
