The htmllib Module
The htmllib module contains a tag-driven HTML parser, which sends data to a
formatting object. Example 5-9 uses this module. For more examples of how to
parse HTML files using this module, see the description of the formatter
module.
Example 5-9. Using the htmllib Module
File: htmllib-example-1.py
import htmllib
import formatter
import string

class Parser(htmllib.HTMLParser):
    # return a dictionary mapping anchor texts to lists
    # of associated hyperlinks

    def __init__(self, verbose=0):
        self.anchors = {}
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, f, verbose)

    def anchor_bgn(self, href, name, type):
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        text = string.strip(self.save_end())
        if self.anchor and text:
            self.anchors[text] = self.anchors.get(text, []) + [self.anchor]

file = open("samples/sample.htm")
html = file.read()
file.close()

p = Parser()
p.feed(html)
p.close()

for k, v in p.anchors.items():
    print k, "=>", v

print
link => ['http://www.python.org']
If you’re only out to parse an HTML file and not render it to an
output device, it’s usually easier to use the sgmllib
module instead.
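Both htmllib and sgmllib exist only in Python 2; they were removed in Python 3. For readers on Python 3, the following is a minimal sketch of the same anchor-collecting idea using the standard html.parser module. It is not part of the original example: the AnchorParser class name, its private attributes, and the inline HTML string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collect a dictionary mapping anchor texts to lists of hyperlinks."""

    def __init__(self):
        super().__init__()
        self.anchors = {}
        self._href = None   # href of the <a> element being read, if any
        self._chunks = []   # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # only keep text that appears inside an <a href=...> element
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._chunks).strip()
            if self._href and text:
                self.anchors.setdefault(text, []).append(self._href)
            self._href = None


# hypothetical input, standing in for the contents of samples/sample.htm
html = '<p>A <a href="http://www.python.org">link</a> in a page.</p>'

p = AnchorParser()
p.feed(html)
p.close()

for text, links in p.anchors.items():
    print(text, "=>", links)
```

No formatter object is involved here, which matches the advice above: when the goal is only to extract data rather than render the page, a plain event-driven parser is enough.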