Beyond string objects and regular expressions, Python ships with support for parsing some specific and commonly used types of formatted text. In particular, it provides precoded parsers for XML and HTML, which we can deploy and customize for our text processing goals.
In the XML department, Python includes parsing support in its standard library and plays host to a prolific XML special-interest group. XML (for eXtensible Markup Language) is a tag-based markup language for describing many kinds of structured data. Among other things, it has been adopted as a standard representation for database and Internet content in many contexts. As an object-oriented scripting language, Python mixes remarkably well with XML’s core notion of structured document interchange.
XML is based upon a tag syntax familiar to web page writers, used to describe and package data. The xml module package in Python’s standard library includes tools for parsing this data from XML text, with both the SAX and the DOM standard parsing models, as well as the Python-specific ElementTree package. Although regular expressions can sometimes extract information from XML documents, too, they can be easily misled by unexpected text, and don’t directly support the notion of arbitrarily nested XML constructs (more on this limitation later when we explore languages in general).
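To make the contrast concrete, here is a minimal sketch that pulls titles out of a small XML document twice: once with a regular expression and once with ElementTree. The sample document, its tag names, and the variable names are invented for illustration only. The pattern matches raw text, so it also picks up a title that has been commented out, while the parser works from the document's structure and skips it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration
XML_TEXT = """<catalog>
  <book isbn="111">
    <title>Learning Python</title>
    <!-- <title>Commented-out title</title> -->
  </book>
  <book isbn="222">
    <title>Programming Python</title>
  </book>
</catalog>"""

# A regular expression sees only characters, so it also matches
# the title that sits inside the XML comment
regex_titles = re.findall(r"<title>(.*?)</title>", XML_TEXT)

# ElementTree builds a tree from the document structure; its default
# parser discards comments, so only the real <title> elements remain
root = ET.fromstring(XML_TEXT)
tree_titles = [elem.text for elem in root.iter("title")]

print(regex_titles)
print(tree_titles)
```

The regular expression could be tightened to handle this particular case, but each such patch tends to cover one more special case rather than the general structure of the document.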
In short, SAX parsers provide a subclass with methods called during the parsing operation, and DOM parsers are given access ...
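As a rough sketch of the two standard models, the following code extracts the same titles with each. The sample document and the TitleHandler class are hypothetical names chosen for this illustration. The SAX version subclasses xml.sax.ContentHandler and collects text in callback methods as the parse proceeds, while the DOM version, using the standard library's xml.dom.minidom, parses the whole document into an object tree first and then queries it.

```python
import xml.sax
from xml.dom.minidom import parseString

# Hypothetical sample document, made up for this illustration
XML_TEXT = (
    "<catalog>"
    "<book isbn='111'><title>Learning Python</title></book>"
    "<book isbn='222'><title>Programming Python</title></book>"
    "</catalog>"
)


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical SAX handler: its methods are called during the parse."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.chunks))
            self.in_title = False


# SAX: the parser drives the handler's callbacks as it reads the text
handler = TitleHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_titles = handler.titles

# DOM: parse everything into a document object, then walk or query it
document = parseString(XML_TEXT)
dom_titles = [node.firstChild.data
              for node in document.getElementsByTagName("title")]

print(sax_titles)
print(dom_titles)
```

The SAX handler has to track its own state between callbacks, whereas the DOM code can ask the finished tree for what it needs, at the cost of holding the whole document in memory.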