Chapter 5. File Formats

Overview

This chapter describes several standard library modules that parse different file formats.

Markup Languages

Python comes with extensive support for the Extensible Markup Language (XML) and Hypertext Markup Language (HTML) file formats. Python also provides basic support for Standard Generalized Markup Language (SGML).

All these formats share the same basic structure because both HTML and XML are derived from SGML. Each document contains a mix of start tags, end tags, plain text (also called character data), and entity references, as shown in the following:

<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text.  The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>

In the previous example, <document>, <header>, and <body> are start tags. For each start tag, there’s a corresponding end tag that looks similar, but has a slash before the tag name. The start tag can also contain one or more attributes, like the name attribute in this example.

Everything between a start tag and its matching end tag is called an element. In the previous example, the document element contains two other elements: header and body.
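The terminology above can be explored interactively. The following is a minimal sketch, assuming a Python 3 interpreter and the standard xml.etree.ElementTree module, which is only one of several parsers that could be used here. The SAMPLE variable name is introduced purely for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from above, held in a string for convenience.
# SAMPLE is an illustrative name, not part of any library.
SAMPLE = """\
<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text.  The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>
"""

# fromstring() parses the text and returns the outermost element.
root = ET.fromstring(SAMPLE)

# Attributes from the start tag are exposed as a dictionary.
print(root.tag, root.attrib)

# Iterating over an element yields the elements nested directly inside it.
for child in root:
    # Collapse runs of whitespace so the text prints on one line.
    print(child.tag, "->", " ".join(child.text.split()))
```

Here the document element reports its name attribute, and the loop visits the header and body elements in document order. The parser also decodes entity references, so the body text contains ordinary quotation marks.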

Finally, &quot; is a character entity. Entities are used to represent reserved characters in the text sections. In this case, it stands for a double quotation mark ("). Every entity starts with an ampersand (&), so a literal ampersand must itself be written as an entity, &amp;. Another common entity is &lt; for less than (<).
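Entity encoding and decoding can be tried out with the escape and unescape helpers in the standard xml.sax.saxutils module. This is a rough sketch rather than a complete treatment of entities. The raw, encoded, and decoded names are illustrative, and the extra mapping for the quotation mark is needed because escape only handles &, <, and > by default.

```python
from xml.sax.saxutils import escape, unescape

# Text containing characters that are reserved in markup.
raw = 'plain text ("character data") & <tags>'

# escape() always replaces &, <, and >. Other characters, such as the
# double quote, must be requested through the optional mapping.
encoded = escape(raw, {'"': "&quot;"})
print(encoded)

# unescape() reverses the process, given the inverse mapping.
decoded = unescape(encoded, {"&quot;": '"'})
print(decoded == raw)
```

The ampersand is replaced first when encoding and restored last when decoding, so entities produced by the other replacements are not processed twice.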
