Chapter 5. File Formats
Overview
This chapter describes several modules used to parse different file formats.
Markup Languages
Python comes with extensive support for the Extensible Markup Language (XML) and Hypertext Markup Language (HTML) file formats. Python also provides basic support for Standard Generalized Markup Language (SGML).
All these formats share the same basic structure because both HTML and XML are derived from SGML. Each document contains a mix of start tags, end tags, plain text (also called character data), and entity references, as shown in the following:
<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text. The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>
In the previous example, <document>, <header>, and <body> are start tags. For each start tag, there’s a corresponding end tag that looks similar, but has a slash before the tag name. The start tag can also contain one or more attributes, like the name attribute in this example.
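To see start tags, end tags, and attributes as a parser reports them, here is a minimal sketch using the xml.parsers.expat module from the Python 3 standard library. The choice of module, the shortened SAMPLE string, and the list_tags helper are assumptions made for illustration; they are not taken from this chapter.

```python
import xml.parsers.expat

# A shortened version of the sample document (assumption: trimmed for brevity).
SAMPLE = """<document name="sample.xml">
<header>This is a header</header>
<body>This is the body text.</body>
</document>"""


def list_tags(text):
    """Return (kind, tag name, attributes) tuples in document order."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls these handlers once for every start tag and end tag it sees.
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name, {}))
    parser.Parse(text, True)  # True: this is the final chunk of input
    return events


events = list_tags(SAMPLE)
for kind, name, attrs in events:
    print(kind, name, attrs)
```

The first event reported is the document start tag, together with its name attribute as a dictionary; each start tag is later matched by an end event carrying the same tag name.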
Everything between a start tag and its matching end tag is called an element. In the previous example, the document element contains two other elements: header and body.
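The nesting of elements can be inspected directly. The following sketch assumes the xml.etree.ElementTree module from the Python 3 standard library and a shortened copy of the sample document; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened version of the sample document (assumption: trimmed for brevity).
SAMPLE = """<document name="sample.xml">
<header>This is a header</header>
<body>This is the body text.</body>
</document>"""

# fromstring returns the outermost element of the document.
root = ET.fromstring(SAMPLE)

# Iterating over an element yields the elements nested directly inside it.
child_tags = [child.tag for child in root]

print(root.tag, root.attrib)
print(child_tags)
print(root.find("header").text)
```

Here root is the document element, root.attrib holds its name attribute, and child_tags lists the header and body elements it contains.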
Finally, &quot; is a character entity. It is used to represent reserved characters in the text sections. In this case, it stands for a double quotation mark ("). Every entity starts with an ampersand (&), so the ampersand is itself a reserved character. Other common entities include &lt; for “less than” (<).
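As a rough illustration of how entities stand in for reserved characters, the sketch below uses escape and unescape from xml.sax.saxutils, plus xml.etree.ElementTree, all from the Python 3 standard library. The module choice and the raw sample string are assumptions for illustration, not part of this chapter.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Text containing reserved characters (hypothetical sample string).
raw = 'plain text ("character data") & 1 < 2'

# escape() always replaces &, <, and >; the extra mapping also replaces
# the double quotation mark with its entity.
escaped = escape(raw, {'"': "&quot;"})
print(escaped)

# unescape() reverses the substitution when given the inverse mapping.
restored = unescape(escaped, {"&quot;": '"'})
print(restored)

# An XML parser resolves the entities back to characters on its own.
body = ET.fromstring("<body>" + escaped + "</body>")
print(body.text)
```

The escaped string is safe to place between tags, and both unescape and the parser return the original characters.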