Chapter 8. Structured Text

In Chapter 6, we took a very brief look at the csv module that is used to read and write lines of tab- or comma-separated values, with each line corresponding to one item in the file. We’ve also looked at a variety of ways to scan files looking for certain patterns of data, including using str methods and regular expressions. Files that are in tab- or comma-separated values format, FASTA files, GenBank files, and many other file formats encountered in bioinformatics work are called flat files.[35] What is “flat” about them is that they are just text files: the data has no explicit structure beyond agreed-on conventions regarding special characters, blank lines, whitespace, etc. They can have introductory material before the data, other material after the data, several sets of data in one file, and so on.

The opposite of “flat” in this context is structured. A structured text file contains elements, each of which can have attributes and/or “sub” or child elements. There can be different kinds of elements, and in general there are rules specifying what attributes and children each kind of element can have. The linear approaches for processing text files that we’ve seen so far are inadequate for structured files, essentially because the files are two-dimensional. This chapter describes some ways to process structured files.

HTML

An obvious example of a structured file format is basic HTML. (We’ll ignore all the fancy stuff like JavaScript, frames, and so on.) ...

Get Bioinformatics Programming Using Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.