Chapter 22. Structured Text: HTML

Most documents on the web use HTML, the HyperText Markup Language. Markup is the insertion of special tokens, known as tags, in a text document, to structure the text. HTML is, in theory, an application of the large, general standard known as SGML, the Standard General Markup Language. In practice, many documents on the web use HTML in sloppy or incorrect ways. Browsers have evolved heuristics over the years to compensate for this, but even so, it still happens that a browser displays a wrongly marked-up web page in weird ways (don’t blame the browser, if it’s a modern one: 9 times out of 10, the blame is on the web page’s author).

HTML1 is not suitable for much more than presenting documents on a browser. Complete, precise extraction of the information in the document, working backward from what most often amounts to the document’s presentation, often turns out to be unfeasible. To tighten things up, HTML tried evolving into a more rigorous standard called XHTML. XHTML is similar to traditional HTML, but it is defined in terms of XML, and more precisely than HTML. You can handle well-formed XHTML with the tools covered in Chapter 23. However, as of this writing, XHTML does not appear to have enjoyed overwhelming success, getting scooped instead by the (non-XML) newest version, HTML5.

Despite the difficulties, it’s often possible to extract at least some useful information from HTML documents (a task known as screen-scraping, or just scraping ...

Get Python in a Nutshell, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.