Recipe 18-1: Parsing HTML
HTML is a very common markup language, but a lot of the HTML in circulation is poorly written, which makes parsing it reliably quite difficult. This recipe shows a structure that strips the tags (<a>, <li>, and so on) from the HTML. The downloader.sh script acts on the <a> tags by saving the linked URL to a file named after the anchor text. For example, the input <a href="http://www.example.com/">This is an example web site</a> creates a file called “This is an example web site” containing the index page of www.example.com.
The action that this recipe takes is not particularly important in itself; wget -Fi can do something very similar. The real subject of this script is stripping tags from the HTML input.
Some HTML terminology is used in this recipe; in the input <a href="/eg.shtml">example pages</a>, /eg.shtml is the link, and example pages is the anchor text. By default, the browser displays the anchor text in blue and underlined, and the link is the address of the page that is displayed when the anchor text is clicked.
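To make the two terms concrete, the following is a minimal sketch that splits a single anchor into its link and its anchor text using shell parameter expansion. The function names get_link and get_anchor_text are hypothetical and are not taken from downloader.sh, and the sketch assumes one well-formed anchor on a single line with a double-quoted href, which real-world HTML often does not guarantee.

```shell
#!/bin/sh
# Illustrative sketch only: get_link and get_anchor_text are hypothetical
# helpers, not part of downloader.sh. Assumes a single well-formed anchor
# with a double-quoted href attribute.

get_link() {
  tag=$1
  tag=${tag#*href=\"}    # drop everything up to and including href="
  link=${tag%%\"*}       # keep what precedes the next double quote
  printf '%s\n' "$link"
}

get_anchor_text() {
  tag=$1
  tag=${tag#*>}          # drop the opening <a ...> tag
  text=${tag%%<*}        # keep what precedes the closing </a>
  printf '%s\n' "$text"
}

sample='<a href="/eg.shtml">example pages</a>'
get_link "$sample"          # prints: /eg.shtml
get_anchor_text "$sample"   # prints: example pages
```

Any input that departs from these assumptions (single quotes, a missing href, nested tags inside the anchor text) gives the wrong answer, which is exactly the kind of fragility that a state-tracking approach tries to reduce.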
The recipe uses a very crude state machine to keep track of what position in the HTML input the script has reached. Without this, it would be necessary to make many more assumptions about the format of the input file.
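The general idea of a crude state machine for tag stripping can be sketched as follows. This is not the code from downloader.sh; strip_tags is a hypothetical name, and the sketch tracks only two states, "text" (outside any tag) and "tag" (between < and >), reading the input one character at a time.

```shell
#!/bin/sh
# Illustrative sketch only: strip_tags is a hypothetical function, not the
# recipe's own code. The function body runs in a subshell, so its variables
# do not leak into the caller.

strip_tags() (
  rest=$1
  output=
  state=text
  while [ -n "$rest" ]; do
    tail=${rest#?}          # everything after the first character
    char=${rest%"$tail"}    # the first character itself
    rest=$tail
    case "$state:$char" in
      "text:<") state=tag ;;              # a tag opens: stop copying
      "tag:>")  state=text ;;             # the tag closes: resume copying
      text:*)   output="$output$char" ;;  # ordinary text: keep it
    esac                                  # anything else is inside a tag: drop it
  done
  printf '%s\n' "$output"
)

strip_tags '<li><a href="/eg.shtml">example pages</a></li>'
# prints: example pages
```

Because the state is carried from one character to the next, the same logic works whether a tag sits on one line or is split across several. It is still crude: a > character inside a quoted attribute value, for instance, would end the "tag" state too early.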
There are a number of pitfalls in processing HTML; there is no single definition of the language, although most HTML today is ...