Shell Scripting: Expert Recipes for Linux, Bash, and More by Steve Parker

Recipe 18-1: Parsing HTML

HTML is a very common markup language, but much of the HTML in use is poorly written, which makes parsing it difficult. This recipe shows a structure that strips the tags (<a>, <li>, and so on) from the HTML. The downloader.sh script acts on the <a> tags by downloading the linked URL to a file named after the anchor text. For example, the input <a href="http://www.example.com/">This is an example web site</a> downloads the index page of www.example.com to a file called “This is an example web site”.

Technologies Used

  • tr
  • ((suffix++))
  • wget
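
As a rough illustration of the first two items, the sketch below uses tr to break markup into lines and ((suffix++)) to pick an unused filename. How the recipe itself applies them is not shown in this excerpt; the unique_name function and the .N naming scheme are hypothetical.

```bash
#!/bin/bash
# tr: turn every ">" into a newline, so each tag ends its own line.
printf '%s\n' '<li><a href="/eg.shtml">example pages</a></li>' | tr '>' '\n'

# ((suffix++)): arithmetic post-increment, used here to find a filename
# that does not exist yet (page, page.1, page.2, ...).
# unique_name is a hypothetical helper, not part of the recipe.
unique_name() {
  local base=$1 name=$1 suffix=1
  while [ -e "$name" ]; do
    name="$base.$suffix"
    ((suffix++))   # starts at 1, so the expression is never 0 (a "false" status)
  done
  printf '%s\n' "$name"
}

unique_name "example pages"
```

Starting suffix at 1 rather than 0 matters if the script runs under set -e, because ((suffix++)) returns a failure status when the value before the increment is 0.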

Concepts

The action this recipe takes is not particularly important; wget -F -i (--force-html with --input-file) can do something very similar to what this script achieves. The real subject of the script is stripping tags from the HTML input.

Some HTML terminology is used in this recipe. In the input <a href="/eg.shtml">example pages</a>, /eg.shtml is the link, and example pages is the anchor text. By default, the browser displays the anchor text as blue underlined text, and the link is the address of the page that is displayed when the anchor text is clicked.
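
To make the two terms concrete, here is a minimal sketch that pulls the link and the anchor text out of a single, well-formed <a> element using bash parameter expansion. The function names get_link and get_anchor_text are hypothetical, and the sketch assumes a double-quoted href and no nested tags, which real-world HTML does not guarantee.

```bash
#!/bin/bash
# Hypothetical helpers, not taken from downloader.sh.

get_link() {
  local tag=$1 link
  tag=${tag#*href=\"}      # drop everything up to and including href="
  link=${tag%%\"*}         # keep what precedes the next double quote
  printf '%s\n' "$link"
}

get_anchor_text() {
  local tag=$1 text
  tag=${tag#*>}            # drop the opening <a ...> tag
  text=${tag%%<*}          # keep what precedes the closing </a>
  printf '%s\n' "$text"
}

sample='<a href="/eg.shtml">example pages</a>'
get_link "$sample"
get_anchor_text "$sample"
```

For the sample shown, the first call prints the link and the second prints the anchor text.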

The recipe uses a very crude state machine to keep track of what position in the HTML input the script has reached. Without this, it would be necessary to make many more assumptions about the format of the input file.
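
The recipe's own state machine is not shown in this excerpt, so the following is only a sketch of the general idea: a two-state machine that is in state "text" outside a tag and in state "tag" between < and >. The function name strip_tags and the state names are assumptions made for illustration.

```bash
#!/bin/bash
# Hypothetical two-state tag stripper, not the book's downloader.sh.
strip_tags() {
  local state=text line out char i
  while IFS= read -r line || [ -n "$line" ]; do
    out=
    for ((i = 0; i < ${#line}; i++)); do
      char=${line:i:1}
      case "$state:$char" in
        "text:<") state=tag ;;      # a tag opens: stop copying
        "tag:>")  state=text ;;     # the tag closes: resume copying
        text:*)   out+=$char ;;     # ordinary text is kept
        tag:*)    ;;                # characters inside a tag are dropped
      esac
    done
    printf '%s\n' "$out"
  done
}

printf '%s\n' '<ul><li>First <b>bold</b> item</li>' '<li>Second</li></ul>' | strip_tags
```

Because the state variable survives from one input line to the next, a tag that is split across lines is still removed, which is the kind of assumption about input layout the state machine avoids. A > inside an attribute value or an HTML comment would still confuse this sketch.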

Potential Pitfalls

There are a number of pitfalls in processing HTML; there is no single definition of the language, although most HTML today is ...