Parsing Some HTML

Problem

You want to pull the strings out of some HTML. For example, you’d like to get at the href="urlstringstuff" type strings from the <a> tags within a chunk of HTML.

Solution

For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:

cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r a b c ; do echo "$b"; done
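To try the one-liner without writing a script file, it can be wrapped in a function and run against a throwaway sample. This is only a sketch: the function name extract_first_quoted and the sample HTML are made up for illustration. Note what the pipeline really prints: the first double-quoted string on each line containing <a, which is the href only when href is the first attribute. The grep pattern also matches other tags that begin with <a, such as <abbr>.

```shell
#!/bin/sh
# extract_first_quoted: hypothetical wrapper around the one-liner above.
# Usage: extract_first_quoted FILE
extract_first_quoted() {
    # End the line after every '>' so each tag ends its own line,
    # keep only lines containing '<a', then split each line on double
    # quotes and print the second field (the first quoted string).
    sed -e 's/>/>\
/g' "$1" |
    grep '<a' |
    while IFS='"' read -r before value rest; do
        printf '%s\n' "$value"
    done
}

# Demo with a temporary sample file (made-up content).
sample=$(mktemp)
printf '%s\n' '<p>See <a href="http://example.com/one">one</a> and <a href="two.html">two</a>.</p>' > "$sample"
extract_first_quoted "$sample"
rm -f "$sample"
```

For the sample line, the function prints the two quoted href values, one per line.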

Discussion

Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line oriented whereas HTML was designed to treat newlines like whitespace. So it’s not uncommon to see tags split across two or more lines as in:

<a href="blah...blah...blah
  other stuff >

There are also two ways to write <a> tags: one with a separate ending </a> tag, and one without, where the single <a> tag itself ends with a /> (XHTML style). So, with multiple tags on a line and the last tag split across lines, the input is messy to parse, and this simple bash technique is not foolproof.
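One way to see the split-tag problem, and one possible workaround, is sketched here. The workaround is not part of the recipe's solution: it is an assumption that joining every line into one with tr before splitting on > is acceptable for your input. The helper name join_then_split and the sample tag are hypothetical.

```shell
#!/bin/sh
# join_then_split: hypothetical helper, not part of the recipe.
# Turns every newline into a space, then ends the line after each '>',
# so a tag that was split across lines ends up on a single line.
join_then_split() {
    tr '\n' ' ' | sed -e 's/>/>\
/g'
}

# A tag whose href sits on a different line from '<a'.
# Without the join, grep '<a' would keep only the bare '<a' line
# and the href would be lost.
printf '%s\n' '<a' '  href="split.html"' '  >text</a>' |
    join_then_split |
    grep '<a'
```

After the join, the line that grep keeps contains both the <a and its href. The cost is that the whole file becomes one long line before sed sees it, which may matter for very large inputs.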

Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:

cat file | sed -e 's/>/>\
/g'

Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally, i.e., multiple times on a line if need be.
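The effect of this first step is easy to check on its own. A minimal sketch, using a made-up sample line and a hypothetical helper named split_tags that reads from standard input:

```shell
#!/bin/sh
# split_tags: hypothetical helper; reads HTML on stdin and
# replaces every '>' with '>' followed by a newline.
# The '/g' must start in column 0: any indentation after the
# backslash-newline would become part of the replacement text.
split_tags() {
    sed -e 's/>/>\
/g'
}

printf '%s\n' '<p><a href="one.html">One</a> <a href="two.html">Two</a></p>' | split_tags
```

Each > now ends a line, so each <a ...> tag sits at the end of its own line. The final > on the input line is followed by the line's original newline, which is where the extra blank line comes from.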

Then you can pipe that output ...
