Parsing Some HTML
Problem
You want to pull the strings out of some HTML. For example, you’d like to get at the href="urlstringstuff" type strings from the <a> tags within a chunk of HTML.
Solution
For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:
cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read a b c ; do echo "$b"; done
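To try the pipeline without a file on hand, here is a minimal sketch that wraps it in a function and feeds it one line of made-up HTML. The function name extract_hrefs and the sample URLs are ours, not part of the recipe, and read -r and printf are small substitutions for the plain read and echo used above.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around the recipe's pipeline.
# Reads the files named as arguments, or stdin if there are none.
extract_hrefs() {
    # sed:  put a newline after every '>' so each tag ends its own line
    # grep: keep only the lines that contain "<a"
    # read: split on double quotes; b is the text between the first pair
    sed -e 's/>/>\
/g' "$@" | grep '<a' | while IFS='"' read -r a b c; do
        printf '%s\n' "$b"
    done
}

# Sample input, invented for illustration.
sample='<p>See <a href="http://example.com/one">one</a> and <a href="two.html">two</a>.</p>'

printf '%s\n' "$sample" | extract_hrefs
# prints:
#   http://example.com/one
#   two.html
```

Keep in mind that $b is simply the first double-quoted string on the line, so a tag such as <a class="x" href="y.html"> would yield x rather than the URL, and grep '<a' also matches other tags whose names begin with a.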
Discussion
Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line oriented whereas HTML was designed to treat newlines like whitespace. So it’s not uncommon to see tags split across two or more lines as in:
<a href="blah...blah...blah
other stuff >
There are also two ways to write <a> tags, one with a separate ending </a> tag, and one without, where instead the singular <a> tag itself ends with a />. So, with multiple tags on a line and the last tag split across lines, it’s a bit messy to parse, and our simple bash technique for this is often not foolproof.
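To see the split-tag problem in action, the sketch below runs the pipeline on a made-up <a> tag whose href sits on the second line. The function name hrefs and the sample input are hypothetical. The tr step at the end is one possible workaround that we are assuming here; it is not part of the recipe.

```shell
#!/usr/bin/env bash

# The recipe's pipeline as a filter on stdin (function name is hypothetical).
hrefs() {
    sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r a b c; do printf '%s\n' "$b"; done
}

# Made-up input: the <a> tag is split so that href lands on the second line.
split_tag='<a
   href="x.html">X</a>'

# As is, grep keeps only the line holding "<a", which has no quoted string,
# so an empty line is printed instead of the URL.
printf '%s\n' "$split_tag" | hrefs

# Possible workaround (an assumption, not from the recipe): turn newlines
# into spaces first, so every tag is whole before sed splits at each '>'.
printf '%s\n' "$split_tag" | tr '\n' ' ' | hrefs
# prints: x.html
```

Joining the whole input into one long line is fine for small chunks of HTML, though it does nothing for the other weak spots, such as attributes that come before href.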
Here are the steps involved in our solution. First, break up any line that holds multiple tags so that there is at most one tag per line:
cat file | sed -e 's/>/>\
/g'
Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally, i.e., multiple times on a line if need be.
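A quick way to watch this first step on its own is to feed sed a single line holding several tags. The sketch below does that with made-up input; the function names are hypothetical. The second function spells the same sed script on one line using bash's $'...' quoting, which is our variation rather than the recipe's.

```shell
#!/usr/bin/env bash

# Hypothetical helper: only the first step of the pipeline.
one_tag_per_line() {
    # The backslash is followed by a literal newline inside the quotes.
    sed -e 's/>/>\
/g'
}

# Same sed script written on one line: inside $'...' bash turns \\ into
# a backslash and \n into a newline before sed ever sees it.
one_tag_per_line_ansi() {
    sed -e $'s/>/>\\\n/g'
}

printf '%s\n' '<b>bold</b><a href="u.html">link</a>' | one_tag_per_line
# prints the four lines below, followed by one blank line:
#   <b>
#   bold</b>
#   <a href="u.html">
#   link</a>
```

The blank line at the end appears because the input line ended with a >, so sed adds its newline right before the line's own.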
Then you can pipe that output ...