Ruby Cookbook by Leonard Richardson, Lucas Carlson

11.5. Parsing Invalid Markup

Problem

You need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup.

Solution

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.

	require 'rubygems'
	require 'rubyful_soup'

	invalid_html = 'A lot of <b class=1>tags are <i class=2>never closed.'
	soup = BeautifulSoup.new(invalid_html)
	puts soup.prettify
	# A lot of
	#  <b class="1">tags are
	#   <i class="2">never closed.
	#   </i>
	#  </b>

	soup.b.i                                       # => <i class="2">never closed.</i>
	soup.i                                         # => <i class="2">never closed.</i>
	soup.find(nil, :attrs=>{'class' => '2'})       # => <i class="2">never closed.</i>
	soup.find_all('i')                             # => [<i class="2">never closed.</i>]

	soup.b['class']                                # => "1"

	soup.find_text(/closed/)                       # => "never closed."

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.
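To give a feel for the event-based style mentioned above, here is a minimal sketch that uses REXML's StreamListener rather than SGMLParser itself. The class name ItalicTextCollector and the sample document are illustrative assumptions, not part of the recipe. Unlike SGMLParser, REXML rejects malformed markup, so the sample input is deliberately well-formed; a custom SGMLParser subclass would presumably follow the same callback-driven shape.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener (not from the recipe): collects the text found
# inside every <i> element by reacting to REXML's parse events.
class ItalicTextCollector
  include REXML::StreamListener

  attr_reader :found

  def initialize
    @depth = 0   # number of <i> elements currently open
    @found = []  # text chunks seen while inside an <i>
  end

  # Called once for each opening tag; attrs is a Hash of attribute values.
  def tag_start(name, attrs)
    @depth += 1 if name == 'i'
  end

  # Called once for each closing tag.
  def tag_end(name)
    @depth -= 1 if name == 'i'
  end

  # Called for each run of character data between tags.
  def text(data)
    @found << data if @depth > 0
  end
end

# REXML only accepts well-formed input, so this sample closes every tag
# and quotes every attribute value.
valid_xml = '<p>A lot of <b class="1">tags are <i class="2">always closed.</i></b></p>'

collector = ItalicTextCollector.new
REXML::Document.parse_stream(valid_xml, collector)
p collector.found   # => ["always closed."]
```

Because the listener never builds a document tree, it keeps only the state it needs (here, a depth counter and a list of strings), which is where the performance benefit of an event-based parser comes from.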

Discussion

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web are invalid HTML, so if your application ...
