
Ruby Cookbook by Leonard Richardson, Lucas Carlson

11.13. Extracting All the URLs from an HTML Document

Problem

You want to find all the URLs on a web page.

Solution

Do you only want to find links (that is, URLs mentioned in the HREF attribute of an A tag)? Do you also want to find the URLs of embedded objects like images and applets? Or do you want to find all URLs, including ones mentioned in the text of the page?

The last case is the simplest. You can use URI.extract to get all the URLs found in a string, or to get only the URLs with certain schemes. Here we'll extract URLs from some HTML, whether or not they're inside A tags:

	require 'uri'

	text = %{"My homepage is at
	<a href="http://www.example.com/">http://www.example.com/</a>, and be sure
	to check out my weblog at http://www.example.com/blog/. Email me at <a
	href="mailto:bob@example.com">bob@example.com</a>.}
	URI.extract(text)
	# => ["http://www.example.com/", "http://www.example.com/",
	#        "http://www.example.com/blog/.", "mailto:bob@example.com"]

	# Get HTTP(S) links only.
	URI.extract(text, ['http', 'https'])
	# => ["http://www.example.com/", "http://www.example.com/",
	#        "http://www.example.com/blog/."]
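
Note that the weblog URL comes back with the sentence's closing period attached, because a period is a legal URL character. Here is a minimal sketch of one way to trim such trailing punctuation. It assumes that trailing punctuation is never a meaningful part of the URL, which will not hold for every document. The helper name extract_clean_urls and the sample string are hypothetical, and the sketch calls URI::RFC2396_Parser directly because URI.extract is marked obsolete in newer Ruby versions:

```ruby
require 'uri'

# Hypothetical helper: extract URLs, then strip trailing sentence punctuation.
# Assumption: a trailing '.', ',', ';', ':', '!' or '?' is never part of the URL.
def extract_clean_urls(text, schemes = nil)
  parser = URI::RFC2396_Parser.new
  parser.extract(text, schemes).map { |url| url.sub(/[.,;:!?]+\z/, '') }.uniq
end

sample = 'See http://www.example.com/blog/. Or mail mailto:bob@example.com, thanks.'

urls = extract_clean_urls(sample)
# => ["http://www.example.com/blog/", "mailto:bob@example.com"]

http_only = extract_clean_urls(sample, ['http'])
# => ["http://www.example.com/blog/"]
```

The call to uniq also collapses the duplicates you get when a URL appears both in an HREF attribute and in the link text.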

If you only want URLs that show up inside certain tags, you need to parse the HTML. Assuming the document is valid, you can do this with any of the parsers in the rexml library. Here's an efficient implementation using REXML's stream parser. It retrieves URLs found in the HREF attributes of A tags and the SRC attributes of IMG tags, but you can customize this behavior by passing a ...
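
As a rough illustration of the stream-parser approach (a sketch only, not the recipe's own listing), the following listener collects the HREF attributes of A tags and the SRC attributes of IMG tags. The class name LinkCollector, its tag-to-attribute hash argument, and the sample HTML are all assumptions made for this example. It also assumes the rexml library is installed (it ships as a bundled gem in recent Ruby versions) and that the input is well-formed XML:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects attribute values for chosen tag/attribute pairs.
class LinkCollector
  include REXML::StreamListener

  attr_reader :urls

  # tag_attributes maps a lowercase tag name to the attribute holding its URL.
  # (This argument is an assumption of the sketch, not the book's interface.)
  def initialize(tag_attributes = { 'a' => 'href', 'img' => 'src' })
    @tag_attributes = tag_attributes
    @urls = []
  end

  # REXML calls this for every opening tag; attrs is a Hash of name => value.
  def tag_start(name, attrs)
    wanted = @tag_attributes[name.downcase]
    return unless wanted
    attrs.each { |attr, value| @urls << value if attr.downcase == wanted }
  end
end

html = '<html><body><p><a href="http://www.example.com/">home</a> ' \
       '<img src="/logo.png" alt="logo"/></p></body></html>'

collector = LinkCollector.new
REXML::Document.parse_stream(html, collector)
found = collector.urls
# => ["http://www.example.com/", "/logo.png"]
```

Because the listener only reacts to opening tags, it never builds a document tree, which is what keeps the stream approach cheap on large pages.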
