Chapter 26. Harvesting Web Information

Introduction

The Web contains information galore. Much of this information is freely available: simply surf over to an organization’s web site and read its pages or search results. However, it can be difficult to separate the dross from the gems. The vast majority of a web page’s visual components are typically dedicated to menus, logos, advertising banners, and fancy applets or Flash movies. What if all you are interested in is a tiny nugget of data awash in an ocean of HTML?

The answer lies in using Java to parse a web page to extract only certain pieces of information from it. The web terms for this task are harvesting or scraping information from a web page. Perhaps web services (Chapter 27) will eventually replace the need to harvest web data. But until most major sites have their web services APIs up and running, you can use Java and certain javax.swing.text subpackages to pull specified text from web pages.

How does it work? Basically, your Java program uses HTTP to connect with a web page and pull in its HTML text.
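
As a minimal sketch of this fetching step (the class name and URL here are placeholders, not part of the chapter’s recipes), the following code uses java.net.URL and URLConnection to connect to a page over HTTP and read its raw HTML into a string:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class used only for illustration.
public class PageFetcher {

    // Connects to the given address over HTTP and returns the raw HTML text.
    public static String fetchHtml(String address) throws IOException {
        URL url = new URL(address);
        URLConnection conn = url.openConnection();
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the page you want to harvest.
        System.out.println(fetchHtml("http://www.example.com/"));
    }
}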

Tip

Parsing the HTML from web sites still involves transferring the entire web page over the network, even if you are interested in only a fraction of its information. This is why using web services is a much more efficient way to share specific data from a web site.

Your Java code then parses the HTML page to pull out only the piece of data you are interested in, such as a weather report or a stock quote. ...
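
As a rough sketch of this parsing step, using the javax.swing.text subpackages mentioned earlier, the following callback extracts a single nugget of data, the page’s <title> text, and ignores everything else. The class name and URL are illustrative only:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; harvests only the <title> text from a page.
public class TitleHarvester extends HTMLEditorKit.ParserCallback {

    private boolean inTitle = false;
    private String title = "";

    // Called by the parser at each opening tag; note when <title> begins.
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    // Called at each closing tag; note when </title> ends.
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    // Called for text between tags; keep it only while inside <title>.
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title = new String(data);
        }
    }

    public String getTitle() {
        return title;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; in practice, point this at the page you fetched above.
        Reader reader = new InputStreamReader(
                new URL("http://www.example.com/").openStream());
        TitleHarvester harvester = new TitleHarvester();
        new ParserDelegator().parse(reader, harvester, true);
        System.out.println("Page title: " + harvester.getTitle());
    }
}

The same pattern, tracking which tag the parser is currently inside and saving only the text you care about, applies to any other piece of data on the page.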
