Extracting HTML from a URL
Problem
You need to extract all the HTML tags from the document at a given URL.
Solution
Use this simple HTML tag extractor.
Discussion
A simple HTML tag extractor can be written by reading one character at a
time and looking for the < and > characters that delimit each tag. This is
reasonably efficient if a BufferedReader is used.
The ReadTag program shown in Example 17-5 implements this; given a URL, it
opens the URL for reading (similar to TextBrowser in Section 17.7) and
extracts the HTML tags. Each tag is printed to the standard output.
Example 17-5. ReadTag.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/** A simple but reusable HTML tag extractor. */
public class ReadTag {
    /** The URL that this ReadTag object is reading */
    protected URL myURL = null;
    /** The Reader for this object */
    protected BufferedReader inrdr = null;

    /* Simple main showing one way of using the ReadTag class. */
    public static void main(String[] args)
            throws MalformedURLException, IOException {
        if (args.length == 0) {
            System.err.println("Usage: ReadTag URL [...]");
            return;
        }
        for (int i = 0; i < args.length; i++) {
            // Process each URL in turn (args[i], not always args[0])
            ReadTag rt = new ReadTag(args[i]);
            String tag;
            while ((tag = rt.nextTag()) != null) {
                System.out.println(tag);
            }
            rt.close();
        }
    }

    /** Construct a ReadTag given a URL String */
    public ReadTag(String theURLString)
            throws IOException, MalformedURLException {
        this(new URL(theURLString));
    }

    /** Construct a ReadTag given a URL */
    public ReadTag(URL theURL) throws IOException {
        myURL = theURL;
        // Open the URL for reading
        inrdr = new BufferedReader(new InputStreamReader(myURL.openStream( ...
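The scanning step itself, reading a character at a time and collecting everything between < and >, can be sketched on its own. The following is a minimal illustration under stated assumptions, not the book's nextTag implementation: the class name TagScanSketch and the methods nextTag(Reader) and allTags(Reader) are hypothetical, and the demo reads from a StringReader rather than a URL so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not the book's ReadTag): scans a Reader for tags. */
final class TagScanSketch {

    /** Returns the next "<...>" run from the reader, or null at end of input. */
    public static String nextTag(Reader in) throws IOException {
        int c;
        // Skip everything up to the next '<'.
        while ((c = in.read()) != '<') {
            if (c == -1) {
                return null; // no more tags
            }
        }
        StringBuilder tag = new StringBuilder("<");
        // Copy characters until the closing '>' or end of input.
        while ((c = in.read()) != -1) {
            tag.append((char) c);
            if (c == '>') {
                break;
            }
        }
        return tag.toString();
    }

    /** Collects every tag found in the reader, in document order. */
    public static List<String> allTags(Reader in) throws IOException {
        List<String> tags = new ArrayList<>();
        String tag;
        while ((tag = nextTag(in)) != null) {
            tags.add(tag);
        }
        return tags;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p class=\"x\">Hi</p></body></html>";
        // BufferedReader keeps the per-character read() calls cheap.
        try (BufferedReader in = new BufferedReader(new StringReader(html))) {
            for (String tag : allTags(in)) {
                System.out.println(tag);
            }
        }
    }
}
```

To scan a real page, the StringReader could presumably be replaced by an InputStreamReader wrapped around the stream returned by URL.openStream(), as the ReadTag constructor does. This sketch treats the first > after a < as the end of the tag, so it will misread a > inside a quoted attribute value, a comment, or a script, and an unterminated tag at end of input is returned as is.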