Beautiful Code by Andy Oram, Greg Wilson

How Simple Can an HTML Parser Be?

In addition to being an open framework, FIT presents some other surprising design choices. Earlier, I mentioned that all of FIT's HTML parsing is done by the Parse class. One of the things that I love the most about the Parse class is that it constructs an entire tree with its constructors.

Here's how it works. You create an instance of the class with a string of HTML as a constructor argument:

	String input = read(new File(argv[0]));
	Parse parse = new Parse(input);

The Parse constructor recursively constructs a tree of Parse instances, each of which represents a portion of the HTML document. The parsing code is entirely within the constructors of Parse.
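To make the idea of a constructor that builds a whole tree concrete, here is a minimal sketch on a much simpler input format than HTML. It is not FIT's code: the class Chunk, its fields, and the delimiter scheme are hypothetical, chosen only to show one constructor call producing every node by calling itself for children and siblings.

```java
// Hypothetical toy class for illustration only; this is not FIT's Parse.
// It splits text on a different delimiter at each nesting level.
final class Chunk {
    final String text;   // the slice of input this node covers
    final Chunk parts;   // first child, or null at the deepest level
    final Chunk more;    // next sibling, or null for the last node

    Chunk(String input, String[] delimiters) {
        this(input, delimiters, 0);
    }

    private Chunk(String input, String[] delimiters, int level) {
        String delimiter = delimiters[level];
        int cut = input.indexOf(delimiter);

        // This node covers everything up to the first delimiter at this level.
        text = (cut < 0) ? input : input.substring(0, cut);

        // Recurse downward: the same slice, split by the next delimiter.
        parts = (level + 1 < delimiters.length)
                ? new Chunk(text, delimiters, level + 1)
                : null;

        // Recurse sideways: whatever follows the delimiter becomes the sibling.
        more = (cut < 0)
                ? null
                : new Chunk(input.substring(cut + delimiter.length()), delimiters, level);
    }
}

final class ChunkDemo {
    public static void main(String[] args) {
        // Rows are separated by ";" and cells within a row by ",".
        Chunk root = new Chunk("a,b;c,d", new String[] {";", ","});
        System.out.println(root.parts.more.text);  // prints "b"
        System.out.println(root.more.parts.text);  // prints "c"
    }
}
```

After the single call to new Chunk(...), no further parsing step is needed; the caller only follows references.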

Each Parse instance has five public strings and two references to other Parse objects:

	public String leader;
	public String tag;
	public String body;
	public String end;
	public String trailer;

	public Parse more;
	public Parse parts;

When you construct your first Parse for an HTML document, in a sense, you've constructed all of them. From that point on, you can use more and parts to traverse nodes. Here's the parsing code in the Parse class:

	static String tags[] = {"table", "tr", "td"};

	public Parse (String text) throws ParseException {
		this (text, tags, 0, 0);
	}

	public Parse (String text, String tags[]) throws ParseException {
		this (text, tags, 0, 0);
	}

	public Parse (String text, String tags[], int level, int offset) throws ParseException {
		String lc = text.toLowerCase();
		int startTag = lc.indexOf("<"+tags[level]);
		...
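Once such a tree exists, walking it takes only a loop and a recursive call. The sketch below uses a hypothetical stand-in class, SketchNode, rather than FIT's Parse, and it assumes that more links a node to its next sibling and parts links it to its first child. That reading is an assumption made for illustration; the excerpt above does not spell it out. The tree is built by hand here so the example stays independent of any parsing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is not FIT's Parse.
final class SketchNode {
    final String body;       // text of a leaf node; null for non-leaf nodes
    final SketchNode parts;  // assumed meaning: first child
    SketchNode more;         // assumed meaning: next sibling

    SketchNode(String body, SketchNode parts) {
        this.body = body;
        this.parts = parts;
    }

    // Attach a sibling and return this node so that calls can be chained.
    SketchNode then(SketchNode next) {
        this.more = next;
        return this;
    }

    // Collect leaf bodies depth-first: loop across siblings, recurse into children.
    static List<String> leafBodies(SketchNode first) {
        List<String> out = new ArrayList<>();
        for (SketchNode n = first; n != null; n = n.more) {
            if (n.parts == null) {
                out.add(n.body);
            } else {
                out.addAll(leafBodies(n.parts));
            }
        }
        return out;
    }
}

final class SketchDemo {
    // Hand-built tree shaped like a table with two rows of two cells each.
    static SketchNode sampleTable() {
        SketchNode row1 = new SketchNode(null,
                new SketchNode("a", null).then(new SketchNode("b", null)));
        SketchNode row2 = new SketchNode(null,
                new SketchNode("c", null).then(new SketchNode("d", null)));
        return new SketchNode(null, row1.then(row2));
    }

    public static void main(String[] args) {
        System.out.println(SketchNode.leafBodies(sampleTable()));  // prints [a, b, c, d]
    }
}
```

The same two-reference shape serves every level of the tree, which is why a single traversal method is enough for tables, rows, and cells alike.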
