Turning SAX Events into Data Structures

As described earlier, one of the great strengths of SAX is that it lets applications use appropriate data structures, instead of forcing the use of generic data structures. In Section 3.5.2 in Chapter 3, we looked at the problem of producing SAX events from data structures. Here we look at the reverse process: producing data structures from SAX events. This is a process that most SAX applications handle to one degree or another. One of the most traditional names for this process is unmarshaling; it’s also sometimes called deserializing. (I tend to avoid using the latter term with Java except when talking about RMI.)

We’ll first look at how to turn SAX into generic DOM (and DOM-like) data structures. If you’re working with such data structures, you may find it’s advantageous to build them using SAX. With SAX, you can easily discard data you don’t need, filtering it out so you don’t need to pay its costs. Afterward we’ll look briefly at some of the concerns associated with working with data structures that are more specialized to your application.

SAX-to-DOM Consumers

It’s easy to turn a SAX event stream into a complete DOM document tree, or into a DOM-like data structure such as DOM4J or JDOM. Most open source DOM parsers build those data structures directly from SAX event streams. (Xerces has the only such DOM I know that doesn’t work that way.) Building a DOM document from a SAX2 event stream requires implementing all four event consumer interfaces: ContentHandler, of course; LexicalHandler to report boundaries of entity references and CDATA sections as well as comments; and both DeclHandler and DTDHandler to provide the subset of DTD information that DOM requires. The implementations of those interfaces must use nonstandard DOM functions, because key functionality is missing from public DOM APIs. This means that if you’re using generic code to construct a DOM tree, you won’t be able to implement every behavior DOM specifies. If that doesn’t seem like a feature to you, you’ll need builder code that’s specialized to a particular DOM implementation.

Table 4-1 shows the classes that various DOM implementations provide for turning a SAX2 event stream into a DOM tree.[21] Most classes have configuration options to let you discard some of the minimally useful data, instead of saving it and making your application code ignore it later. Except as noted, they implement all four consumer interfaces. Each one has a way to present the DOM data it produces, usually with a getDocument() method; consult documentation (or source code) for full information.

Table 4-1. SAX-to-DOM consumer classes

Implementation   Class name                                   Comment
Crimson          org.apache.crimson.tree.XmlDocumentBuilder   Implements all the event consumer handlers.
DOM4J            org.dom4j.io.SAXContentHandler               Extends DefaultHandler; does not implement DeclHandler.
GNUJAXP          gnu.xml.dom.Consumer                         Uses the gnu.xml.pipeline framework.
JDOM             org.jdom.input.SAXHandler                    Extends DefaultHandler.

Example 4-1 uses the DOM implementation from Crimson to illustrate how easy it is to construct a DOM tree from SAX events.

Example 4-1. Converting SAX events to a DOM document (Crimson)

public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    XmlDocumentBuilder	consumer;
    XMLReader		producer;

    consumer = new XmlDocumentBuilder ();

    producer = XMLReaderFactory.createXMLReader ();
    producer.setContentHandler (consumer);
    producer.setDTDHandler (consumer);
    producer.setProperty 
	("http://xml.org/sax/properties/lexical-handler", 
	consumer);
    producer.setProperty 
	("http://xml.org/sax/properties/declaration-handler", 
	consumer);

    producer.parse (uri);
    return consumer.getDocument ();
}
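
If none of the classes in Table 4-1 is available, a similar result can be sketched with only JAXP classes. The following is not the Crimson approach shown above; it swaps in an identity TransformerHandler feeding a DOMResult, and it assumes the JAXP implementation in use supports SAXTransformerFactory. The class and method names (PortableSax2Dom, build, rootName) are hypothetical. A TransformerHandler does not implement DeclHandler, so no declaration handler is registered here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM tree from SAX events using JAXP only.
class PortableSax2Dom {

    static Document build(InputSource in) throws Exception {
        // Assumption: the default TransformerFactory is a SAXTransformerFactory.
        SAXTransformerFactory tf =
            (SAXTransformerFactory) TransformerFactory.newInstance();

        // An identity TransformerHandler is a SAX event consumer that
        // writes whatever it receives into the Result it is given.
        TransformerHandler consumer = tf.newTransformerHandler();
        DOMResult result = new DOMResult();
        consumer.setResult(result);

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader producer = spf.newSAXParser().getXMLReader();

        // TransformerHandler implements ContentHandler, DTDHandler,
        // and LexicalHandler, so it can be registered for all three.
        producer.setContentHandler(consumer);
        producer.setDTDHandler(consumer);
        producer.setProperty(
            "http://xml.org/sax/properties/lexical-handler", consumer);

        producer.parse(in);
        return (Document) result.getNode();
    }

    // Convenience wrapper for quick experiments; wraps checked exceptions.
    static String rootName(String xml) {
        try {
            Document doc = build(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The trade-off is the one described earlier: generic code like this can’t reach implementation-specific DOM features, so some DTD-related information won’t be preserved.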

Pruning Noise Data from a DOM Tree

For various historical reasons, DOM provides much information that just adds overhead to applications. When you build a DOM with SAX2, it’s particularly easy to prune that information out of DOM trees: you can simply arrange never to deliver it! Similar techniques are frequently used when feeding SAX event data to a component. It’s often easier to let the component see only parts of the Infoset that you care about than to remove the resulting data noise later.

The simplest example of this would be just to hook up the ContentHandler to a SAX parser and ignore the other three handlers. The resulting DOM will not have DTD information, but that’s no loss, because even DOM Level 2 doesn’t provide enough of the DTD information to be useful. (You can save more complete DTD information using custom SAX handlers, if you need it.) Because the LexicalHandler isn’t provided, you won’t see comment nodes or entity reference nodes (or their read-only children which really complicate your code). Also, any CDATA text nodes will be transparently merged with any adjacent “normal” text nodes. A DOM without such information is a lot easier to work with; your code won’t need to handle special cases that come from storing such data. It will also need somewhat less memory and take less time to construct the DOM tree.

To further streamline your data, override ignorableWhitespace() and discard whitespace characters. While such events won’t always be available even for documents that include DTDs, discarding “ignorable” characters can save significant amounts of memory. The savings vary widely based on DTDs and documents; documents that use mostly elements with element content models (often, but not always, data-oriented DTDs) have the biggest savings. Space savings of ten percent aren’t unreasonable and are coupled with some time savings for DOM tree construction, but such savings are highly data dependent. (You may be able to discard processing instructions, depending on your application.)
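One way to sketch this pruning without subclassing a particular DOM builder is a small SAX filter placed between the parser and the consumer. The class name NoiseFilter and its demo() method are hypothetical; XMLFilterImpl is the standard SAX2 helper class. Dropping processing instructions is only appropriate if your application never uses them.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes all events through unchanged except
// ignorable whitespace and processing instructions, which it drops.
class NoiseFilter extends XMLFilterImpl {

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int length) {
        // discard: never delivered to the downstream handler
    }

    @Override
    public void processingInstruction(String target, String data) {
        // discard: assumes the application does not need PIs
    }

    // Feeds two events through the filter by hand and returns
    // whatever character data reached the downstream handler.
    static String demo() {
        final StringBuilder seen = new StringBuilder();
        NoiseFilter filter = new NoiseFilter();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] b, int o, int n) {
                seen.append(b, o, n);
            }
            @Override
            public void ignorableWhitespace(char[] b, int o, int n) {
                seen.append("[ws]");
            }
        });
        try {
            char[] ws = "\n   ".toCharArray();
            filter.ignorableWhitespace(ws, 0, ws.length);
            char[] text = "data".toCharArray();
            filter.characters(text, 0, text.length);
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return seen.toString();
    }
}
```

To use such a filter with a real parser, call setParent() with the XMLReader, register the DOM-building consumer with setContentHandler() on the filter, and then call parse() on the filter rather than on the parser.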

Discarding much of the DOM data is so common that when you use JAXP to build a DOM tree, you can configure it to automatically discard some of the data. (Unfortunately, the default is to include all of that data.) You might not even need to strip out the events yourself. That configuration information gets sent directly to the SAX handler code that builds the DOM, and you can usually use it directly without needing to subclass. Example 4-2, a modified version of the previous example, shows this less noisy setup.

Example 4-2. Converting SAX events to DOM, discarding noise (Crimson)

public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    XmlDocumentBuilder	consumer;
    XMLReader		producer;

    consumer = new XmlDocumentBuilder ();
    consumer.setIgnoreWhitespace (true);

    producer = XMLReaderFactory.createXMLReader ();
    producer.setContentHandler (consumer);

    producer.parse (uri);
    return consumer.getDocument ();
}
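
For comparison, here is a minimal sketch of the JAXP configuration mentioned above, using the standard DocumentBuilderFactory setters rather than Crimson’s builder class. The class name QuietDomBuilder and its helper methods are hypothetical. Note that element content whitespace can only be recognized (and so ignored) when the document has a DTD that declares element content models.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: asks JAXP to leave "noise" out of the DOM tree.
class QuietDomBuilder {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            f.setIgnoringComments(true);        // no Comment nodes
            f.setCoalescing(true);              // CDATA merged into Text
            f.setExpandEntityReferences(true);  // no EntityReference nodes
            // Only has an effect when a DTD identifies element content:
            f.setIgnoringElementContentWhitespace(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Number of child nodes directly under the root element.
    static int childCount(String xml) {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    // All character data under the root element.
    static String text(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }
}
```

With these settings, a root element containing a comment, some text, and a CDATA section should end up holding only text, which is the kind of simplification the SAX-level pruning achieves.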

Building a Partial DOM

Often an even better solution for working with DOM is not to build an entire org.w3c.dom.Document object. You can build just the individual subtrees you need, never paying memory for the rest. Unfortunately, the classes listed earlier are set up to build entire document objects, so they won’t help. However, it’s easy to use SAX events to assemble trees of DOM nodes.

Here’s one way to do it. This example defines an interface that exposes an element type using a namespace URI and a local name. It also exposes an event handler method to call with a DOM subtree that holds only such elements and their children. In effect, DOM subtrees are streamed, rather than SAX events. Such a model could work well with documents that are huge but highly regular, if the subtrees were processed then immediately discarded to save memory. Such structures might represent a series of composite records built from database queries, for example.

Example 4-3 uses JAXP to bootstrap an empty DOM document, which is used as a factory to create DOM elements and text nodes. The factory should be used for attributes too, in a more complete example, and perhaps for processing instructions. Notice how the SAX document traversal exactly matches a walk over the DOM tree being constructed, and how the partial DOM tree serves as the only state that’s needed. Also notice that DOM handles namespaces slightly differently than SAX does. If you need to build DOM trees with SAX, your code doesn’t need to be much more complicated than this (other than passing attributes along) unless you try to implement all the gingerbread ornamenting the data model exposed by DOM.

Example 4-3. Using SAX to stream DOM subtrees

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

// a kind of event handler
interface DomListener
{
    public String getURI ();
    public String getLocalName ();
    public void processTree (Element tree) throws SAXException;
}

public class DomFilter extends DefaultHandler
{
    private Document	factory;
    private Element	current;
    private DomListener	listener;

    public DomFilter (DomListener l)
	{ listener = l; }

    public void startDocument ()
    throws SAXException
    {
	// all this just to get an empty document;
	// we need one to use as a factory
	try {
	    factory = DocumentBuilderFactory
		.newInstance ()
		.newDocumentBuilder ()
		.newDocument ();
	} catch (Exception e) {
	    throw new SAXException ("can't get DOM factory", e);
	}
    }

    public void startElement (String uri, String local,
	String qName, Attributes atts)
    throws SAXException
    {
	// start a new subtree, or ignore
	if (current == null) {
	    if (!listener.getURI ().equals (uri)) 
		return;
	    if (!listener.getLocalName ().equals (local)) 
		return;
	    current = factory.createElementNS (uri, qName);

	// Add to current subtree, descend.
	} else {
	    Element	e;

	    if ("".equals (uri))
		e = factory.createElement (qName);
	    else
		e = factory.createElementNS (uri, qName);
	    current.appendChild (e);
	    current = e;
	}
	// NOTE:  this example discards all attributes!
	// They ought to be saved to the current element.
    }

    public void endElement (String uri, String local, String qName)
    throws SAXException
    {
	Node	parent;

	// ignore?
	if (current == null)
	    return;
	parent = current.getParentNode ();

	// end subtree?
	if (parent == null) {
	    current.normalize ();
	    listener.processTree (current);
	    current = null;

	// else climb up one level
	} else
	    current = (Element) parent;
    }

    // if saving, append and continue
    public void characters (char buf [], int offset, int length)
    throws SAXException
    {
	if (current != null)
	    current.appendChild (factory.createTextNode (
		new String (buf, offset, length)));
    }
}
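
Example 4-3 discards attributes. One way they might be saved is sketched below as a standalone helper that could be called from startElement() after each element is created. The class name AttributeCopier and its methods are hypothetical. The sketch assumes the parser supplies qualified names for attributes, which SAX2 makes optional in the default namespace-processing mode, although most parsers do provide them.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: copies SAX attributes onto a DOM element.
class AttributeCopier {

    static void copy(Attributes atts, Element e) {
        for (int i = 0; i < atts.getLength(); i++) {
            String uri = atts.getURI(i);
            String qName = atts.getQName(i);  // assumed to be non-empty
            String value = atts.getValue(i);

            // Mirror the element handling in Example 4-3: use the
            // namespace-aware DOM call only when there is a namespace.
            if (uri == null || uri.isEmpty())
                e.setAttribute(qName, value);
            else
                e.setAttributeNS(uri, qName, value);
        }
    }

    // Builds one element with one attribute and reads the value back.
    static String demo() {
        try {
            Document factory = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
            Element e = factory.createElement("item");
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "id", "id", "CDATA", "42");
            copy(atts, e);
            return e.getAttribute("id");
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }
}
```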

You can use similar techniques to construct other kinds of data structures and to perform more interesting filter functions. For example, perhaps more than one element type is interesting, or some types of elements should be reported through different event handler callbacks. It’s also easy to transform the data as you read it; the DOM trees you construct don’t need to match the document structure that the parser reports.

Turning SAX Events into Custom Data Structures

If your application data structure or interchange syntax is already defined, you may not be able to unmarshal it using software based on the numerous schema-oriented tools. However, lots of software uses SAX to do this efficiently. Once you understand how SAX models data in XML documents, you can treat unmarshaling much like any other parsing problem. It’s closely associated with marshaling your data structures to XML. Here we’ll look at some of the issues you may want to consider when transforming XML into your data structures.

You may find that some individual data items, such as integers and dates, use the low-level encoding rules that are specified in Part 2 of the W3C XML Schema specification (http://www.w3c.org/TR/xmlschema-2/). Those encodings are low-level policy decisions, and they’re conceptually independent of the rest of the W3C Schema; you can use them even if you don’t buy the W3C approach to those schemas. Some other schema systems, such as Relax-NG, incorporate those low-level encoding policies without adopting more problematic parts of the W3C XML Schema specification. Your application might likewise want to use these policies.
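As a rough illustration of those low-level encodings, the sketch below converts the text of a few such data items into Java values. The class name LexicalValues is hypothetical, and the conversions are deliberately simplified: they trim surrounding whitespace and accept the common lexical forms, but they don’t claim to enforce every rule in the specification. DatatypeFactory is the standard JAXP class for the date and time types.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

// Hypothetical helper: decodes a few schema-style lexical values.
class LexicalValues {

    // Integers: optional sign followed by decimal digits.
    static int parseInt(String text) {
        return Integer.parseInt(text.trim());
    }

    // Booleans: the specification allows "true", "false", "1", and "0".
    static boolean parseBoolean(String text) {
        String s = text.trim();
        if (s.equals("true") || s.equals("1"))
            return true;
        if (s.equals("false") || s.equals("0"))
            return false;
        throw new IllegalArgumentException("not a boolean: " + text);
    }

    // Dates and times in the ISO 8601 style used by the specification,
    // for example 2002-01-15T10:30:00Z.
    static XMLGregorianCalendar parseDateTime(String text) {
        try {
            return DatatypeFactory.newInstance()
                .newXMLGregorianCalendar(text.trim());
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Methods like these would typically be called from endElement(), once all the character data for an item has been collected, or from startElement() for attribute values.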

One basic high-level encoding issue is how closely the XML structures and application structures should match. For example, an element will be easier to unmarshal by mapping its attributes (or child elements) directly to properties of a single application object rather than by mapping them to properties of several different objects. The latter design is more complex, and for many purposes it could be much more appropriate, but such unmarshaling code needs more complex state.

Regularity of the various structures is another issue. It’s usually less work to handle regular structures, since it’s easy to create general methods and reuse them. Bugs are less frequent and more easily found than when every transformation involves yet another special case.

You’ll need to figure out how much state you need to track and what techniques you will use. You might be able to use extremely simple parsing state machines; one of these is shown later, in Example 6-2. Sometimes it might be easier to unmarshal fragments into an intermediate form (as in the DOM subtrees example earlier), and map that form to your application structure before discarding the fragments.

Often some sort of recursive-descent parsing algorithm that explicitly tracks the state of your parsing activities will be useful. It will often be helpful to keep a stack of pending elements and attributes, as shown later (in Example 5-1). But since the XML structures might not map directly to your application structures, you might also need to stack objects you’re in various stages of unmarshaling.
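To make the idea of stacking partially unmarshaled objects concrete, here is a minimal sketch. Everything about the vocabulary is invented for illustration: the order and item elements, their id and sku attributes, and the Order, Item, and OrderHandler classes are all hypothetical. The handler assumes a parser that is not namespace aware (the JAXP default), so it matches on qualified names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application data structures.
class Item { String sku; int quantity; }
class Order { String id; final List<Item> items = new ArrayList<>(); }

// Unmarshals <order id="..."><item sku="...">quantity</item>...</order>.
class OrderHandler extends DefaultHandler {
    // Objects whose end tags have not been seen yet.
    private final Deque<Object> pending = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    Order result;

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) throws SAXException {
        text.setLength(0);
        if ("order".equals(qName) && pending.isEmpty()) {
            Order order = new Order();
            order.id = atts.getValue("id");
            pending.push(order);
        } else if ("item".equals(qName)
                && pending.peek() instanceof Order) {
            Item item = new Item();
            item.sku = atts.getValue("sku");
            pending.push(item);
        } else {
            throw new SAXException("unexpected element: " + qName);
        }
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        text.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        Object done = pending.pop();
        if (done instanceof Item) {
            Item item = (Item) done;
            try {
                item.quantity = Integer.parseInt(text.toString().trim());
            } catch (NumberFormatException e) {
                throw new SAXException("bad quantity", e);
            }
            // The enclosing order is now on top of the stack.
            ((Order) pending.peek()).items.add(item);
        } else {
            result = (Order) done;
        }
        text.setLength(0);
    }

    // Convenience wrapper for experiments; wraps checked exceptions.
    static Order unmarshal(String xml) {
        try {
            OrderHandler handler = new OrderHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because this XML structure maps directly onto the application objects, the stack holds only those objects. A less direct mapping would need to stack element names or intermediate values as well.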

The worst scenario is when neither the XML text nor the application data structures are very regular. Software to work with that kind of system quickly gets fragile as it grows, and you’ll probably want to change some of your application constraints.



[21] As presented in Chapter 3, in Section 3.5.1, most of these packages also support DOM-to-SAX event producers.
