Some Popular SAX2 Parser Distributions

Today a variety of high-quality SAX2 parsers are available. Increasingly, they are packaged with Java programming environments, so you may not need to fetch one yourself unless you need upgrades (or bug fixes), or are constructing such a programming environment yourself (perhaps packaging an embedded system or a standalone application). Whichever parser you use, you should be able to bootstrap it through the standard SAX bootstrapping API. As a rule, if an XML parser is part of your Java programming environment, it already supports SAX and probably SAX2. The documentation should say whether SAX2 is supported. If it mentions only SAX1, you can upgrade to get most of the core SAX2 features; see Section 5.2, in Chapter 5, for more information.
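One quick way to confirm that your environment already has a working SAX2 parser is to ask for an XMLReader and parse a tiny document. The following is a minimal sketch, not the only way to bootstrap: it assumes a JAXP-capable environment and uses SAXParserFactory to obtain the reader. The class name BootstrapCheck and the method countElements are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class BootstrapCheck {

    /** Parses the text with the environment's default parser and counts startElement events. */
    public static int countElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);

        // XMLReader is the SAX2 parser interface; obtaining one shows SAX2 is available.
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Three elements: a, b, and c.
        System.out.println(countElements("<a><b/><c/></a>")); // prints 3
    }
}
```

If this compiles and runs, the environment supplies a SAX2 parser and nothing more needs to be installed for basic work.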

If your programming environment doesn’t include a SAX parser, you’ll need to get and install one. This section provides a brief summary of some of the most widely available open source SAX2 parsers.[5] These packages all include SAX2, DOM Level 2, and JAXP 1.1 support, and can validate XML for you. They also have full support for the standard SAX2 extensions. If the documentation you download doesn’t include the SAX2 documentation, it’ll be available from the same site as the parser. All of these perform well in most applications, as long as you avoid the memory penalties of DOM.
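Claims about validation and extension support can be checked at runtime, because SAX2 parsers report unsupported feature flags and handler properties through exceptions. The sketch below probes whichever parser JAXP hands back. The feature and property URIs are the standard SAX2 identifiers; the class FeatureProbe and its method names are hypothetical, and the example.org URI is a deliberately bogus identifier.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class; names are assumptions made for this sketch.
public class FeatureProbe {
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String LEXICAL = "http://xml.org/sax/properties/lexical-handler";
    static final String DECL = "http://xml.org/sax/properties/declaration-handler";

    /** Obtains the environment's default SAX2 parser through JAXP. */
    public static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** True if the parser lets validation be switched on. */
    public static boolean canValidate(XMLReader reader) {
        try {
            reader.setFeature(VALIDATION, true);
            return reader.getFeature(VALIDATION);
        } catch (SAXException e) {
            // SAXNotRecognizedException or SAXNotSupportedException
            return false;
        }
    }

    /** True if the parser accepts the given value for the given property ID. */
    public static boolean acceptsProperty(XMLReader reader, String id, Object value) {
        try {
            reader.setProperty(id, value);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        // DefaultHandler2 implements both LexicalHandler and DeclHandler.
        DefaultHandler2 handler = new DefaultHandler2();
        System.out.println("validation:          " + canValidate(reader));
        System.out.println("lexical-handler:     " + acceptsProperty(reader, LEXICAL, handler));
        System.out.println("declaration-handler: " + acceptsProperty(reader, DECL, handler));
        System.out.println("bogus property:      "
            + acceptsProperty(reader, "http://example.org/no-such-property", handler));
    }
}
```

A parser with full support for the standard extensions should accept both handler properties and reject the bogus one.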

Current versions of all these parsers do quite well on the open source SAX/XML conformance tests, available at http://xmlconf.sourceforge.net/java/. Those tests verify that these processors report essential information required of a SAX1 processor, and evaluate how well they support the XML 1.0 specification. SAX2 conformance testing isn’t yet as far advanced, though some tests are now available.

In addition to a SAX2 parser, you will likely want to have some SAX2/XML utilities that are layered on top of that parser. The packages described here include a DOM implementation, which is normally provided as a clean layer over SAX2. You might also consider other more Java-friendly packages such as DOM4J (http://www.dom4j.org) or JDOM (http://www.jdom.org), both of which are layered over SAX2, as well as other APIs that provide more data-structure options. When you’re learning SAX, having access to the source code of tools and applications built with SAX can help you learn the API, at least if it’s high-quality source that uses the SAX APIs correctly.
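To see what "layered over SAX2" means in practice, here is one way to build a DOM tree from the events of a SAX2 parser. This is a sketch under an assumption: it uses the JAXP identity transformer as the DOM builder, which is a standard facility but not necessarily how any of the packages described here build their trees. The class name DomOverSax is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class DomOverSax {

    /** Builds a DOM Document from the SAX2 events produced for the given text. */
    public static Document build(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // The SAX2 parser is the event source; the DOM tree is the event consumer.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xml)));
        DOMResult result = new DOMResult();

        // A transformer created without a stylesheet copies its input unchanged.
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = build("<a><b/></a>");
        System.out.println(doc.getDocumentElement().getNodeName()); // prints a
    }
}
```

Because the tree builder sees only SAX2 events, any conforming parser can be substituted for the reader without touching the DOM side.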

Ælfred2

One of the original XML parsers mentioned earlier, Ælfred, has long been recognized for its simplicity, small size, and good performance. As XML parsers go, it is easy to read and understand. With a different maintainer (your humble author), this parser was updated to be the first with full native SAX2 support, and to substantially improve its conformance to the XML specification. This updated version is called Ælfred2, and versions have been incorporated in a variety of applications where its simplicity, size, and conformance are compelling features. It is now part of the GNU Classpath Extensions project and forms the core of the GNU JAXP library.

The updated version has taken SAX2 further than most other parsers. It has a highly modular structure; the reference distribution is able to use an optional “stream validator” that uses the SAX2 events. The model of an XML pipeline of such events is a natural and powerful way to think about SAX; the SAX2 pipeline package in this distribution lets applications compose arbitrary processing modules in series or parallel. This style of SAX2 processing is emphasized in this book, and some of the examples show how to use these advanced components. Validation and DOM support remain completely modular, and use SAX event pipelines, so Ælfred can still be distributed as a lightweight nonvalidating parser without those components. Likewise, the validation and DOM support don’t need Ælfred to work.
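The pipeline idea can be illustrated with nothing more than the standard XMLFilterImpl helper class. The sketch below is not the Ælfred2 pipeline package; it only shows the general pattern of composing SAX2 event stages in series. The class and stage names (PipelineSketch, DropElement, UpperCaseNames) and the "debug" element are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class; all names here are assumptions made for this sketch.
public class PipelineSketch {

    /** Stage 1: suppresses a named element and everything inside it. */
    static class DropElement extends XMLFilterImpl {
        private final String name;
        private int depth = 0; // greater than zero while inside a dropped subtree

        DropElement(XMLReader parent, String name) {
            super(parent);
            this.name = name;
        }
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0 || qName.equals(name)) { depth++; return; }
            super.startElement(uri, local, qName, atts);
        }
        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (depth > 0) { depth--; return; }
            super.endElement(uri, local, qName);
        }
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth == 0) super.characters(ch, start, length);
        }
    }

    /** Stage 2: upper-cases element names before passing events along. */
    static class UpperCaseNames extends XMLFilterImpl {
        UpperCaseNames(XMLReader parent) { super(parent); }
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, local.toUpperCase(Locale.ROOT),
                               qName.toUpperCase(Locale.ROOT), atts);
        }
        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            super.endElement(uri, local.toUpperCase(Locale.ROOT),
                             qName.toUpperCase(Locale.ROOT));
        }
    }

    /** Runs parser -> DropElement -> UpperCaseNames -> collector; returns the names seen. */
    public static List<String> run(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        XMLReader pipeline = new UpperCaseNames(new DropElement(parser, "debug"));

        final List<String> seen = new ArrayList<>();
        pipeline.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                seen.add(qName);
            }
        });
        pipeline.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<doc><item/><debug><item/></debug><item/></doc>"));
        // prints [DOC, ITEM, ITEM]
    }
}
```

Each stage knows only about SAX2 events, so stages can be reordered, removed, or replaced independently; a validator or a DOM builder is just another consumer at the end of such a chain.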

The current version of Ælfred is licensed under the GNU General Public License (GPL), with the “library exception” clause to ensure that it can be used in proprietary applications (notably, embedded systems) that aren’t themselves licensed under the GPL. That license is used with many GNU libraries, such as the GCC Java (GCJ) runtime libraries. Ælfred includes a gnujaxp.jar file that needs installation.

See http://www.gnu.org/software/classpathx/jaxp/ for information about the current distribution of Ælfred.

Crimson

Sun, through Java Project X in its Java division, was one of the earliest major Java vendors to support SAX and XML namespaces. This parser was the first to demonstrate that XML could be validated without a significant performance penalty. It was dozens of times faster than its competitors and offered more XML conformance. History buffs may like to know that its validation was based on some of the SGML/HTML validation code from the HotJava web browser, the original Java-and-the-Web showpiece software package. This XML code ties directly to some of the earliest Java software seen outside of JavaSoft.

Crimson is a version of the Java Project X software, updated to support SAX2, DOM Level 2, and JAXP 1.1 (for which it is the reference implementation). It was submitted to the Apache XML project to help trigger a “best of breed” XML parser.

Crimson is licensed under the Apache Software License. The Crimson parser has been incorporated into Sun’s JDK 1.4 release as its standard XML parser. It is separately distributed as the reference parser for JAXP, so most JAXP distributions include it. This book describes Crimson Version 1.1.3 (matching JDK 1.4), dated October 2001, which includes jaxp.jar and crimson.jar files that need installation.

See http://java.sun.com/xml/ for information about this distribution.

Xerces

Xerces is a family of XML parsers in the Apache XML project; in this book, we refer only to the Java version, not the C/C++ version. It has evolved from the second generation of IBM’s XML for Java (XML4J) parser, and much of its development and maintenance is still handled by IBM. It is relatively large, and is monolithic rather than modular. It also supports many nonstandard extensions. For example, validation against W3C’s XML schemas is part of the parser, rather than a layered feature.

Xerces v2 is a third-generation project. Goals of that project include a more maintainable and modular design. It includes an internal XML event pipeline model, which is strikingly similar to that used in Ælfred to layer validation and DOM support, except that it doesn’t use SAX2 to represent the XML Infoset data.

Xerces is licensed under the Apache Software License. This book describes Xerces Version 1.4.3, dated August 2001, which includes a xerces.jar file that needs installation.

See http://xml.apache.org/ for information about this distribution.
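After installing the JAR files for any of these distributions, you may want to confirm that the parser classes are actually visible on your class path. The following sketch tries to locate one SAX2 driver class for each parser described above. The driver class names are the ones commonly documented for these distributions, but treat them as assumptions and check the documentation for the version you installed; the class ParserProbe is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class ParserProbe {

    // Assumed SAX2 driver class names; verify against your distribution's documentation.
    static final String[] DRIVERS = {
        "gnu.xml.aelfred2.XmlReader",               // Ælfred2 (gnujaxp.jar)
        "org.apache.crimson.parser.XMLReaderImpl",  // Crimson (crimson.jar)
        "org.apache.xerces.parsers.SAXParser"       // Xerces (xerces.jar)
    };

    /** Maps each driver class name to whether it can be found on the class path. */
    public static Map<String, Boolean> probe() {
        Map<String, Boolean> found = new LinkedHashMap<>();
        for (String name : DRIVERS) {
            try {
                // Look the class up without initializing it.
                Class.forName(name, false, ParserProbe.class.getClassLoader());
                found.put(name, true);
            } catch (ClassNotFoundException | LinkageError e) {
                found.put(name, false);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> entry : probe().entrySet()) {
            System.out.println(entry.getKey() + ": "
                + (entry.getValue() ? "available" : "not found"));
        }
    }
}
```

A class reported as available can then be named when you bootstrap a parser explicitly, for example through the org.xml.sax.driver system property.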



[5] Proprietary SAX2 parsers exist, such as one from Oracle that is commonly used in Oracle-hosted server-side applications. More information is available on the Oracle web site, http://www.oracle.com/xml/.
