A Short History of SAX

The official SAX web site is at http://www.saxproject.org. You will find a more complete history there, with updates for anything that happened after this book went to print, as well as the current software release and its documentation.

SAX1

SAX 1.0 development started in late December 1997, shortly after publication of the last review draft of the XML 1.0 specification. The initial impetus was to let Java applications be independent of the parser they used, and to promote uniformity in the data models available to applications without imposing any particular data representation. At that time, several Java parsers existed (notably Ælfred, Lark, MSXML, and XP), each with its own API and feature set. That fragmentation was clearly counterproductive and had already caused complications for one early XML browser, Jumbo, which was used with Chemical Markup Language (CML), a markup language built with XML. (See http://www.xml-cml.org for more information about CML and Jumbo.)

Discussion proceeded quickly. The development primarily took place on the open Internet xml-dev mailing list. There was no bureaucracy since it was organized and run by one person, David Megginson, the original author of Ælfred. Essential contributions were made by developers of other Java XML parsers, including Tim Bray, editor of the XML 1.0 Recommendation and author of Lark, and James Clark, Technical Lead of the XML 1.0 Recommendation and author of XP. In terms of openness, the process was similar to those used historically by the Internet Engineering Task Force (IETF) and by many current open source development projects. Unlike the process for most recent Java/XML API standards, helping to define SAX required no nondisclosure agreements and no reassignment of intellectual property rights, and the process itself was transparent. Public list archives are available if you want to see how (or why!) some things turned out the way that they did.

The initial draft API was published in January 1998, less than a month after initial discussions started. It featured key characteristics still seen today: it was event based, and it separated interface from implementation without insisting that implementations commit to the overhead of a “provider” glue layer. To improve its coolness factor, it used the org.xml.sax package name, since Jon Bosak owned the “xml.org” DNS domain name and gave approval for that use.[3] Best of all, it was indeed a Simple API for XML. Discussions continued fast and furious. More developers helped improve these early proposals, including the author of this book.

The SAX1 API was finalized in May 1998, just three months after XML itself was finalized, and was generally well received. Most Java XML parsers quickly adapted to it, and new ones quickly adopted it. At one point, it was possible to find no fewer than a dozen open source SAX1 parsers. Today, new XML projects tend to build on top of standard APIs such as SAX, rather than underneath them, since most widely used parsers support SAX.

SAX2

When SAX1 was finished, there were features it did not address. That was to be expected because of the 80/20 rule. SAX1 satisfied the 80% of application requirements that involved only simple functionality, and at first only a small handful of applications needed anything more complex; that handful was much less than 20% of the application space. Notably, anyone who tried to use SAX to round-trip XML data found that important parts were omitted. (Round-tripping a SAX event stream means turning it back into XML text and parsing the result, without losing any data.) Similarly, anyone using SAX to construct a DOM tree found a mismatch: DOM expected more information than SAX provided. Although many applications were happy not to see that additional data, it was still a conformance issue. Moreover, since DTD declarations were not available, it wasn’t practical to maintain arbitrary valid documents with a SAX1-only parser. On top of all that, it wasn’t possible to tell whether a parser could validate, nor could you change whether or not it was validating; this all but prevented parser-neutral application configuration and setup. As developers learned their way around XML, the 80/20 line shifted, so more functionality was needed.

So discussions continued, but at a much slower pace. In late 1998 some draft interfaces were posted, which later became the basis of the two standard SAX2 extensions. (Not many parsers worked with those interfaces until they were fleshed out later in the SAX2 process.) Discussions later in the next year focused on ways to let such additional extension handlers and other new features be added without changing “core” APIs, by supporting parser configurability.
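The parser configurability that came out of those discussions appears in SAX2 as feature flags, which are identified by URIs and are read and set through XMLReader. The following is a minimal sketch of how an application might probe and change the validation setting in a parser-neutral way. The class name FeatureProbe and its helper methods are hypothetical, and the sketch assumes that some SAX2 parser is available to XMLReaderFactory (as it is in current JDKs).

```java
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// FeatureProbe is a hypothetical helper class, not part of SAX.
// XMLReaderFactory is marked deprecated in recent JDKs but still works.
@SuppressWarnings("deprecation")
class FeatureProbe {
    // Standard SAX2 feature identifier for validation.
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Obtain whatever SAX2 parser the environment provides.
    static XMLReader newReader() {
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new IllegalStateException("no SAX2 parser available", e);
        }
    }

    // Report a feature as "on", "off", or "unknown" (not recognized
    // or not readable by this parser).
    static String describe(XMLReader reader, String featureId) {
        try {
            return reader.getFeature(featureId) ? "on" : "off";
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "unknown";
        }
    }

    // Try to change a feature; return false if the parser refuses.
    static boolean trySet(XMLReader reader, String featureId, boolean value) {
        try {
            reader.setFeature(featureId, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("validation before: " + describe(reader, VALIDATION));
        System.out.println("change accepted:   " + trySet(reader, VALIDATION, true));
        System.out.println("validation after:  " + describe(reader, VALIDATION));
    }
}
```

Because an unsupported setting is reported through an exception rather than silently ignored, an application can find out what a given parser offers without knowing which parser it is talking to.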

The final catalyst for SAX2 was probably the realization that without parser-level API support, the XML Namespaces specification was unlikely to be adopted soon with any really standard semantics. Application-specific implementations tended to have bugs in their interpretation of the namespaces specification. (That specification has turned out to cause a surprising amount of confusion.) To make a long story short, further discussions happened, and SAX2 was finalized in May 2000. SAX1 parsers were initially wrapped in adapters that layered the namespace processing, making it easy to convert to the core SAX2 APIs. The first parser to natively support the full set of SAX2 APIs, including the extension interfaces, was Ælfred2, in the second half of 2000. By the second half of 2001, such support was available in the current releases of most other widely used parsers.

This book focuses on the current SAX2 release, which includes minor bug fixes, more robust bootstrapping, and clarifications and explanations in the API documentation.

SAX2 Extensions

As mentioned earlier, one of the original reasons to extend SAX1 was that the SAX core didn’t expose information needed by various applications and, of course, DOM. Not everyone needs or wants that information. A cautionary example is exposing comments, which were never intended to be used (or seen) by applications; they were grandfathered into XML APIs through horrible accidents involving old HTML browsers and DOM. However, lack of such data was a problem for some applications. That 80/20 rule kept such features at a relatively low priority. The fact that exposing this information called for changes to parser internals ensured that it couldn’t be part of the SAX2 core. (Because information such as a comment was discarded in SAX1, support for it couldn’t be layered on top of an existing parser in the way that org.xml.sax.helpers.ParserAdapter layers namespace support.)

The resolution was to decouple development of the SAX2 declaration and lexical handlers from the SAX core and to make them optional. The “SAX-extension” interfaces were not finalized until December 2000, well after the SAX2 core was finalized; at this writing, many of the deployed SAX2 parsers still only support the beta test versions of those interfaces. In practice, most SAX2 parsers do support these two handlers, which are mostly used to develop infrastructure tools. Applications value the “simple” nature of SAX, which lets them focus primarily on a single event handler interface included in the core of SAX.
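Because the lexical handler is optional, it is registered through a SAX2 property rather than through a dedicated setter method. The following is a minimal sketch of an application that collects comments this way. The class name CommentCollector and its collect method are hypothetical, and the sketch assumes that the parser obtained from XMLReaderFactory supports the lexical-handler property (the parser bundled with current JDKs does).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// CommentCollector is a hypothetical example class, not part of SAX.
// XMLReaderFactory is marked deprecated in recent JDKs but still works.
@SuppressWarnings("deprecation")
class CommentCollector extends DefaultHandler implements LexicalHandler {
    // Standard SAX2 property identifier for the lexical handler.
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    // The only lexical event this sketch cares about.
    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // The remaining LexicalHandler callbacks are deliberately ignored.
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }

    // Parse the given XML text and return the text of every comment seen.
    static List<String> collect(String xml) {
        try {
            CommentCollector handler = new CommentCollector();
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(handler);
            // A parser without lexical support throws
            // SAXNotRecognizedException on this call.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.comments;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<doc><!--hello--><item/></doc>"));
    }
}
```

An application that only needs element and character data can skip the setProperty call entirely, which is the sense in which these handlers stay out of the way of the simple case.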

In the future, most SAX2 extensions can be layered independently of SAX2 parsers. Very few additional kinds of information appear to need standardized support from inside such parsers.[4] Today, most new XML technologies are defined as layers above the XML Infoset, so they can (and should!) be implemented as layers above SAX2-based parsers rather than within them.

Is SAX2 a “Standard”?

In a word, yes: SAX is a shining example of a de facto standard API. You will have a hard time finding an XML parser written in Java that doesn’t support SAX. In contrast to the recent spate of standards originated by a formal de jure standards body (notably the International Organization for Standardization, or ISO), or to specifications pushed by vendors or a vendor-dominated consortium, SAX2 is a standard in the more classic sense. It was hammered into shape by users, quenched in the fire of real-world use, and adopted as a tool after it proved its worth. This partly explains its small size and clear focus: it had a clear mission, and little of the “mission creep” pressure often caused by standards organization politics. This also explains why its legal status may seem to be unique; SAX is in the public domain, not copyrighted or controlled by any corporation or consortium.

You may be familiar with other examples of technology standards that were developed similarly. For example, the sockets network API widely used for TCP was popularized at the University of California at Berkeley; from there it migrated into other Unix systems and then into Microsoft Windows. Similar processes occurred for other core Unix APIs and the standard C Library functions. (Some have entered de jure standardization processes through the ANSI or IEEE POSIX processes, or have been adopted by vendor consortiums like the one that produced the UNIX98 API set.)

The same sort of process has been happening with SAX. From its initial base in Java, it’s been imported into many XML-programming tool sets in Python, Perl, Pascal, and JavaScript. There are several different C/C++ versions, and Microsoft has even provided SAX-like COM interfaces. Each new environment has made changes and adaptations. Some have remained truer to the original API (in Java) than others, but it looks as if this growth will only continue. In the best and most classic sense, SAX is a standard.

Sun’s Java API for XML Processing (JAXP)

As of JDK 1.4, SAX2 has been incorporated into the Java 2 Standard Edition (J2SE) through Sun’s Java Community Process. It’s part of Version 1.1 of Sun’s Java API for XML Processing (JAXP). A JAXP implementation is bundled with JDK 1.4 releases, and is available separately for use with other Java-compliant platforms (JDK 1.1 and later). The Java 2 Enterprise Edition (J2EE) has recognized JAXP for some time, and web applications that use servlets have long been using XML and SAX.

From the perspective of this book, JAXP is just a vehicle to get SAX2 interfaces (and the Crimson parser) into the hands of more Java developers. Sun incorporated these standard APIs directly into their API set, exactly as one would desire. In this case, the real community process had completed before Sun’s own process started. The stamp of recognition provided by Sun facilitated further adoption; some organizations are uncomfortable with software that has no such recognition.

JAXP 1.1 also incorporates DOM Level 2 and some other APIs, including TRAX, a wrapper for XSLT-based transformations. TRAX offers limited SAX support: it can produce partial SAX event streams as output and can sometimes accept them as input. It’s worth noting that if you use DOM, JAXP solves a critical portability problem for you. JAXP provides the first “standard” Java solution for vendor-independent bootstrapping with DOM. DOM Level 3 plans to address that problem, but JAXP will have solved it years before Level 3 becomes widely available. If you don’t use JAXP’s DOM bootstrap APIs, you must use vendor-specific APIs to get a document object that’s populated with the content of any XML text. Starting down the path of such vendor-specific APIs quickly leads to nonportable code. SAX has never had that problem, because it has always included vendor-neutral bootstrapping APIs. (Although JAXP defines additional SAX bootstrapping APIs, this book discourages their use.)
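To make the two bootstrapping routes concrete, here is a minimal sketch that obtains an XMLReader both ways and feeds the same document through each. The class name Bootstrap and its methods are hypothetical. The JAXP route appears only for comparison, and the sketch assumes a JDK that bundles both APIs.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Bootstrap is a hypothetical example class, not part of SAX or JAXP.
// XMLReaderFactory is marked deprecated in recent JDKs but still works.
@SuppressWarnings("deprecation")
class Bootstrap {
    // SAX2's own vendor-neutral bootstrap.
    static XMLReader viaSax() {
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // JAXP's additional bootstrap, which hands back a SAX2 XMLReader.
    static XMLReader viaJaxp() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories are not namespace aware unless asked to be.
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // Count start-element events to show that either reader drives the
    // same event-based handler code.
    static int countElements(XMLReader reader, String xml) {
        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<a><b/><c/></a>";
        System.out.println("SAX bootstrap:  " + countElements(viaSax(), xml));
        System.out.println("JAXP bootstrap: " + countElements(viaJaxp(), xml));
    }
}
```

Once a reader is in hand, nothing else in the application depends on how it was obtained.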



[3] This domain was subsequently transferred to the OASIS group, which later took over operations for the xml-dev mailing list. However, SAX still remains independent of OASIS. SAX is currently maintained using SourceForge.net project resources.

[4] See the SAX web site for more information about these. Also, some probable new extensions are noted in Appendix B.
