What You Get Is Not What You Saw

The XML specification provides several loopholes that permit XML parsers to play fast and loose with your document’s literal contents, while retaining the semantic meaning. Comments can be omitted and entity references silently replaced by the parser without any warning to the client application. Non-validating parsers aren’t required to retrieve external DTDs or entities, although the parser should at least warn applications that this is happening. While reconstructing an XML document with exactly the same logical structure and content is possible, guaranteeing that it will match the original in a byte-by-byte comparison generally is not.

Tip

XML Canonicalization defines a more consistent form of XML and a process for producing it that permits a much higher degree of predictability in reconstructing a document from its logical model. For details, see http://www.w3.org/TR/xml-c14n.

Authors of simple XML processing tools that act on data without storing or modifying it might not consider these constraints particularly restrictive. The ability to reconstruct an XML document precisely from in-memory data structures, however, becomes more critical for authors of XML editing tools and content-management solutions. While no parser is required to make all comments, whitespace, and entity references available from the parse stream, many do or can be made to do so with the proper configuration options.

The only real option to ensure that a parser reports documents ...

Get XML in a Nutshell, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.