To Read the DTD or Not To Read the DTD?

A DTD can be supplied as an internal subset, an external subset, or both. The XML specification requires all parsers to read the internal DTD subset. Validation requires reading the external DTD subset (if any), but if you don’t validate, reading it is optional. Reading the external DTD subset takes extra time, especially if the DTD is large or stored on a remote network host, so you may not want to load it if you’re not validating. Most parsers provide options to specify whether the external DTD subset and other external entities should be resolved. If validation were all a DTD did, then the decision of whether to load the DTD would be easy. Unfortunately, DTDs also augment a document’s infoset with several important properties, including:

  • Entity definitions

  • Default attribute values

  • Whether boundary whitespace is ignorable
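
To make the first two items concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `memo` document, the `org` entity, and the `status` attribute are invented for illustration. The declarations sit in the internal DTD subset, which a parser must read, so the second document (the same instance with no declarations) only stands in for what a parser might see when it skips an external subset carrying those declarations. Boundary whitespace is not covered here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the internal DTD subset declares one
# general entity and one default attribute value.
WITH_DTD = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY org "Example Corp">
  <!ATTLIST memo status CDATA "draft">
]>
<memo>From &org;</memo>"""

# The same instance with no declarations visible to the parser.
WITHOUT_DTD = """<?xml version="1.0"?>
<memo>From Example Corp</memo>"""


def report(xml_text):
    """Return the (text, attributes) pair the parser reports for the root."""
    root = ET.fromstring(xml_text)
    return root.text, dict(root.attrib)


with_dtd = report(WITH_DTD)
without_dtd = report(WITHOUT_DTD)

print("with declarations:   ", with_dtd)
print("without declarations:", without_dtd)
```

In the first document the entity reference and the `status` attribute exist only because of the DTD. Neither is written on the element itself, so an application that depends on `status` sees different data depending on whether the parser read the declarations.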

At the extreme, since a document with a malformed DTD is itself malformed, a DTD can make a document readable or unreadable. This means that whether a parser reads the external DTD subset can significantly affect what the parser reports. For maximum interoperability, documents should be served without external DTD subsets. In this case, parser behavior is deterministic and reproducible, regardless of configuration. On the flip side, a consumer of XML documents who wants to be sure of receiving what the sender intended should attempt to read any external DTD subset the document references. Be conservative in what you send (don’t ...
