Understanding the DOM

The DOM structure is essentially a hierarchy of node objects. Beginning with the root of the document (which is not the same as the document element), every construct in the document is represented by a node of some type: an element, text, an attribute of an element, or one of the less common node types. Each node holds a list of references to its child nodes, which can in turn be of the same types as the children of the parent node. A complete document therefore looks just like a tree, all the way from the “trunk” (the root of the tree) out to the leaf nodes representing text, childless elements, comments, processing instructions, and possibly other constructs. Figure 4-1 shows a very simple DOM hierarchy that includes a root element, two child elements, and their respective child text nodes. Depending on the parser in use, the character data of an element may be delivered as several text nodes rather than one, so a contiguous run of text can appear as a sequence of adjacent text nodes.

Figure 4-1. A simple DOM hierarchy
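
The shape of such a hierarchy can be seen by walking a small document recursively. The following is a minimal sketch, assuming the `xml.dom.minidom` implementation from the Python standard library; the sample document and the `describe` helper are hypothetical names introduced only for this illustration. Other parsers may split the text differently.

```python
from xml.dom import minidom

# Hypothetical sample: a root element with two child elements,
# each of which holds a short run of text.
SAMPLE = "<root><first>one</first><second>two</second></root>"


def describe(node, depth=0, lines=None):
    """Collect an indented outline of the subtree rooted at node."""
    if lines is None:
        lines = []
    if node.nodeType == node.ELEMENT_NODE:
        label = "element " + node.tagName
    elif node.nodeType == node.TEXT_NODE:
        label = "text " + repr(node.data)
    else:
        # The document node reports its nodeName as "#document".
        label = node.nodeName
    lines.append("  " * depth + label)
    # Every node exposes childNodes; for leaf nodes it is simply empty.
    for child in node.childNodes:
        describe(child, depth + 1, lines)
    return lines


doc = minidom.parseString(SAMPLE)
outline = describe(doc)
print("\n".join(outline))
```

Note that the walk starts at the document node, not at the `root` element; the element appears one level down, as a child of the document.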

When a document is represented by the DOM, a document node sits at the top of the object hierarchy and represents the entire document. As with other nodes, it can contain children; the outermost element of the document is simply a child of the document node. The document can have other children as well: comments and processing instructions can precede or follow the document element, and they appear in the proper order as children of the document. The document type declaration is also represented as a child of the document.
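
The children of the document node can be listed directly. The following is a minimal sketch, again assuming `xml.dom.minidom` from the standard library; the sample document and the `KIND` lookup table are hypothetical and exist only for this illustration.

```python
from xml.dom import minidom, Node

# Hypothetical sample: a document type declaration, a comment, the
# document element, and a trailing processing instruction.
SAMPLE = (
    '<?xml version="1.0"?>'
    "<!DOCTYPE root>"
    "<!-- leading comment -->"
    "<root>text</root>"
    "<?trailer some data?>"
)

# Readable labels for the node types expected at the top level.
KIND = {
    Node.DOCUMENT_TYPE_NODE: "doctype",
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

doc = minidom.parseString(SAMPLE)

# Children of the document node, in document order.
kinds = [KIND[child.nodeType] for child in doc.childNodes]
print(kinds)

# The document element is just one of those children.
print(doc.documentElement in doc.childNodes)
```

The XML declaration itself does not show up as a node; only the constructs that follow it are children of the document.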

The W3C was careful to specify the DOM in a language-independent way, and each programming language has its own way to present the interfaces to the application programmer. Each of these mappings of the DOM into the idioms of a target language is called a binding of the DOM. The W3C includes bindings for Java and ECMAScript as part of the DOM specifications. For Python, the official source of the DOM binding is the Python XML-SIG. The binding developed by the SIG members has been documented in the Python Library Reference, which is part of the standard documentation package for Python. Reference material for the DOM has been included in Appendix D of this book, but the standard documentation should be considered the authoritative document for this binding.

The DOM specifications provide the interfaces as CORBA IDL modules and Java interfaces, but do not specify (or even recommend) that the language-specific IDL mappings adopted by the Object Management Group (OMG) be used. In fact, the Java interfaces provided by the W3C do not match the IDL-to-Java mapping. For Python, the XML-SIG decided that a somewhat more Python-friendly mapping would be used, with some concessions made to the IDL-to-Python mapping. Since no one seems to be using the IDL-derived form of the binding, we cover only the Python-centric version of the DOM binding in this book.
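
One place where a Python-friendly mapping is visible is the list of child nodes. The following is a minimal sketch, assuming `xml.dom.minidom` from the standard library, whose node lists support both the `item()` method and `length` attribute described by the DOM interfaces and the ordinary Python sequence protocol. It does not demonstrate the IDL-derived form of the binding; the sample document is hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample: a root element with three empty child elements.
doc = minidom.parseString("<root><a/><b/><c/></root>")
children = doc.documentElement.childNodes

# Access in the style of the DOM interface definitions:
# an index-based item() method and a length attribute.
dom_style = [children.item(i).nodeName for i in range(children.length)]

# Access in ordinary Python style: iteration, len(), and indexing.
python_style = [child.nodeName for child in children]

print(dom_style)
print(python_style)
print(len(children) == children.length)
```

Both forms visit the same nodes in the same order, so code written for this implementation can usually use whichever reads more naturally.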
