Retrieving Information

Retrieving information from a document is easy using the DOM. Most of the work lies in traversing the document tree and selecting the nodes that are actually interesting for the application. Once that is done, it is usually trivial to call a method of the node (or nodes), or to retrieve the value of an attribute of the node. In order to extract information using the DOM, however, we first need to get a DOM document object.

Getting a Document Object

Perhaps the most glaring hole in the DOM specifications is that there is no facility in the API for retrieving a document object from an existing XML document. In a browser, the document is completely loaded before the DOM client code in the embedded or linked scripts can get to the document, so the document object is placed in a well-known location in the script’s execution environment. For applications that do not live in a web browser, this approach simply does not work, so we need another solution.

Our solution depends on the particular DOM implementation we use. We can always create a document object from a file, a string, or a URL.

Loading a document using 4DOM

Creating a DOM instance to work with is easy in Python. Using 4DOM, we need to call only one function to load a document from an open file:

import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream

doc = FromXmlStream(sys.stdin)

Loading a document using minidom

There are two convenient functions in the xml.dom.minidom module that can be used to load a document. The parse function takes a parameter that can be a string containing a filename or URL, or it can be a file object open for reading:

import sys
import xml.dom.minidom

doc = xml.dom.minidom.parse(sys.stdin)

Another function, parseString, can be used to load a document from a buffer containing XML text that has already been loaded into memory:

doc = xml.dom.minidom.parseString("<doc>My tiny document.</doc>")
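
The two minidom entry points can be tried side by side. This is a minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom; the io.BytesIO object is only a stand-in for a real file opened for reading, and the variable names are illustrative.

```python
import io
import xml.dom.minidom

xml_text = "<doc>My tiny document.</doc>"

# parseString takes XML text that is already in memory.
doc_from_string = xml.dom.minidom.parseString(xml_text)

# parse accepts a filename or a file object open for reading;
# io.BytesIO stands in for a real file here.
stream = io.BytesIO(xml_text.encode("utf-8"))
doc_from_stream = xml.dom.minidom.parse(stream)

print(doc_from_string.documentElement.tagName)
print(doc_from_stream.documentElement.firstChild.data)
```

Both calls should yield equivalent document objects, so the rest of a program need not care which one was used.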

Determining a Node’s Type

You can use the constants built into the DOM to see what type of node you are dealing with. It may be an element, an attribute, a CDATA section, or a host of other things. (All the node type constants are listed in Appendix D.)

To test a node’s type, compare its nodeType attribute to the particular constant you’re looking for. For example, a CDATASection instance has a nodeType equal to CDATA_SECTION_NODE. An Element (with potential children) has a nodeType equal to ELEMENT_NODE. When traversing a DOM tree, you can test a node at any point to determine whether it is what you’re looking for:

for node in nodes.childNodes:
  if node.nodeType == node.ELEMENT_NODE:
    print "Found it!"

The Node interface has other identifying properties, such as its value and name. The nodeName value represents the tag name for elements, while in a text node the nodeName is simply #text. The nodeValue attribute is None for elements (the DOM specification’s null), and holds the actual character data for a text node or other leaf-type node.
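
To see these identifying properties together, here is a small sketch that assumes Python 3 and xml.dom.minidom; the summary list is an illustrative name, not part of the DOM.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<doc>My tiny document.</doc>")
element = doc.documentElement
text = element.firstChild

# Gather (nodeType, nodeName, nodeValue) for the element and its text child.
summary = [(n.nodeType, n.nodeName, n.nodeValue) for n in (element, text)]

# The type constants are available on every node.
print(element.nodeType == element.ELEMENT_NODE)
print(text.nodeType == text.TEXT_NODE)
for row in summary:
    print(row)
```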

Getting a Node’s Children

When dealing with a DOM tree, you primarily use nodes and node lists. A node list is a collection of nodes. Any level of an XML document can be represented as a node list. Each node in the list can in turn contain other node lists, which is how the DOM represents arbitrarily deep nesting in an XML document.

The Node interface features two attributes for quickly getting to a specific child node, as well as an attribute that holds a node list containing all of a node’s children. firstChild refers to the first child node of any given node; its value is None if the node has no children. This is handy when you know the exact structure of the document you’re dealing with. If you are working with a strict content model enforced by a schema or DTD, you may be able to count on the document being organized in a certain way (provided you included a validation step). For the most part, though, it’s best to leverage the spirit of XML and actually traverse the document for the data you’re looking for, rather than assume there is logic to the location of the data. Regardless, firstChild can be very powerful, and is often used to retrieve the first element beneath a document element.

The lastChild attribute is similar to firstChild, but returns the last child node of any given node. Again, this can be handy if you know the exact structure of the document you’re working with, or if you’re trying to just get the last child regardless of the significance of that child.
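
A brief sketch of firstChild and lastChild, assuming Python 3 and xml.dom.minidom; the tiny document and its element names are made up for illustration and contain no whitespace between tags, so every child is an element.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
root = doc.documentElement

first = root.firstChild   # the <a> element
last = root.lastChild     # the <c> element
print(first.tagName, last.tagName)

# An element without children has None for both attributes.
no_first = first.firstChild
no_last = first.lastChild
print(no_first, no_last)
```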

The childNodes attribute contains a node list containing all the children of the given node. This attribute is used frequently when working with the DOM. When iterating over children of an element, the childNodes attribute can be used for simple iteration in the same way that you would iterate over a list:

for child in node.childNodes:
  print "Child:", child.nodeName

The value of the childNodes attribute is a NodeList object. For the purpose of retrieving information from the DOM, it behaves like a Python list, but does not support “slicing.” NodeList objects should not be used to modify the content of the DOM as the specific behaviors may differ among DOM implementations.

The NodeList interface features some additional attributes and methods beyond those provided by lists. These are not commonly used with Python, but are available since the DOM specifies their presence and behavior. The length attribute indicates the number of nodes in the list. Note that length gives the total number of nodes, while indexing begins at zero. For example, a NodeList with a length of 3 has nodes at indices 0, 1, and 2 (which mirrors the way a list is normally indexed in Python). Most Python programmers prefer to use the len built-in function, which works properly with NodeList objects.

The item method returns the item at the specific index passed in as a parameter. For example, item(1) returns the second node in the NodeList, or None if there are fewer than two nodes. This is distinct from the Python indexing operation, for which a NodeList raises IndexError for an index that is out of bounds.
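
The difference between item and ordinary indexing can be sketched as follows, assuming Python 3 and the NodeList returned by xml.dom.minidom; the document and variable names are illustrative only.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
children = doc.documentElement.childNodes

# length and len() report the same count.
print(children.length, len(children))

# item() returns the node at an index, or None when out of range.
second = children.item(1)
missing = children.item(5)
print(second.tagName, missing)

# Plain indexing raises IndexError for the same out-of-range index.
try:
    children[5]
    index_error_raised = False
except IndexError:
    index_error_raised = True
print(index_error_raised)
```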

Getting a Node’s Siblings

Since XML documents are hierarchical and the DOM exposes them as a tree, it is reasonable to want to get the siblings of a node as well as its children. This is done using the previousSibling and nextSibling attributes. If a node is the first child of its parent, its previousSibling is None; likewise, if it is the last child, its nextSibling is None. If a node is the only child of its parent, both of these attributes are None, as expected.
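
The sibling attributes can be checked on a small tree. The following is a sketch under the assumption of Python 3 and xml.dom.minidom, with invented element names.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<list><a/><b/><c/></list>")
a = doc.documentElement.firstChild
c = doc.documentElement.lastChild

# The first child has no previous sibling; the last has no next sibling.
print(a.previousSibling, c.nextSibling)

# Walking forward from <a> reaches <b>; walking back from <c> does too.
after_a = a.nextSibling
before_c = c.previousSibling
print(after_a.tagName, after_a is before_c)
```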

When combined with the firstChild or lastChild attributes, the sibling attributes can be used to iterate over an element’s children. The required code is slightly more verbose, but is also better suited for use when the document tree is being modified in certain ways, especially when nodes are being added to or removed from the element whose children are being iterated over.

For example, consider how Directory elements could be removed from another Directory element to leave us with a Directory containing only files. If we iterate over the top element using its childNodes attribute and remove child Directory elements as we see them, some nodes are not properly examined. (This happens because Python’s for loop tracks its position in the list by index, while removing a child shifts the remaining children to the left, so the node that immediately follows a removed child is skipped as the loop advances.) There are many ways to avoid skipping elements, but perhaps the simplest is to use nextSibling to iterate:

child = node.firstChild
while child is not None:
  # Save the sibling first; removal may detach child from its siblings.
  next_child = child.nextSibling
  if (child.nodeType == node.ELEMENT_NODE
      and child.tagName == "Directory"):
    node.removeChild(child)
  child = next_child
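
The skipping problem can be reproduced directly. This sketch assumes Python 3 and xml.dom.minidom, whose childNodes is a list subclass; other DOM implementations may behave differently. The sample document and the count_directories helper are hypothetical.

```python
import xml.dom.minidom

SOURCE = "<Directory><Directory/><Directory/><File/></Directory>"

def is_directory(node):
    """Return True for an element node named Directory."""
    return (node.nodeType == node.ELEMENT_NODE
            and node.tagName == "Directory")

def count_directories(parent):
    """Count the Directory elements directly beneath parent."""
    return sum(1 for child in parent.childNodes if is_directory(child))

# Naive approach: remove children while iterating over childNodes.
naive_doc = xml.dom.minidom.parseString(SOURCE)
naive_top = naive_doc.documentElement
for child in naive_top.childNodes:
    if is_directory(child):
        naive_top.removeChild(child)
naive_left = count_directories(naive_top)

# Sibling-based approach: remember the next sibling before removing.
safe_doc = xml.dom.minidom.parseString(SOURCE)
safe_top = safe_doc.documentElement
child = safe_top.firstChild
while child is not None:
    next_child = child.nextSibling
    if is_directory(child):
        safe_top.removeChild(child)
    child = next_child
safe_left = count_directories(safe_top)

print("Directory children left:", naive_left, "versus", safe_left)
```

With this input, the naive loop should leave one Directory child behind, while the sibling-based loop removes both.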

Extracting Elements by Name

The DOM can provide some advantages over SAX, depending on what you’re trying to do. For starters, when using the DOM, you don’t have to write a separate handler for each type of event, or set flags to group events together as was done earlier with SAX in Example 3-3. Imagine that you have a long record of purchase orders stacked up in XML. Someone has approached you about pulling part numbers, and only part numbers, out of the document for reporting purposes. With SAX, you can write a handler to look for elements with the name used to identify part numbers (sku in the example), and then set a flag to gobble up character events until the parser leaves the part number element. With the DOM, you have a different approach using the getElementsByTagName method of the Document interface.

To show how easy this can make some operations, let’s look at a simple example. Create a new XML file as shown in Example 4-1, po.xml. This document is the sample purchase order for the next script:

Example 4-1. po.xml

<?xml version="1.0"?>
<purchaseOrder>
  <item>
    <name>Mushroom Lamp</name>
    <sku>229-987488</sku>
    <price>$34.99</price>
    <qty>1</qty>
  </item>
  <item>
    <name>Bass Drum</name>
    <sku>228-988347</sku>
    <price>$199.99</price>
    <qty>1</qty>
  </item>
  <item>
    <name>Toy Steam Engine</name>
    <sku>221-388833</sku>
    <price>$19.99</price>
    <qty>1</qty>
  </item>
</purchaseOrder>

Using the DOM, you can easily create a list of nodes that references all nodes of a single element type within the document. For example, you could pull all of the sku elements from the document into a new list of nodes. This list can be used like any other NodeList object, with the difference that the nodes in the list may not share a single parent, as is the case with the childNodes value.

Since the DOM works with the structural tree of the XML document, it is able to provide a simple method call to pull a subset of the document out into a separate node list. In Example 4-2, the getElementsByTagName method is used to create a single NodeList of all the sku elements within the document.

Our example shows that sku elements have text nodes as children, but we know that a string of text in the document may be presented in the DOM as multiple text nodes. To make the tree easier to work with, you can use the normalize method of the Node interface to convert all adjacent text nodes into a single text node, making it easy to use the firstChild attribute of the Element class to retrieve the complete text value of the sku elements reliably.

Example 4-2. po.py

#!/usr/bin/env python

from xml.dom.ext.reader.Sax2 import FromXmlStream
import sys

doc = FromXmlStream(sys.stdin)

for sku in doc.getElementsByTagName("sku"):
  sku.normalize()
  print "Sku: " + sku.firstChild.data

Example 4-2 requires considerably less code than a SAX handler written for the same task. The extraction can operate independently of other tasks that work with the document. When you run the program with po.xml as standard input, you receive something similar to the following on standard output:

Sku: 229-987488
Sku: 228-988347
Sku: 221-388833

You can see something similar being done using SAX in Example 3-3.
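
For readers without 4DOM installed, the same extraction can be sketched with the standard library. This assumes Python 3 and xml.dom.minidom; the purchase order is embedded as an abbreviated string (only the name and sku elements of Example 4-1) so that the sketch is self-contained, and PO_XML and skus are illustrative names.

```python
import xml.dom.minidom

# Abbreviated copy of po.xml, embedded to keep the sketch self-contained.
PO_XML = """<?xml version="1.0"?>
<purchaseOrder>
  <item><name>Mushroom Lamp</name><sku>229-987488</sku></item>
  <item><name>Bass Drum</name><sku>228-988347</sku></item>
  <item><name>Toy Steam Engine</name><sku>221-388833</sku></item>
</purchaseOrder>"""

doc = xml.dom.minidom.parseString(PO_XML)

skus = []
for sku in doc.getElementsByTagName("sku"):
    sku.normalize()                  # merge adjacent text nodes
    skus.append(sku.firstChild.data)

for value in skus:
    print("Sku: " + value)
```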

Examining NodeList Members

Let’s look at a program that puts many of these concepts together, and uses the article.xml file from the previous chapter (Example 3-1). Example 4-3 shows a recursive function used to extract text from a document’s elements.

Example 4-3. textme.py

#!/usr/bin/env python

from xml.dom.ext.reader.Sax2 import FromXmlStream
import sys

def findTextNodes(nodeList):
  for subnode in nodeList:
    if subnode.nodeType == subnode.ELEMENT_NODE:
      print "element node: " + subnode.tagName

      # call function again to get children
      findTextNodes(subnode.childNodes)

    elif subnode.nodeType == subnode.TEXT_NODE:
      print "text node: ",
      print subnode.data

doc = FromXmlStream(sys.stdin)
findTextNodes(doc.childNodes)

You can run this script passing article.xml as standard input:

$> python textme.py < article.xml

It should produce output similar to the following:

element node: webArticle
text node:

element node: header
text node:

element node: body
text node:  Seattle, WA - Today an anonymous individual
                announced that NASA has completed building a
                Warp Drive and has parked a ship that uses
                the drive in his back yard.  This individual
                claims that although he hasn't been contacted by
                NASA concerning the parked space vessel, he assumes
                that he will be launching it later this week to
                mount an exhibition to the Andromeda Galaxy.

text node:

You can see in the output how whitespace is treated as its own text node, and how contiguous strings of character data are kept together as text nodes as well. The exact output you see may vary from that presented here. Depending on the specific parser you use (consider different versions or different platforms as different parsers since the buffering interactions with the operating system can be relevant), the specific boundaries of text nodes may differ, and you may see contiguous blocks of character data presented as more than one text node.
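
Because whitespace shows up as text nodes of its own, a common follow-up is to skip text nodes that contain nothing else. The sketch below is one way to do that, assuming Python 3 and xml.dom.minidom. The short document is a hypothetical stand-in for article.xml, and collect_text is an illustrative helper, not part of the DOM.

```python
import xml.dom.minidom

def collect_text(node_list, found=None):
    """Recursively gather the stripped data of non-blank text nodes."""
    if found is None:
        found = []
    for subnode in node_list:
        if subnode.nodeType == subnode.ELEMENT_NODE:
            # Descend into the element's children.
            collect_text(subnode.childNodes, found)
        elif subnode.nodeType == subnode.TEXT_NODE:
            stripped = subnode.data.strip()
            if stripped:             # ignore whitespace-only nodes
                found.append(stripped)
    return found

# Hypothetical stand-in for article.xml.
SAMPLE = "<webArticle>\n  <header/>\n  <body>Seattle, WA</body>\n</webArticle>"

doc = xml.dom.minidom.parseString(SAMPLE)
texts = collect_text(doc.childNodes)
print(texts)
```

Since a parser may split one run of character data across several text nodes, calling normalize on the document before walking it may give more predictable results.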

Looking at Attributes

Now that we’ve seen how to examine the hierarchical content of an XML document using the DOM, we need to take a look at how we can use the DOM to retrieve XML’s only nonhierarchical component: attributes. As with all other information in the DOM, attributes are described as nodes. Attribute nodes have a very special relationship with the tree structure of an XML document; we find that the interfaces that allow us to work with them are different as well.

When we looked at the child nodes of elements earlier (as in Example 4-3), we only saw nodes for child elements and textual data. From this, we can reasonably surmise that attributes are not children of the element on which they are included. They are available, however, using some methods specific to Element nodes. The Node interface also provides an attribute, named attributes, that is used only for element nodes.

The easiest way to get the value of an attribute is to use the getAttribute method of the element node. This method takes the name of the attribute as a string and returns a string giving the value of the attribute, or an empty string if the attribute is not present. To retrieve the node object for the attribute, use the getAttributeNode method instead; if the attribute does not exist, it returns None. If you need to test for the presence of an attribute without retrieving the node or attribute value, the hasAttribute method will prove useful.
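
The three methods can be compared on a single element. This is a sketch assuming Python 3 and xml.dom.minidom; the header element and its title attribute are invented for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<header title="NASA Builds Warp Drive"/>')
header = doc.documentElement

# getAttribute returns the value, or an empty string if absent.
title = header.getAttribute("title")
absent_value = header.getAttribute("author")

# getAttributeNode returns the attribute node, or None if absent.
title_node = header.getAttributeNode("title")
absent_node = header.getAttributeNode("author")

# hasAttribute tests for presence without retrieving anything.
has_title = header.hasAttribute("title")
has_author = header.hasAttribute("author")

print(title, repr(absent_value))
print(title_node.value, absent_node)
print(has_title, has_author)
```

Because an absent attribute and an attribute set to the empty string both yield an empty string from getAttribute, hasAttribute is the safer test when that distinction matters.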

Another way to look at attributes is using a structure called a NamedNodeMap. This object is similar in function to a dictionary, and the Python version of this structure shares much of the interface of a dictionary. The Node interface includes an attribute named attributes that is only used for element nodes; it is always set to None for other node types. While the NamedNodeMap supports the item method and length attribute much as the NodeList interface does, the normal way of using it in Python is as a mapping object, which supports most of the interfaces provided by dictionary objects. The keys are the attribute names and the values are the attribute nodes.
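
Using the attributes mapping might look like the following sketch, which assumes Python 3 and xml.dom.minidom; the element, its attribute names, and their values are invented for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<header title="NASA Builds Warp Drive" author="Joe">text</header>')
header = doc.documentElement
attrs = header.attributes            # a NamedNodeMap

# Keys are attribute names; indexing by name yields the attribute node.
names = sorted(attrs.keys())
title_value = attrs["title"].value
count = attrs.length

for name in names:
    print(name, "=", attrs[name].value)

# For a non-element node, such as the text child, attributes is None.
text_attrs = header.firstChild.attributes
print(count, text_attrs)
```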
