Changing Documents

Now that we’ve looked at how we can extract information from our documents using the DOM, we probably want to be able to change them. There are really just a few things we need to know to make changes, so we describe the basic operations and then show a few examples. The basic operations involved in modifying a document center around creating new nodes, adding, moving, and removing nodes, and modifying the contents of nodes. Since we often want to add new elements and textual content, we start by looking at creating new nodes.

Creating New Nodes

Most of the time, new nodes need to be created explicitly. Since the DOM is defined as a set of interfaces rather than as concrete classes, the only way to create new nodes is to call methods on the objects we already have in hand. Fortunately, the Document interface includes a large selection of factory methods we can use to create new nodes of most types. (Methods for creating entity and notation nodes are noticeably absent, but most applications should not find themselves constrained by that.)

The most frequently used factory methods are also the simplest: they create new element and text nodes. To create an element, call the createElement method with the tag name of the new element as its only parameter. To create a text node, call the createTextNode method, passing the text of the new node as the parameter. For details on the other node factory methods, see the reference material in Appendix D.
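As a minimal sketch of the two factory methods, the following assumes the standard library's xml.dom.minidom in place of the xml.dom.ext reader used in this chapter's examples; the element names and the size value are sample data only.

```python
from xml.dom.minidom import getDOMImplementation

# Build an empty document whose root element is <directory> (sample name).
impl = getDOMImplementation()
doc = impl.createDocument(None, "directory", None)
root = doc.documentElement

# The factory methods live on the Document object.
file_elem = doc.createElement("file")
size_elem = doc.createElement("size")
size_text = doc.createTextNode("12570")

# A freshly created node is not part of the tree until it is attached.
detached_before_append = size_elem.parentNode is None

size_elem.appendChild(size_text)
file_elem.appendChild(size_elem)
root.appendChild(file_elem)

xml_out = root.toxml()
print(xml_out)
```

Creating a node and attaching it are separate steps: the factory method only produces the node, and one of the methods described in the next section places it in the tree.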

Adding and Moving Nodes

There are some very handy methods available for adding nodes and moving them to different locations in the tree. These methods appear on the basic Node interface, so all DOM nodes provide them. There are constraints on the use of these methods: you cannot use them to construct documents that do not make sense structurally, and the well-formedness of the document is ensured at all times. For example, an exception is raised if you attempt to add a child to a text node, or if you try to add a second child element to the document object.

appendChild( newChild )

Takes a newChild node argument and appends it to the end of the list of children of the node.

insertBefore( newChild , refChild )

Takes the node newChild and inserts it immediately before the refChild node you supply.

replaceChild( newChild , oldChild )

Replaces the oldChild with the newChild, and oldChild is returned to the caller.

removeChild( oldChild )

Removes the node oldChild from the list of children of the node this is called on.

The brief descriptions do not replace the reference documentation for these methods; see Appendix D for more complete information.
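The four methods can be tried together on a small tree. This is only a sketch, assuming xml.dom.minidom from the standard library rather than the DOM implementation used in the chapter's examples; the sample document is made up for illustration.

```python
import xml.dom
from xml.dom.minidom import parseString

doc = parseString("<file><size>12570</size><created>2001</created></file>")
file_elem = doc.documentElement
size_elem, created_elem = file_elem.childNodes

# appendChild: add a new element at the end of the child list.
owner = doc.createElement("userID")
file_elem.appendChild(owner)

# insertBefore: place a new element immediately before an existing child.
modified = doc.createElement("lastModified")
file_elem.insertBefore(modified, created_elem)

# Inserting a node that is already in the tree moves it; it is not copied.
file_elem.insertBefore(owner, size_elem)

# replaceChild: returns the node that was replaced.
group = doc.createElement("groupID")
replaced = file_elem.replaceChild(group, owner)

# removeChild: detach a node from its parent.
file_elem.removeChild(modified)

names = [child.nodeName for child in file_elem.childNodes]
print(names)

# Structural constraints are enforced by raising DOM exceptions.
rejected = []
try:
    doc.appendChild(doc.createElement("secondRoot"))
except xml.dom.HierarchyRequestErr:
    rejected.append("second document element")
try:
    size_elem.firstChild.appendChild(doc.createElement("child"))
except xml.dom.HierarchyRequestErr:
    rejected.append("child of a text node")
print(rejected)
```

Note that moving a node needs no explicit removal step in this implementation: passing an attached node to insertBefore or appendChild takes it out of its old position first.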

Removing Nodes

Let’s look at how to examine a tree, and how to remove specific nodes on the tree. Example 4-4 uses a few nested loops to dive three levels deep into an XML document created using the index.py script from Example 3-4. The design has its limitations, as it assumes you are only dealing with elements no more than three levels deep, but demonstrates the DOM methods we’re interested in.

Example 4-4. domit.py

#!/usr/bin/env python
import sys

from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext             import PrettyPrint

# get DOM object
doc = FromXmlStream(sys.stdin)

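# remove unwanted nodes by traversing Node tree

for node1 in doc.childNodes:
  for node2 in node1.childNodes:
    # walk the third level through sibling links, since removing a
    # node while looping over childNodes could skip its neighbor
    node3 = node2.firstChild
    while node3 is not None:
      # remember the next sibling before node3 is possibly removed
      nextNode = node3.nextSibling
      name = node3.nodeName
      if name in ("contents", "extension", "userID", "groupID"):
        # remove unwanted nodes here via the parent
        node2.removeChild(node3)
      node3 = nextNode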

PrettyPrint(doc)

After reading a document from standard input, the script uses two nested for loops and an inner while loop to descend three levels into the tree and remove elements with specific tag names. When you run the script against the XML document we created with index.py, your file elements should look like this:

<file name='c:\windows\desktop\G-Force\G-Force.doc'>


        <size>12570</size>
        <lastAccessed>Tue May 09 00:00:00 2000</lastAccessed>
        <lastModified>Tue May 09 11:56:14 2000</lastModified>
        <created>Wed Jan 17 23:31:23 2001</created>


</file>

The whitespace around the removed elements remains in place, as you can see from the gaps between elements; we did not look for adjacent text nodes, so they are unaffected. The text shown was produced by the call to the PrettyPrint function at the end of the script. Of course, the element looks the same regardless of its hierarchical position within the document.

When writing DOM processing code, you should try to keep it independent of the structure of the document. Instead of using firstChild to get what you’re after, consider enumerating the children and examining each one. This may cost some processing time, but it lets your code tolerate more variation in the document’s structure: as long as the target element appears among the parent node’s children, it will be found. When you use firstChild, you might be setting yourself up for trouble if someone gives you a document with a slightly different structure, such as a peer element coming before the one you expect.

You can write this type of operation using a recursive function, so that you can handle similar structures regardless of position in the document. If you really don’t care where within the subtree an element is found, you can use the getElementsByTagName method described earlier.
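One way to make the removal independent of depth is a recursive function. The sketch below assumes xml.dom.minidom from the standard library instead of the reader used in Example 4-4; the function name remove_elements and the sample document are hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

UNWANTED = ("contents", "extension", "userID", "groupID")

def remove_elements(node, unwanted=UNWANTED):
    """Recursively remove every element whose tag name is in 'unwanted'."""
    # Loop over a copy: childNodes changes as nodes are removed.
    for child in list(node.childNodes):
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.tagName in unwanted:
            node.removeChild(child)
        else:
            remove_elements(child, unwanted)

# Made-up sample with unwanted elements at two different depths.
doc = parseString(
    "<Directory><file><userID>0</userID><size>12570</size></file>"
    "<Directory><file><size>722</size><extension>.doc</extension></file>"
    "</Directory></Directory>"
)
remove_elements(doc)
result = doc.documentElement.toxml()
print(result)
```

Because the function recurses into every element it keeps, it no longer matters whether a file element sits two levels down or ten.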

Another common requirement is to locate a node that you know must be a child of a particular node, without requiring a specific ordering of the child nodes. A simple loop in a utility function handles this nicely:

from xml.dom import Node

def findChildrenByTagName(parent, tagname):
  """Return a list of 'tagname' children of 'parent'."""
  L = []
  for child in parent.childNodes:
    if (child.nodeType == Node.ELEMENT_NODE
        and child.tagName == tagname):
      L.append(child)
  return L

An even simpler helper that can come in handy is one that finds the first child element with a particular tag name, or the first to have one of several tag names. These are all minor variations of the function just presented.
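A sketch of such a variation follows. The name findFirstChildByTagName and its variable-length argument list are assumptions made for illustration, and the sample document is parsed with xml.dom.minidom from the standard library.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def findFirstChildByTagName(parent, *tagnames):
    """Return the first child element of 'parent' whose tag name is
    one of 'tagnames', or None if there is no such child."""
    for child in parent.childNodes:
        if (child.nodeType == Node.ELEMENT_NODE
                and child.tagName in tagnames):
            return child
    return None

# Made-up sample; the leading space becomes a text node that is skipped.
doc = parseString("<file> <size>12570</size><created>2001</created></file>")
file_elem = doc.documentElement

first = findFirstChildByTagName(file_elem, "created", "size")
missing = findFirstChildByTagName(file_elem, "userID")
print(first.tagName, missing)
```

The result follows document order, not the order of the names passed in, so the call above finds the size element even though "created" is listed first.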

Changing a Document’s Structure

In addition to doing replacements and additions, you can also restructure a document entirely using the DOM.

In Example 4-5, we take the nested loops from the last section and replace them with a recursive function that traverses the tree. The script also works with XML output from the index.py script used in the previous example. In this version, however, each file element is replaced by its own size child. This process leaves the document filled with Directory and size elements only.

Example 4-5 shows domit2.py using a recursive function.

Example 4-5. domit2.py

#!/usr/bin/env python

from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext             import PrettyPrint

import sys

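def makeSize(nodeList):
  """Replace each element that has a size child with that child."""
  for subnode in nodeList:
    if subnode.nodeType == subnode.ELEMENT_NODE:
      if subnode.nodeName == "size":
        # swap the parent (the file element) for its size child;
        # replaceChild is called on the grandparent
        subnode.parentNode.parentNode.replaceChild(
          subnode, subnode.parentNode)
      else:
        # not a size element, so keep descending
        makeSize(subnode.childNodes)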

# get DOM object
doc = FromXmlStream(sys.stdin)

# call func
makeSize(doc.childNodes)

# display altered document
PrettyPrint(doc)

You can run the script from the command line:

$> python domit2.py < wd.xml

The file wd.xml is an XML file created with the index.py script; you can use any file you like, as long as it has the same structure as the files created by index.py. The output should be something like this:

<Directory name='c:\windows\desktop\gl2'>
<size>230444</size>
    <size>3035</size>
    <size>8904</size>
    <size>722</size>
    <Directory name='c:\windows\desktop\gl2/Debug'>
<size>156672</size>
      <size>86016</size>
      <size>3779068</size>
      <size>25685</size>
      <size>17907</size>
      <size>250508</size>
      <size>208951</size>
      <size>402432</size>
    </Directory>
<size>3509</size>
    <size>33792</size>
    <size>722</size>
    <size>48640</size>
    <size>533</size>

  </Directory>
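To try the same restructuring without an index.py output file at hand, here is a self-contained sketch. It assumes xml.dom.minidom from the standard library rather than the reader used in Example 4-5, a small made-up document, and a hypothetical function name make_size; it also copies each child list before looping, since the tree changes during the traversal.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def make_size(node):
    """Replace each element that has a size child with that child."""
    for child in list(node.childNodes):  # copy; the tree changes below
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.nodeName == "size":
            # 'node' is the enclosing file element; swap it for its size.
            node.parentNode.replaceChild(child, node)
            return
        make_size(child)

# Made-up sample in the general shape of the output shown above.
doc = parseString(
    "<Directory name='top'><file name='a'><size>3035</size>"
    "<created>2001</created></file>"
    "<Directory name='sub'><file name='b'><size>722</size></file>"
    "</Directory></Directory>"
)
make_size(doc)
result = doc.documentElement.toxml()
print(result)
```

Returning as soon as the replacement is made avoids walking the remaining children of a file element that is no longer part of the document.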
