Chapter 4. Constraining XML

Learning to use XML, both for data representation and within Java applications, is an iterative process. In fact, almost every time you learn something about XML or one of its sister technologies, you will find that it gives you tools to learn yet another subset of the XML picture. Because there are so many XML-related projects and specifications, you will be hard-pressed to “know all there is to know” about XML; and just when you think you do, new versions of things you had down will come out, and you will get to start all over again! However, the more you understand about the various components that make up the XML technology space, the better equipped you will be to add new components to your programming toolkit. In keeping with this idea, we will now step away from the Java programming language and return to XML-related specifications.

Chapter 2 and Chapter 3 should have given you the information and skills to create a well-formed XML document and then manipulate that document to a limited degree within Java. You should also be starting to form a basic idea of how XML documents are parsed, and how the SAX Java classes aid in this process. In this chapter, we will discuss constraining the XML documents we have been creating. In the next chapter, we will look at how Java can use these constraints during parsing.

Why Constrain XML Data?

Before assuming that you want to know about DTDs and XML Schema, it is only fair to help you understand why we should spend time on these specifications. There are some XML users and technologists who argue that there is never a need for constraining XML and ensuring document validity. Remember, we have already said that an XML document that is valid meets all the constraints that are set upon the document in the referenced DTD or schema. Also recall that a document can be well-formed, but still not be valid. So why go to the trouble to create a DTD or schema that does nothing but impose additional rules on your XML data?

Self-Documentation

As a Java developer, you have hopefully had lots of experience commenting your code, both with Javadoc and inline comments. At some point in your career, you were probably lectured on the importance of these comments; someone may have to read your code, someone may have to maintain your code, someone may actually have to understand your code. If you are involved in open source projects, the importance of commenting rises to even higher levels. And at some point, you probably rushed a project to completion to meet tight deadlines, and weren’t exactly verbose in your comments. Then about three months later, another developer left with the task of supporting your project came to you and asked what this block of code did, or how that task was accomplished. Hopefully, you rattled off the correct explanation, but more likely you looked at him blankly and couldn’t remember how you managed that particular feat of coding wizardry. At that point, you learned the value of documentation.

Now, XML data is certainly not code, and simply because of the element nesting and other syntactical rules, it is almost always easier to understand than a snippet of complex Java code. However, don’t be so sure that your outlook on data representation is the same outlook that other content authors may have. The simple XML file in Example 4-1 is an excellent example.

Example 4-1. An Ambiguous XML File

<?xml version="1.0"?>

<page>
  <screen>
    <name>Commerce</name>
    <trimColor>#CC9900</trimColor>
    <fontFace>Arial</fontFace>
  </screen>
  <content>
    <p>Lots of content would go here</p>
  </content>
</page>

The purpose of the file in Example 4-1 seems abundantly clear. It gives information to an application about a particular screen to render to a client. The color of the page trim is given, as well as the font to use, and then content for the screen is included. Where is the ambiguity? Well, it only shows up when another XML document used within the same application is seen, as in Example 4-2.

Example 4-2. A Less Ambiguous XML File

<?xml version="1.0"?>

<page>
  <screen>
    <name>Commerce</name>
    <trimColor>#CC9900</trimColor>
    <fontFace>Arial</fontFace>
  </screen>
  <screen>
    <name>Message Center</name>
    <trimColor>#9900FF</trimColor>
    <fontFace>Arial</fontFace>
  </screen>
  <screen>
    <name>News Center</name>
    <trimColor>#EECCEE</trimColor>
    <fontFace>Helvetica</fontFace>
  </screen>
  <content>
    <p>Lots of content would go here</p>
  </content>
</page>

Suddenly, our interpretation of the first XML file seems to be wrong. The screen element cannot represent the current screen, as the second example has three screen elements. In actuality, the application is rendering links to available screens at the top of the page, and the screen elements denote what each of these links should look like: the name of the link, the color of the section, and the font face of the link’s title. The first example happened to have only one screen to link to, creating confusion. Only the content author or application developer could look at the first XML document and know this.

Constraining XML documents can aid in documenting these confusing situations. If we knew that only one screen element was allowed within an XML page, we could safely stand by our first assumption about the use of the screen element. However, if we knew that multiple screen elements were allowed, then even with only the first XML document in hand, we could make a better estimate of the purpose of the data. To put it another way, a well-formed XML document contains words that are all found in the dictionary. The words have meaning, but can be used in meaningless ways: “Fox cat run happily smear bread jelly down.” Validity ensures that these “words” (elements and attributes in XML) are put together in ways that make sense: “Foxes and cats happily run toward the bread smeared with jelly.”

Documenting the “correct,” or “valid,” combinations of elements and attributes is the job of the DTD or schema. This is an important use of DTDs and schemas, in that they offer self-documentation of XML data in a meaningful way (and one that you can remember when your co-worker wants to know what your XML data means!).
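To make this concrete, here is a minimal sketch of how the “multiple screens allowed” rule could be written down and checked. Everything in it is an assumption made for illustration: the DTD is hypothetical (the application’s real constraints are not shown here), the class and method names are invented, and the parsing uses the JAXP classes bundled with current JDKs rather than any particular parser setup from the earlier chapters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ScreenConstraintSketch {

    // Hypothetical DTD: a page holds one or more screen elements,
    // followed by exactly one content element.
    static final String DOCTYPE =
        "<!DOCTYPE page [\n"
      + "  <!ELEMENT page (screen+, content)>\n"
      + "  <!ELEMENT screen (name, trimColor, fontFace)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT trimColor (#PCDATA)>\n"
      + "  <!ELEMENT fontFace (#PCDATA)>\n"
      + "  <!ELEMENT content (p*)>\n"
      + "  <!ELEMENT p (#PCDATA)>\n"
      + "]>\n";

    /** Parses a page with validation on and returns any validity errors. */
    static List<String> validate(String pageXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // check the DTD, not just well-formedness
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors are recoverable
            }
        });

        String doc = "<?xml version=\"1.0\"?>\n" + DOCTYPE + pageXml;
        builder.parse(new InputSource(new StringReader(doc)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String screen = "<screen><name>Commerce</name>"
            + "<trimColor>#CC9900</trimColor><fontFace>Arial</fontFace></screen>";
        String content = "<content><p>Lots of content would go here</p></content>";

        // One screen and three screens both satisfy (screen+, content).
        System.out.println(validate("<page>" + screen + content + "</page>"));
        System.out.println(
            validate("<page>" + screen + screen + screen + content + "</page>"));

        // A page with no screen at all is well-formed but not valid.
        System.out.println(validate("<page>" + content + "</page>"));
    }
}
```

A reader who sees `screen+` in the constraint knows at once that a page may carry several screen elements, even when the only document at hand contains just one.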

Portability

In addition to helping viewers of your XML documents understand how and what data is being represented, constraining XML aids other applications in understanding XML data. We touched on this earlier; given any two arbitrary applications, the two cannot be assumed to have shared resources. In other words, the program that created an XML document for one application may not be available to the other application, hiding the logic by which data was generated in XML. This leaves the second application with the task of determining what type of data is being received in a transmitted XML document. Without any aid, the second application can only make assumptions about what is meant, and those assumptions are often incorrect.

This is somewhat similar to the portability problems that the C language has had, and that Java has tried to remedy. Because Java defines a platform-independent language and runtime, and pure Java programs rely on no native code, it has become one of the most portable programming languages available today. This is because there is a set of constraints put upon what Java can do, and these constraints are available to all platforms; while implementation details for tasks such as garbage collection and thread management are left to the specific platform, the interface to those tasks is always the same for the application developer.

Constraining XML documents with DTDs or schemas provides an analogous portability in XML. Consider our original example in this section: if the second application could access a resource that described the allowable formats of the data it is receiving, it could process that data with an XML-based set of utilities. Because the constraints of the document are not coded directly into the application (either the first or the second), there is no application logic that would have to be changed if the format of the document changed. The DTD or schema would change, but because this is simply a textual constraint file, neither application would have to be modified to immediately utilize the document structure changes. This allows XML data to be portable without having to resort to application-specific code, similar to the native code we try to avoid in Java programs.
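The idea that constraints live outside the application can be sketched as follows. This is only an illustration under stated assumptions: the file name `page.dtd`, both DTD variants, and the class and method names are hypothetical, and the DTD text is handed to the parser through a JAXP `EntityResolver` so that the sketch stays self-contained instead of reading a real file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ExternalConstraintSketch {

    // The document names its constraints; it does not embed them.
    static final String DOCUMENT =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE page SYSTEM \"page.dtd\">\n"
      + "<page><screen><name>Commerce</name></screen>"
      + "<screen><name>News Center</name></screen>"
      + "<content>Lots of content would go here</content></page>";

    // Declarations shared by both hypothetical versions of page.dtd.
    static final String COMMON =
        "<!ELEMENT screen (name)>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT content (#PCDATA)>\n";

    // Version 1 allows exactly one screen; version 2 allows one or more.
    static final String ONE_SCREEN_DTD =
        "<!ELEMENT page (screen, content)>\n" + COMMON;
    static final String MANY_SCREENS_DTD =
        "<!ELEMENT page (screen+, content)>\n" + COMMON;

    /** Validates xml against whatever text currently stands in for page.dtd. */
    static List<String> validate(String xml, String dtdText) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Supply the DTD text when the parser asks for page.dtd.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("page.dtd")
                ? new InputSource(new StringReader(dtdText))
                : null);

        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage());
            }
        });

        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Same document, same code; only the constraint text differs.
        System.out.println(validate(DOCUMENT, ONE_SCREEN_DTD).isEmpty());
        System.out.println(validate(DOCUMENT, MANY_SCREENS_DTD).isEmpty());
    }
}
```

The `validate` method never mentions screen or content. Relaxing the rule from one screen to many is a change to the constraint text alone, which is the kind of portability described above.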

Whether it is for documentation purposes, portability across applications and systems, or simply because it allows stricter checking of XML data, constraining XML is almost always a good idea. The only view not yet addressed is that the performance hit taken for validating XML outweighs the gain from more structured data. This is a sound point; validating data does take additional processing time. However, many good publishing frameworks, such as the Apache Cocoon project, let you specify whether or not to validate a document. This means that development and testing can be performed with validation turned on. Then, once a document’s structure is sound and tested, the framework can be told not to validate the document. Applications receiving this data can choose in a similar fashion whether to validate the document, as the document will still contain a reference to the DTD or schema against which it is valid. In this way, the benefits of validation can be gained during development without paying for the additional processing time in production. Consult the vendor of any XML framework you consider using to see if this feature is supported.
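At the parser level, this kind of switch might look like the sketch below. It does not show how Cocoon or any other framework exposes the option; it only assumes the JAXP `SAXParserFactory` bundled with current JDKs, and the DTD, document, and names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidationToggleSketch {

    // Well-formed but not valid: the hypothetical DTD requires a name
    // inside every screen, and this screen is empty.
    static final String DOCUMENT =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE page [\n"
      + "  <!ELEMENT page (screen+)>\n"
      + "  <!ELEMENT screen (name)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "]>\n"
      + "<page><screen/></page>";

    /** Parses DOCUMENT and counts the validity errors the parser reports. */
    static int countValidityErrors(boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate); // the on/off switch
        SAXParser parser = factory.newSAXParser();

        int[] count = {0};
        parser.parse(new InputSource(new StringReader(DOCUMENT)),
            new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    count[0]++; // only raised for recoverable (validity) errors
                }
            });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Development and testing: validation on, problems are reported.
        System.out.println("validating:     " + countValidityErrors(true));
        // Production: validation off, the same document parses silently.
        System.out.println("not validating: " + countValidityErrors(false));
    }
}
```

With the flag off, the parser still insists on well-formedness; it simply stops checking the document against its DTD.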

In production systems, validation provides value in business-to-business applications; validation can ensure that data received from other applications, often ones you have no control over, is correctly formatted. This can help avoid errors in your application resulting from erroneous data input. For all of these purposes, DTDs and schemas are invaluable.
