XML Basics

Markup technology has a long and rich history. In the 1960s, while developing an integrated document storage, editing, and publishing system at IBM, Charles Goldfarb, Edward Mosher, and Raymond Lorie devised a text-based markup format. It extended the concepts of generic coding (block-level tagging that was both machine-parsable and meaningful to human authors) to include formal, nested elements that defined the type and structure of the document being processed. This format was called the Generalized Markup Language (GML). GML was a success, and as it was more widely deployed, the American National Standards Institute (ANSI) invited Goldfarb to join its Computer Languages for Text Processing committee to help develop a text description standard based on GML. The result was the Standard Generalized Markup Language (SGML). In addition to the flexibility and semantic richness offered by GML, SGML incorporated concepts from other areas of information theory, most notably inter-document link processing and a practical means to programmatically validate markup documents by ensuring that the content conformed to a specific grammar. These features (and many more) made SGML a natural and capable fit for larger organizations that needed to ensure consistency across vast repositories of documents. By the time the final ISO SGML standard was published in 1986, it was in heavy use by bodies as diverse as the Association of American Publishers, the U.S. Department of Defense, and the European Laboratory for Particle Physics (CERN).

In 1990, while developing a linked information system for CERN, Tim Berners-Lee hit on the notion of creating a small, easy-to-learn subset of SGML. It would allow people who were not markup experts to easily publish interconnected research documents over a network, specifically the Internet. The Hypertext Markup Language (HTML) and its sibling network technology, the Hypertext Transfer Protocol (HTTP), were born. Four years later, after widespread and enthusiastic adoption of HTML by academic research circles around the globe, Berners-Lee and others formed the World Wide Web Consortium (W3C) in an effort to create an open but centralized organization to lead the development of the Web.

Without a doubt, HTML brought markup technology into the mainstream. Its simple grammar, combined with a proliferation of HTML-specific markup presentation applications (web browsers) and public commercial access to the Internet, sparked what can only be called a popular electronic markup publishing explosion. No longer was markup solely the domain of information technology specialists working with complex, mainframe-based publishing tools inside the walls of huge organizations. Anyone with a home PC, a dial-up Internet account, and the patience to learn HTML’s intentionally forgiving syntax and grammar could publish their own rich hypertext documents for the rest of the wired world to see and enjoy.

HTML made markup popular, but it was a single, predefined grammar that only indicated how a document was to be presented visually in a web browser. That meant much of the flexibility offered by markup technology, in general, was simply lost. All the markup reliably communicated was how the document was supposed to look, not what it was supposed to mean. In the mid-1990s, work began at the W3C to create a new subset of SGML for use on the Web—one that provided the flexibility and best features of its predecessor but could be processed by faster, lighter tools that reflected the needs of the emerging web environment. In 1996, W3C members Tim Bray and C. M. Sperberg-McQueen presented the initial draft for this new “simplified SGML for Web”—the Extensible Markup Language (XML). Two years later in 1998, after much discussion and rigorous review, the W3C published XML 1.0 as an official recommendation.

In the six years since, interest in XML has steadily grown. While XML is not as ubiquitous as some claim, tools to process it are available for the most popular programming languages, and it has been used in some fairly novel (though not always appropriate) ways. Given its generic nature, its inherent flexibility, and the ways in which it has been (or can be) used, XML is hard to pigeonhole. It remains largely an enigma to many developers. At its core, XML is nothing more or less than a text-based format for applying structure to documents and other data. Uses for XML are (and will continue to be) many and varied, but looking back at its history, one inextricably bound to automated document publishing, helps to provide a reasonable context.

Many people, especially those coming to XML from a web-development background, seem to expect that it is either intended to replace HTML or that it is somehow HTML: The Next Generation—neither is the case. Although both are markup languages, HTML defines a specific markup grammar (set of elements, allowed structures) intended for consumption by a single type of application: an HTML web browser. XML, on the other hand, does not define a grammar at all. Rather, it is designed to allow developers to use (or create) a grammar that best reflects the structure and meaning of the information being captured. In other words, it gives you a clear way to create the rich, reusable source content crucial to modern adaptive web-publishing systems.

To understand the value of using a more semantically meaningful markup grammar, consider the task of publishing a poetry collection. If you know HTML and want to get the collection onto the Web quickly, you could create a document, such as the one shown in Example 1-1, for each poem.

Example 1-1. poem.html

<html>
  <head>
    <title>Post-Geek-chic Folk Poetry Collection</title>
  </head>
  <body>
  <h1>An Ode To Directed Acyclic Graphs</h1>
  <p><i>by: Anonymous</i></p>
  <p>
   I think that I shall never see, <br>
   a document that cannot be represented as a tree.
  </p>
  </body>
</html>

If your only goal is to publish your poetic gems on the Web for people to view in a browser, then once you upload the documents to the right location on an appropriate server somewhere, the job is done. What if you want to do more? At the very least, you will probably want an index document containing a list of links to the poems in your collection. If the collection remains small and time is not a consideration, you could create this index by hand. More likely, though, because you are a professional web developer, you would write a small script that extracts information (title and author) from the poems themselves and creates the index document programmatically. That’s when the weakness in your approach begins to show. Specifically, using HTML to mark up your poetry only gave you a way to present the work visually. In your attempt to extract the title and author’s name, you are forced to impose meaning based solely on inference and your knowledge of the conventions used when marking up the poems. You can infer that the first h1 element contains the title of the poem, but nothing states this explicitly. You must trust that all poems in the collection will follow the same structure. Even in the best case, you can only guess and hope that your guess holds up in the long run.
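To make that fragility concrete, here is a minimal Python sketch of such an indexing script, run against the markup from Example 1-1. The names `PoemScraper` and `guess_title_and_author` are hypothetical, and the rules it encodes (the first h1 is the title, the first p after it is the byline, a leading "by:" is a label) are assumptions about authoring conventions, not anything the HTML itself guarantees.

```python
from html.parser import HTMLParser

# The markup from Example 1-1, inlined so the sketch is self-contained.
POEM_HTML = """<html>
  <head>
    <title>Post-Geek-chic Folk Poetry Collection</title>
  </head>
  <body>
  <h1>An Ode To Directed Acyclic Graphs</h1>
  <p><i>by: Anonymous</i></p>
  <p>
   I think that I shall never see, <br>
   a document that cannot be represented as a tree.
  </p>
  </body>
</html>"""


class PoemScraper(HTMLParser):
    """Guess title and author from presentational markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = None     # text of the first h1, assumed to be the title
        self.author = None    # text of the first p after that h1, assumed to be the byline
        self._capture = None  # name of the field currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._capture = "title"
        elif tag == "p" and self.title is not None and self.author is None:
            self._capture = "author"

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if (tag, self._capture) in {("h1", "title"), ("p", "author")}:
            text = "".join(self._buffer).strip()
            if self._capture == "author" and text.lower().startswith("by:"):
                # Convention, not structure: drop the "by:" label.
                text = text[len("by:"):].strip()
            setattr(self, self._capture, text)
            self._capture, self._buffer = None, []


def guess_title_and_author(html_text):
    """Return a (title, author) guess for one poem page."""
    scraper = PoemScraper()
    scraper.feed(html_text)
    return scraper.title, scraper.author


print(guess_title_and_author(POEM_HTML))
```

The sketch works for this one page, but only because the page happens to follow the assumed conventions. If a poem carried a subtitle paragraph between the heading and the byline, the same code would report the subtitle as the author without any error.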

Marking up your poetry collection in XML can help you avoid such ambiguities. It is not the use of XML, per se, that helps. Rather, XML gives you a familiar syntax (nested angle-bracketed tags with attributes, like those in HTML) while offering the flexibility to choose a grammar that more closely describes the structure and meaning of the content. It would help simplify your indexing script, for example, if something like an author element contained the author’s name. You would not have to rely on an unstable heuristic such as “the string that follows the word ‘by,’ optionally contained in an i element, that is in the first p element after the first h1 element in the document” to extract the data. Essentially, you want to use a more exact, domain-specific grammar whose structures and elements convey the meaning of the data. XML provides a means to do that.

Not surprisingly, marking up poetic content is a task that others before you have faced. A quick web search reveals several XML grammars designed for this purpose. A short evaluation of each reveals that the poemsfrag Document Type Definition (DTD) from Project Gutenberg (a volunteer effort led by the HTML Writers Guild to make the world’s great literature available as electronic text) fits your needs nicely. Using the grammar defined by poemsfrag.dtd, the sample poem from your collection takes the form shown in Example 1-2.

Example 1-2. poem.xml

<?xml version="1.0"?>
<poem>
  <title>An Ode To Directed Acyclic Graphs</title>
  <author>Anonymous</author>
  <verse>
    <line>I think that I shall never see,</line>
    <line>a document that cannot be represented as a tree.</line>
  </verse>
</poem>

Using this more specific grammar makes extracting the title and author data for the index document completely unambiguous: you simply grab the contents of the title and author elements, respectively. In addition, you can now easily generate other interesting metadata, such as the number of verses per poem, the average lines per verse, and so on, without dubious guesswork. Moreover, having an explicit, concrete Document Type Definition that describes your chosen grammar provides the chance to programmatically validate the structure of each poem you add to the collection. This helps to ensure the integrity of the data from the outset.
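As a rough illustration of how direct that extraction becomes, the following Python sketch reads the markup from Example 1-2 with the standard library's `xml.etree.ElementTree` module. The function name `summarize_poem` and the keys of the dictionary it returns are hypothetical, chosen only for this example. Note that ElementTree checks well-formedness only; it does not validate against a DTD, so validation would need a separate, validating parser.

```python
import xml.etree.ElementTree as ET

# The markup from Example 1-2, inlined so the sketch is self-contained.
POEM_XML = """<?xml version="1.0"?>
<poem>
  <title>An Ode To Directed Acyclic Graphs</title>
  <author>Anonymous</author>
  <verse>
    <line>I think that I shall never see,</line>
    <line>a document that cannot be represented as a tree.</line>
  </verse>
</poem>"""


def summarize_poem(xml_text):
    """Return index data and simple statistics for one poem (hypothetical helper)."""
    poem = ET.fromstring(xml_text)
    verses = poem.findall("verse")
    # Count the line elements inside each verse element.
    line_counts = [len(verse.findall("line")) for verse in verses]
    return {
        "title": poem.findtext("title"),
        "author": poem.findtext("author"),
        "verses": len(verses),
        "avg_lines_per_verse": sum(line_counts) / len(verses) if verses else 0.0,
    }


print(summarize_poem(POEM_XML))
```

Nothing here depends on element order, on a "by:" label, or on which paragraph comes first. The element names carry the meaning, so the script asks for title, author, verse, and line by name.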

Tip

Choosing the best grammar (or data model, if you must) for your content is crucial: get it right and the tools to process your documents will grow logically from the structure; get it wrong and you will spend the life of the project working around a weak foundation. Designing useful markup grammars that hold up over time is an art in itself; resist the urge to create your own just because you can. Chances are there is already a grammar available for the class of documents you will mark up. Evaluate what’s available. Even if you decide to go your own way, the time spent seeing how others approached the same problem more than pays for itself.

Switching to XML and the poemsfrag grammar arguably adds significant value to your documents: the structure reveals (or imposes) the intended meaning of the content. At the very least, this reduces time wasted on messy guessing, both for those marking up the poems and for those writing tools to process them. However, you lose something as well. You can no longer simply upload the documents to a web server and expect browsers to do the right thing when rendering them (as you could when they were marked up as HTML). There is a gap between the grammar that is most useful to you, as an author and tool builder, and the grammar that an HTML web browser expects. Since publishing your poetry online was the goal in the first place, unless you can bridge that gap, and do so easily, you have really taken a step backward.
