Filed under ebooks, epub.

If I say that a document is in “XML”, I’m not really saying anything very specific. All I’ve told you is that the document has some text wrapped in various angle-brackets, and that those angle-brackets are “well-formed.” A well-formed XML document is simply one in which everything that opens also closes, and in which the angle-brackets nest inside each other rather than overlapping.

Well-formedness doesn’t tell you anything about the information encoded in those angle-brackets (the tagged units are really called elements). If the element is called <i>, does that mean “put this text in italics”? Or “indent”? Or even, “The following text is about me”?
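As a rough illustration of the point above, here is a minimal Python sketch using the standard library’s ElementTree parser. The helper name is_well_formed and the sample strings are made up for this example. The parser happily accepts an <i> element without knowing or caring what <i> is supposed to mean; all it checks is that the markup opens and closes properly.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False otherwise.

    This checks well-formedness only; it says nothing about meaning.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents for illustration.
good = "<note><i>hello</i></note>"
bad = "<note><i>hello</note></i>"  # elements closed in the wrong order

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```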

In order to know what an XML document actually means, you need to know its schema. A schema is a kind of dictionary that defines all the names of the elements and, to some extent, what they mean. It also describes the grammar of the document: for example, we might say that a <chapter> can be inside a <book> but not the other way around.
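To make the “grammar” idea concrete, here is a toy sketch in Python. It is not a real schema language (in practice you would use something like a DTD, RELAX NG, or XML Schema with a proper validator); the ALLOWED_CHILDREN table and the find_violations helper are invented for this example and simply record which elements may appear inside which.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: the child elements each parent may contain.
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}


def find_violations(elem, path=""):
    """Return a list of messages for elements that break the toy grammar."""
    here = f"{path}/{elem.tag}"
    if elem.tag not in ALLOWED_CHILDREN:
        return [f"{here}: unknown element"]
    problems = []
    for child in elem:
        known = child.tag in ALLOWED_CHILDREN
        if known and child.tag not in ALLOWED_CHILDREN[elem.tag]:
            problems.append(
                f"{here}: <{child.tag}> not allowed inside <{elem.tag}>"
            )
        # Keep checking deeper levels (unknown children are reported there).
        problems.extend(find_violations(child, here))
    return problems


valid = (
    "<book><title>T</title>"
    "<chapter><title>One</title><para>Text</para></chapter></book>"
)
invalid = "<chapter><book><title>T</title></book></chapter>"

print(find_violations(ET.fromstring(valid)))
print(find_violations(ET.fromstring(invalid)))
```

Both sample documents are well-formed; only the first one follows the grammar, which is exactly the distinction a schema adds.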

You can make up your own schema, and that’s often advisable when modeling a unique business practice. But books and other kinds of literature are well-understood, and there’s already been a huge amount of thought put into how to properly model them in XML. If you’re in digital publishing, these are the three schemas you’re most likely to come across when modeling written works:


DocBook

Originally designed for technical books, DocBook has emerged as an excellent general-purpose book schema. Because it’s in wide use, there are a lot of modern tools that understand it (including the excellent oXygen XML editor), and it’s trivial to generate other formats, including PDF and HTML, from a DocBook source.

Here’s a really simple DocBook document, in this case describing an article rather than a whole book:

<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">
  <title>Sample article</title>
  <para>This is a very short article.</para>
</article>
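
To give a feel for why generating other formats from DocBook is so approachable, here is a deliberately tiny Python sketch that turns an article like the one above into an HTML fragment. It assumes the article is declared in the DocBook 5 namespace (http://docbook.org/ns/docbook) and only understands <title> and <para>; the function name docbook_to_html is made up for this example. Real toolchains typically use the DocBook XSL stylesheets rather than hand-written code like this.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: the document uses the DocBook 5 namespace.
DOCBOOK_NS = {"db": "http://docbook.org/ns/docbook"}

SOURCE = """<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">
  <title>Sample article</title>
  <para>This is a very short article.</para>
</article>"""


def docbook_to_html(xml_bytes: bytes) -> str:
    """Convert a minimal DocBook article (title + paras) to an HTML fragment."""
    root = ET.fromstring(xml_bytes)
    title = root.findtext("db:title", default="", namespaces=DOCBOOK_NS)
    parts = [f"<h1>{escape(title)}</h1>"]
    for para in root.findall("db:para", DOCBOOK_NS):
        # itertext() also picks up text inside any inline child elements.
        parts.append(f"<p>{escape(''.join(para.itertext()))}</p>")
    return "\n".join(parts)


print(docbook_to_html(SOURCE.encode("utf-8")))
```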


TEI

The Text Encoding Initiative (TEI) schema is also used to model textual works, but it adds methods for encoding historical and academic texts. TEI allows document authors to include revision history, extensive footnoting and cross-references, and provides a rich tagging mechanism for poetry, drama, and other forms of human literature.

TEI is frequently used in library digitization and archiving projects, and it can be used to encode texts that might seem otherwise impossible to render in XML.
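
As a small, hedged illustration of TEI’s tagging for drama, the Python sketch below reads a short fragment that uses TEI’s <sp> (speech), <speaker>, and <l> (verse line) elements and groups the lines by speaker. The fragment itself and the helper name lines_by_speaker are invented for this example; a real TEI document would also carry a header and a good deal more structure.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment for illustration, not a complete TEI document.
FRAGMENT = """<div xmlns="http://www.tei-c.org/ns/1.0" type="scene">
  <sp who="#hamlet">
    <speaker>Hamlet</speaker>
    <l>To be, or not to be, that is the question:</l>
    <l>Whether 'tis nobler in the mind to suffer</l>
  </sp>
  <sp who="#ophelia">
    <speaker>Ophelia</speaker>
    <l>Good my lord,</l>
  </sp>
</div>"""


def lines_by_speaker(xml_text: str) -> dict:
    """Map each speaker's name to the list of verse lines they speak."""
    root = ET.fromstring(xml_text)
    result = {}
    for speech in root.findall("tei:sp", TEI_NS):
        name = speech.findtext(
            "tei:speaker", default="(unknown)", namespaces=TEI_NS
        )
        lines = [line.text or "" for line in speech.findall("tei:l", TEI_NS)]
        result.setdefault(name, []).extend(lines)
    return result


for speaker, lines in lines_by_speaker(FRAGMENT).items():
    print(speaker, len(lines))
```

Because the markup says who is speaking and where each verse line begins, a question like “how many lines does each character have?” becomes a few lines of code rather than a text-mining problem.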


XHTML

In lots of ways, XHTML is wholly unsuited for use in book content. XHTML has almost no semantically meaningful elements as applied to literature: there’s no built-in way to indicate a chapter, or footnotes, or dialogue versus description.

The advantage it does have is that it’s ubiquitous — thanks to the web — and many people who otherwise have no experience in XML or text encoding know at least a little HTML. Because of the web there are probably more works written in HTML today than in any other form in history.

When it is supplemented with other forms of XML that do provide semantic structure, as in ePub, XHTML is demonstrably a useful and important commercial format.


8 Responses to “Three useful XML schemas in publishing”

  1. Brad Scott

    I’ve used all three in my time, but do have a soft spot for TEI, which I have used when designing data for a number of publishers, including the MLA Handbook, Palgrave Dictionary of Economics, and the Statesman’s Yearbook. It’s very flexible and certainly makes it easier to get results more quickly than trying to reinvent the wheel.

    Funnily enough I’d only just read the post on the History Compass blog about the Holinshed site, also using TEI and in particular the Comparator.

  2. John Maxwell

    Of XHTML’s lack of “semantically meaningful elements,” wouldn’t you agree that the “class” attribute—which is typically used to provide stylesheet hooks—can be usefully employed to denote semantic structures? Any reason why we really shouldn’t be doing that?

  3. liza

    John: Sure, but there’s no “built-in” way to do that. Of course it’s totally possible to use class attributes that way, but without a constrained vocabulary, different content producers will do it differently. That’s only bad insofar as it prevents easy interchange.

  4. keith

    It may turn out that the XML way of writing HTML5 allows publishers to get a little bit more semantics than they currently get from XHTML 1.0 without sacrificing broad ubiquity.

  5. Adrienne Adams

    I have just discovered your blog at the end of a full day of researching ebook publishing. I’m a web designer/frontend developer, and am keenly interested in interoperable publishing formats; delighted, as well, to see that some of my expertise with XHTML and CSS can be applied to e-publishing. I look forward to keeping up with your posts and learning more about this subject.