Filed under epub, epub3, geek.

EPUB 3 is tricky to experiment with today. As with any brand-new specification, there aren’t many of the resources we often take for granted, from books to software to validation tools. However, if you’re already comfortable getting your hands dirty, you can get meaningful validation for your EPUB 3 documents now. In the future, we’ll probably have a dedicated EPUB 3 validation tool (modeled somewhat on epubcheck, although with quite a few changes, I hope), but I’d like to start working today. This post outlines how.

Note: I’m going to give examples using a number of bare-metal tools available on Mac OS X. These are probably portable to Linux and even Windows if you are motivated, but I’m not going to explain how to install them or set them up (here or in the comments). Google is your friend.

To get started, download all of the EPUB 3 schemas (I put them in an epub30-schemas/ directory), install the absolute latest version of the RELAX NG validator Jing (jing-20091111/ for me), download the ISO Schematron tools for XSLT1 processors, iso-schematron-xslt1.zip (iso-schematron/ for me), and make sure you’ve got access to both xsltproc and java. Finally, save this as svrl_as_text.xsl:

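<?xml version="1.0"?>
<!-- svrl_as_text.xsl: reduce an SVRL report to one plain-text line per failure -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
                version="1.0">
  <xsl:output method="text"/>

  <!-- Walk the whole report, but emit nothing for ordinary nodes (text included) -->
  <xsl:template match="*|node()">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Failed assertions and triggered reports are both validation problems -->
  <xsl:template match="svrl:failed-assert|svrl:successful-report">
    <xsl:text>FAILURE: </xsl:text>
    <xsl:value-of select="local-name(.)"/>
    <xsl:text>: </xsl:text>
    <!-- Use the whole message, including text inside any inline elements -->
    <xsl:value-of select="normalize-space(svrl:text)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>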

Layout of the EPUB 3 schemas

All of the schemas for EPUB 3 are available as RELAX NG, and some are also available as Schematron. Each language has specific strengths, so we use both schemas whenever possible to get a complete list of the validation issues. The EPUB 3 schemas are broken into separate files, one for each type of document inside an EPUB 3. There is a RELAX NG Compact .rnc file for each type, for example epub30-schemas/epub-nav-30.rnc for the EPUB Navigation Document.

Unsurprisingly, you use a RELAX NG validator with epub30-schemas/media-overlay-30.rnc to validate a Media Overlay document.

A few of these document types also have a Schematron schema with the same prefix but ending with .sch (epub30-schemas/epub-nav-30.sch, for example), which expresses requirements that can’t be stated in RELAX NG.

There are some standalone Schematron validators, but we’re actually going to roll our own tool for more human-readable output.

There’s a third file extension too, .nvdl, which is short for Namespace-based Validation Dispatching Language. These files are supposed to wrap these two schemas together for unified validation tools, but there isn’t good software support for NVDL today. Ignore the .nvdl files for now.

What to validate

I’m currently interested in the EPUB Navigation Document, a reformulation of EPUB’s NCX document as XHTML, so these are the examples we’ll use. However, this approach should work for any of the other document types if you go through the same setup.

Here is a valid, if short, EPUB Navigation Document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:epub="http://www.idpf.org/2007/ops"
      profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Good)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav epub:type="toc" id="toc">
        <ol>
          <li class="toc-prelim" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>

And here is one with a few errors that should be reported as invalid:


<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:epub="http://www.idpf.org/2007/ops"
      profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Bad)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav>
        <!-- this is omitted, which is invalid: epub:type="toc" id="toc" -->
        <ol>
          <li class="toc-prelim" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
            <!-- This is invalid -->
            <span/>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>

RELAX NG validation with Jing

Once you’ve got jing set up, it’s pretty straightforward to validate our files (above) against the appropriate .rnc. We’ll be using the epub30-schemas/epub-nav-30.rnc schema.

When you run jing against a file and it passes, you get no output (good) and an exit code of 0. I’m calling jing as java -jar jing-20091111/bin/jing.jar, passing the -c flag to tell it to expect a Compact version of RELAX NG, and then the schema filename followed by the filename of the document to validate:
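A sketch of that invocation, wrapped in a small shell function so the schema and documents are easy to swap. The jar path and schema path are the ones used in this post; the JAVA and JING_JAR variables, the validate_rnc function name, and the good.nav.html file name are assumptions made for illustration.

```shell
# Sketch only: JAVA, JING_JAR, and validate_rnc are illustrative names.
JAVA="${JAVA:-java}"
JING_JAR="${JING_JAR:-jing-20091111/bin/jing.jar}"

# validate_rnc SCHEMA.rnc DOCUMENT...
# -c tells jing that the schema uses the RELAX NG Compact syntax.
validate_rnc() {
  schema="$1"
  shift
  "$JAVA" -jar "$JING_JAR" -c "$schema" "$@"
}

# Example (good.nav.html is an assumed file name for the valid sample):
#   validate_rnc epub30-schemas/epub-nav-30.rnc good.nav.html
#   echo "exit code: $?"
```

Because the function passes every remaining argument to jing, several documents can be checked against the same schema in one call.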

Unlike earlier versions of jing, the latest versions have much clearer error reports on invalid documents (we also saw this improvement in epubcheck 1.2 thanks to George Bina from oXygen):

…and the exit code is not 0, just as expected:

We can take apart that first bit of output, bad.nav.html:21:20, to know which file had the error (we could run it on many at once) and also the line number (21) and character on that line (20). Line 21 has just what we would expect given the error message (a span instead of another ol or the end of this one), but for other errors the location can be quite illuminating.
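If you want to jump straight to the offending line, the file:line:column prefix can be split with standard tools. This is a minimal sketch: the function names are hypothetical, the message text after the prefix is a placeholder rather than real jing output, and it assumes file names that contain no colons.

```shell
# parse_jing_location "FILE:LINE:COLUMN: message"
# Prints "FILE LINE COLUMN" (assumes FILE contains no colon).
parse_jing_location() {
  printf '%s\n' "$1" | cut -d: -f1-3 | tr ':' ' '
}

# show_jing_line "FILE:LINE:COLUMN: message"
# Prints the source line that the error message points at.
show_jing_line() {
  file=$(printf '%s\n' "$1" | cut -d: -f1)
  line=$(printf '%s\n' "$1" | cut -d: -f2)
  sed -n "${line}p" "$file"
}

# Example:
#   parse_jing_location "bad.nav.html:21:20: error: (message text)"
#   prints: bad.nav.html 21 20
```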

Note: For really large documents, you may get a java.lang.OutOfMemoryError or other exception. Find out how to give jing more “heap space”.
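One common way to do that is the JVM’s -Xmx option, which sets the maximum heap size. The sketch below assumes the same jar path as above; the 1g value is arbitrary, and JAVA, JING_JAR, JING_HEAP, and validate_rnc_big are hypothetical names.

```shell
# Sketch only: variable and function names are illustrative.
JAVA="${JAVA:-java}"
JING_JAR="${JING_JAR:-jing-20091111/bin/jing.jar}"
JING_HEAP="${JING_HEAP:-1g}"   # arbitrary default; raise it if errors persist

# validate_rnc_big SCHEMA.rnc DOCUMENT...
# Same as a plain jing call, but with a larger maximum Java heap.
validate_rnc_big() {
  schema="$1"
  shift
  "$JAVA" "-Xmx$JING_HEAP" -jar "$JING_JAR" -c "$schema" "$@"
}
```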

Schematron validation with XSLT

Validating the Schematron schemas is a little more involved, but it catches some validation errors that jing and RELAX NG just cannot find. First we turn the .sch file into a re-usable XSLT stylesheet that produces Schematron Validation Report Language (SVRL). We can then run that stylesheet on any document of that type inside an EPUB 3 file to produce SVRL, which we then transform into something human-readable.

First we create our validation stylesheet, epub-nav-30.sch.xsl, from the epub30-schemas/epub-nav-30.sch Schematron schema:
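A sketch of that compilation step, following the usual three-stage ISO Schematron pipeline (resolve inclusions, expand abstract patterns, generate the SVRL-producing stylesheet). It assumes the zip was unpacked into iso-schematron/ and contains stylesheets with the names shown, so check them against your copy. XSLTPROC, ISO_SCH_DIR, and compile_schematron are hypothetical names.

```shell
# Sketch only: variable and function names are illustrative.
XSLTPROC="${XSLTPROC:-xsltproc}"
ISO_SCH_DIR="${ISO_SCH_DIR:-iso-schematron}"

# compile_schematron SCHEMA.sch OUTPUT.xsl
# Each stage reads the previous stage's output on stdin ("-").
compile_schematron() {
  "$XSLTPROC" "$ISO_SCH_DIR/iso_dsdl_include.xsl" "$1" |
    "$XSLTPROC" "$ISO_SCH_DIR/iso_abstract_expand.xsl" - |
    "$XSLTPROC" "$ISO_SCH_DIR/iso_svrl_for_xslt1.xsl" - > "$2"
}

# Example:
#   compile_schematron epub30-schemas/epub-nav-30.sch epub-nav-30.sch.xsl
```

The compiled stylesheet only needs to be regenerated when the .sch file changes.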

Now we can use epub-nav-30.sch.xsl on any EPUB Navigation Document:

<?xml version="1.0" standalone="yes"?>
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl" xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:schold="http://www.ascc.net/xml/schematron" xmlns:sch="http://www.ascc.net/xml/schematron" 
xmlns:iso="http://purl.oclc.org/dsdl/schematron" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" title="" schemaVersion="">
  <!--   
		   
		   
		 -->
  <svrl:ns-prefix-in-attribute-values uri="http://www.w3.org/1999/xhtml" prefix="html"/>
  <svrl:ns-prefix-in-attribute-values uri="http://www.idpf.org/2007/ops" prefix="epub"/>
  <svrl:active-pattern id="nav-ocurrence" name="nav-ocurrence"/>
  <svrl:fired-rule context="html:body"/>
   ... many more lines ...

…but we rarely want to read the SVRL it outputs directly (although sometimes it is worth it for the extra detail it contains), so we need to send it through another stylesheet (svrl_as_text.xsl from above) to get a human-readable output:
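The two transformations can be chained in one pipeline. This sketch assumes svrl_as_text.xsl sits in the current directory and that the compiled stylesheet is named epub-nav-30.sch.xsl as above; XSLTPROC and schematron_report are hypothetical names.

```shell
# Sketch only: variable and function names are illustrative.
XSLTPROC="${XSLTPROC:-xsltproc}"

# schematron_report COMPILED.sch.xsl DOCUMENT
# The first xsltproc call produces SVRL; the second turns it into
# one "FAILURE: ..." line per failed assertion or triggered report.
schematron_report() {
  "$XSLTPROC" "$1" "$2" | "$XSLTPROC" svrl_as_text.xsl -
}

# Example:
#   schematron_report epub-nav-30.sch.xsl bad.nav.html
```

Running only the first half of the pipeline gives the raw SVRL, which is still useful when you need the extra detail.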

These are completely new issues that jing could not catch. Note that the issue about the span is actually distinct from the one above, which said it was in the wrong place, whereas this says that it has the wrong content (none at all, in fact).

Unlike jing, we don’t get meaningful exit codes. Although that is not too hard to add, it’s slightly tricky to get all of the errors and exit codes rather than just exiting on the first one, which can make you think your document is less invalid than it really is. We still get no output for valid documents.
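On the exit-code point, one possible approach is to filter the text report through a small function that prints every line and only then decides the exit status. This is a sketch, not part of the original toolchain: fail_on_failures is a hypothetical name, and it relies on the "FAILURE:" prefix written by svrl_as_text.xsl.

```shell
# fail_on_failures: copy the text report from stdin to stdout unchanged,
# then return 1 if any line began with "FAILURE:", otherwise 0.
fail_on_failures() {
  awk '{ print } /^FAILURE:/ { failed = 1 } END { exit (failed ? 1 : 0) }'
}

# Example:
#   xsltproc epub-nav-30.sch.xsl bad.nav.html |
#     xsltproc svrl_as_text.xsl - |
#     fail_on_failures
#   echo "exit code: $?"
```

Because awk sets the status in its END block, all failures are reported before the non-zero exit, so nothing is hidden by stopping at the first one.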


I’m certain to have made lots of mistakes in the examples above. If you spot some, please let me know in the comments and I’ll correct the post.

5 Responses to “Validating EPUB 3 experiments”

  1. Nic Gibson

    Hey Keith, nice article. Have you thought about wrapping all of the code up in xproc? I’ve been thinking about doing something similar but time is not on my side.

  2. Keith Fahlgren

    @Nic It’s really easy to get XProc going for a yes/no validation scenario, but I had a very hard time getting it to do the sort of plain-text display of errors like the above.

  3. Romain Deltour

    Plain-text display is not easy with XProc because:
    1. the standard p:validate-with-relax-ng has a yes/no behavior, you cannot get a report out of it
    2. only XML can flow through an XProc pipeline

    A quick workaround to #1 is to call Jing as an external executable.

    For issue #2, you can either collect all the messages in a single XML document with only one root element, or you can use p:for-each and Calabash px:message extension step to report the message to the error stream.

    Here’s a sample XProc file that implements the latter approach:
    http://pastebin.com/uUThfsYm

    It requires the Jing jars to be added to a ‘lib’ directory and EPUB 3 schemas in ‘schemas’. Run with the following command line to filter Calabash logging statements:

    calabash epub3-validator.xpl doc=samples/sample-nav.xml 2>&1 | grep Message:

    Hope this helps ;)

  4. Nic Gibson

    @Keith and @Romain – the yes/no validation is a serious issue, I agree. I think it’s a major missed opportunity on the part of the XProc committee to have not defined an XML output from the validator steps. I’ve been looking at conversion of schemas to schematron for exactly that reason (Rick Jelliffe blogged about that some time ago iirc).

    I guess I was more thinking along the lines of using XProc to wrap up the various EPUB checking steps that don’t require validation – schematrons, link checking, etc.

  5. Dave Cramer

    Hi Keith,

    The Schematron tools included in the epub revision source seem to work, and provide text output for the Schematron tests. The error messages are pretty bad (no line numbers, etc.) but it’s an improvement over SVRL, I think… The process is:

    [1] schematronDispatcher.xsl
    [2] iso-schematron-abstract.xsl
    [3] iso-schematron-message.xsl

    So far this seems to work for all the Schematron files–epub-xhtml, epub-nav, package, and media-overlay.

    I may just upload the final files to the google code site. No one wants to run through this process for fun!