Schematron
Schematron takes a different approach from the schema languages we’ve seen so far. Instead of being prescriptive, as in “this element has the following content model,” it relies on a series of Boolean tests. Depending on the result of a test, the schema will output some predetermined message.
The tests are based on XPath, which is a very granular and exhaustive set of node examination tools. Relying on XPath is clever, taking much of the complexity out of the schema language. XPath, which is used in places such as XSLT and some implementations of DOM, can scratch an itch that more blunt tools like DTDs can’t reach. As the creator of Schematron, Rick Jelliffe, says, it’s like “a feather duster for the furthest corners of a room where the vacuum cleaner (DTD) cannot reach.”
Overview
The basic structure of a Schematron schema is this:
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern>
    <rule context="XPath Expression">
      <assert test="XPath Expression">message</assert>
      <report test="XPath Expression">message</report>
      ...more tests...
    </rule>
    ...more rules...
  </pattern>
  ...more patterns...
</schema>
A pattern in Schematron does not carry the same meaning as patterns in RELAX NG. Here, it’s just a logical grouping of rules. If your schema is testing books, one pattern may hold rules for chapters while another groups rules for appendixes. So think of this as more of a higher-level, conceptual testing pattern, rather than as a specific node-matching pattern.
The context for each test is determined by a rule. Its context attribute contains an XSLT pattern that matches nodes. Each node found becomes the context node, on which all tests inside the rule are applied.
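To make the grouping concrete, here is a minimal sketch of a schema with two patterns, one for chapters and one for appendixes. The book vocabulary (chapter, appendix, title, para), the pattern names, and the messages are all assumptions invented for illustration; only the Schematron elements and attributes themselves come from the structure shown above.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Hypothetical book vocabulary: chapter, appendix, title, para -->
  <pattern name="Chapters">
    <!-- Every chapter element in the document becomes a context node -->
    <rule context="chapter">
      <assert test="title">A chapter needs a title.</assert>
    </rule>
    <!-- A context may also be a path, so this matches titles of chapters only -->
    <rule context="chapter/title">
      <report test="string-length(.) > 60">This chapter title is rather long.</report>
    </rule>
  </pattern>
  <pattern name="Appendixes">
    <rule context="appendix">
      <assert test="para">An appendix should contain at least one paragraph.</assert>
    </rule>
  </pattern>
</schema>
```

Each test is evaluated relative to its context node, so test="title" asks whether the current chapter has a title child, not whether a title exists somewhere in the document.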
The children of a rule, report and assert, each apply a test to the context node. The test is another XPath expression, stored in a test attribute. report’s contents will be output if its XPath expression evaluates to “true.” assert is just the opposite, outputting its contents if its test evaluates to “false.”
XPath expressions are very good at describing XML nodes and reasonably good at matching text patterns. Here’s how you might test an email address:
<rule context="email">
  <p>Found an email address...</p>
  <assert test="contains(.,'@')">Error: no @ in email</assert>
  <assert test="contains(.,'.')">Error: no dot in email</assert>
  <report test="string-length(.)>20">Warning: email is unusually long</report>
</rule>
To summarize, running a Schematron validator on a document works like this. First, parse the document to build a document tree in memory. Then, for each rule, obtain a context node using its XPath locator expression. For each assert or report in the rule, evaluate the XPath expression for a Boolean value, and conditionally output text. The idea is that whenever something is found that is not right with the document, the Schematron processor should output a message to that effect. You can think of Schematron as a language for generating validation reports.
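As a worked illustration of those steps, consider the email rule shown earlier applied to the small document below. The contacts wrapper and the three addresses are hypothetical, and the comments assume the length test is written with XPath’s string-length() function. How the messages are laid out depends on the validator, so only the messages themselves are noted.

```xml
<!-- Hypothetical input document; each email element becomes a context node -->
<contacts>
  <!-- 15 characters, has an @ but no dot:
       the second assert fails, so "Error: no dot in email" is output -->
  <email>someone@example</email>

  <!-- 33 characters, has an @ and a dot:
       both asserts pass; the report test is true, so
       "Warning: email is unusually long" is output -->
  <email>a.very.long.name@mail.example.org</email>

  <!-- 6 characters, no @ and no dot:
       both asserts fail, so both error messages are output -->
  <email>nobody</email>
</contacts>
```

Note that a passing assert is silent, while a report speaks up only when its test succeeds. An address that satisfies both asserts and is 20 characters or shorter would produce no messages from these three tests.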
One interesting feature of Schematron is that its documentation is a part of the language itself. Rather than rely on comments or the namespace hack from RELAX NG, this language explicitly defines elements and attributes to hold commentary. The root element, schema, has an optional child title to name the schema, and pattern elements have a name attribute for identifying rule groups. A Schematron validator will use that attribute to label each pattern of testing in output. There is also a set of tags for formatting text, borrowed from HTML, such as p and span.
Let’s look at an example. Below is a schema to test a report document. There are two kinds of reports we allow: one with a body and another with a set of at least three sections.
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>Test: Report Document Validity</title>
  <pattern name="Type 1">
    <p>Type 1 reports should have a title and a body.</p>
    <rule context="/">
      <assert test="report">Wrong root element. This isn't a report.</assert>
    </rule>
    <rule context="report">
      <assert test="title">Darn! It's missing a title.</assert>
      <report test="title">Yup, found a title.</report>
      <assert test="body">Yikes! It's missing a body.</assert>
      <report test="body">Yup, found a body.</report>
    </rule>
  </pattern>
  <pattern name="Type 2">
    <p>Type 2 reports should have a title and <em>at least three</em> sections.</p>
    <rule context="/">
      <assert test="report">Wrong root element. This isn't a report.</assert>
    </rule>
    <rule context="report">
      <assert test="title">Darn! It's missing a title.</assert>
      <report test="title">Yup, found a title.</report>
      <assert test="count(section)>2">There are not enough section elements in this report.</assert>
      <report test="count(section)>2">Plenty of sections, so I'm happy.</report>
    </rule>
  </pattern>
</schema>
Now, let’s run the Schematron validator on this document:
<report>
  <title>A ridiculous report</title>
  <body>
    <para>Here's a paragraph.</para>
    <para>Here's a paragraph.</para>
  </body>
</report>
I used a version of Schematron that outputs its report in HTML form. Figure 4-1 shows how it looks in my browser.
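For readers without the figure at hand, the messages can also be worked out by reading the schema against the document (with every report element in the schema properly closed). The annotated copy of the document below is a hand trace under that assumption, not the validator’s actual HTML output, whose layout depends on the implementation.

```xml
<!-- Both patterns: the root node has a report child, so the
     "Wrong root element" asserts stay silent -->
<report>
  <!-- Type 1 and Type 2: test="title" is true, so each pattern
       outputs "Yup, found a title." -->
  <title>A ridiculous report</title>

  <!-- Type 1: test="body" is true, so "Yup, found a body." is output -->
  <body>
    <para>Here's a paragraph.</para>
    <para>Here's a paragraph.</para>
  </body>

  <!-- Type 2: there are no section children, so count(section)>2 is false
       and the assert outputs
       "There are not enough section elements in this report." -->
</report>
```

In other words, the document passes every test in the Type 1 pattern and fails exactly one assertion in the Type 2 pattern.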
Abstract Rules
An abstract rule allows you to reuse rules when they are likely to appear often in the schema. The syntax is the same, with the additional attribute abstract set to yes and an id with some unique value. Another rule will reference the id with a rule attribute in an extends child element. See the following example.
<rule id="inline" abstract="yes">
  <report test="*">Error! Element inside inline.</report>
  <assert test="text()">Strange, there's no text inside this inline.</assert>
</rule>
<rule context="bold">
  <extends rule="inline"/>
</rule>
<rule context="emphasis">
  <extends rule="inline"/>
</rule>
<rule context="quote">
  <extends rule="inline"/>
</rule>
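One way to read extends is as a copy of the abstract rule’s tests into the rule that references it. Under that reading, the rule for bold behaves like the sketch below, and the rules for emphasis and quote expand the same way. The second test is written as text() here, on the assumption that it is meant to look for text nodes rather than for a child element named text.

```xml
<!-- What the rule for bold amounts to once the abstract rule is pulled in -->
<rule context="bold">
  <!-- Fires when the bold element has any child element -->
  <report test="*">Error! Element inside inline.</report>
  <!-- Fires when the bold element has no text node children -->
  <assert test="text()">Strange, there's no text inside this inline.</assert>
</rule>
```

The abstract rule has no context of its own, so it never matches anything by itself; its tests only run through the concrete rules that extend it.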