When a schema is attached to a document, Word performs on-the-fly schema validation of the document’s embedded custom XML, visibly flagging errors as the user edits. However, since the custom XML tags are intertwined with WordprocessingML elements, Word first needs to strip out the Word-specific markup before it can validate the document. This is actually the same process—the “Save data only” process—that optionally occurs in step 3 of our processing model diagram (in Figure 4-7), when a user saves the document. What is not evident in that diagram is the fact that the “Save data only” process is also invoked repeatedly while the user is editing the document (during step 2). The difference here is that, rather than permanently stripping out the WordprocessingML markup, it does so temporarily just for the purpose of validation.
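The stripping step described above can be pictured with a short Python sketch. It is only a simplified illustration, not Word's actual implementation: it assumes that every element outside the custom schema's namespace is Word markup, that only text inside w:t elements is kept, and that the WordprocessingML namespace is the Word 2003 one. The sample document and the helper names (save_data_only, _copy_data, _append_text) are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML (Word 2003) namespace and the press release namespace.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
PR_NS = "http://xmlportfolio.com/pressRelease"


def _append_text(parent, text):
    """Append text at the current end of parent's content."""
    if len(parent):                       # after the last child element
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:                                 # no child elements yet
        parent.text = (parent.text or "") + text


def _copy_data(node, custom_ns, out_parent):
    """Copy custom elements and w:t text; unwrap everything else."""
    for child in node:
        if child.tag.startswith("{" + custom_ns + "}"):
            copy = ET.SubElement(out_parent, child.tag, dict(child.attrib))
            _copy_data(child, custom_ns, copy)
        elif child.tag == "{" + W_NS + "}t":
            _append_text(out_parent, child.text or "")
        else:
            # Assumed to be Word markup: drop the element, keep descending.
            _copy_data(child, custom_ns, out_parent)


def save_data_only(word_root, custom_ns):
    """Return the embedded custom XML root, or None if there is none."""
    holder = ET.Element("holder")         # temporary parent for the result
    _copy_data(word_root, custom_ns, holder)
    return holder[0] if len(holder) else None


# Hypothetical, heavily abbreviated Word document for illustration only.
SAMPLE = f"""<w:wordDocument xmlns:w="{W_NS}" xmlns:pr="{PR_NS}">
  <w:body>
    <pr:pressRelease>
      <w:p><w:r><w:t>Press Release</w:t></w:r></w:p>
      <pr:title><w:p><w:r><w:t>This is the Headline</w:t></w:r></w:p></pr:title>
      <w:p><w:r><w:t>-End-</w:t></w:r></w:p>
    </pr:pressRelease>
  </w:body>
</w:wordDocument>"""

data = save_data_only(ET.fromstring(SAMPLE), PR_NS)
print(ET.tostring(data, encoding="unicode"))
```

In this sketch the text of every w:t element survives, including text that sits directly inside pressRelease rather than inside a leaf element such as title.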
When Word strips out the WordprocessingML markup in order to validate the embedded XML document, by default it leaves all text content (inside w:t elements) intact. Our press release template, however, includes boilerplate text that is not actually part of our data. If this text is included in the remaining XML document, then it will be invalid according to the press release schema. Example 4-7 shows what a press release XML document would look like if all of the text remained intact after stripping out the WordprocessingML markup.
Example 4-7. An invalid press release document, including template boilerplate text
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?mso-application progid="Word.Document"?>
<pressRelease xmlns="http://xmlportfolio.com/pressRelease"><company><name>ACME Corp.</name><address><street>555 Market St.</street><city>Seattle</city>, <state>WA</state> <zip>98101</zip>Phone
<phone>222-222-2222</phone>Fax <fax>333-333-3333</fax></address></company>Press Release
<contact>Contact:
<firstName>John</firstName> <lastName>Doe</lastName>Phone:
<phone>444-444-4444</phone></contact>FOR IMMEDIATE RELEASE
<date>2004-01-23</date><title>This is the Headline</title><body><para>This is the lead-in, and this is not. The rest of the paragraph has no formatting either.This is the second paragraph. These are just regular Word paragraphs. They do not correspond to custom XML elements.</para></body>-End-
</pressRelease>
The segments of text in Example 4-7 that sit between elements rather than inside a leaf element, such as “Phone” and “FOR IMMEDIATE RELEASE”, are pieces of boilerplate text from the press release template. They are not supposed to be part of the data. Thus, merely stripping out the WordprocessingML markup is not sufficient. It is also necessary to strip out the boilerplate text. How is this done? Well, the boilerplate text in this example happens to represent the only mixed content text in the document, and Word happens to provide a document option called “Ignore mixed content.” By turning this option on, you can effectively strip out the boilerplate text in this and other similar examples, for the purpose of validation.
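As a rough illustration of what ignoring mixed content means, the following Python sketch drops any text that appears alongside child elements and keeps text only in elements that have no element children. This is an approximation for explanatory purposes, not Word's actual logic; the function name ignore_mixed_content is hypothetical, and the sample is the contact fragment from Example 4-7.

```python
import xml.etree.ElementTree as ET


def ignore_mixed_content(elem):
    """Remove, in place, text that is a sibling of child elements."""
    if len(elem):                    # elem has at least one child element
        elem.text = None             # text before the first child
        for child in elem:
            child.tail = None        # text following each child
            ignore_mixed_content(child)
    return elem


# The contact fragment from Example 4-7, boilerplate included.
SAMPLE = """<contact xmlns="http://xmlportfolio.com/pressRelease">Contact:
<firstName>John</firstName> <lastName>Doe</lastName>Phone:
<phone>444-444-4444</phone></contact>"""

contact = ignore_mixed_content(ET.fromstring(SAMPLE))
remaining_text = "".join(contact.itertext())
print(remaining_text)
```

After the call, the boilerplate “Contact:” and “Phone:” are gone, while the text inside firstName, lastName, and phone is untouched.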
The “Ignore mixed content” document option can be viewed as a parameter to the “Save data only” process. It affects both on-the-fly schema validation and the document saving process when the “Save data only” document option is turned on. (The precise behavior of this process is approximated using an XSLT stylesheet listed later in this chapter, under “The ‘Save Data Only’ Document Option”.)
In our press release template, the “Ignore mixed content” document option is turned on, but the “Save data only” document option is turned off. This means that mixed content text is stripped out for the purpose of on-the-fly schema validation, but it is not stripped out when the document is saved. (Instead, our press release template uses a custom onsave XSLT stylesheet applied directly to the merged XML and WordprocessingML representation.)
The “Ignore mixed content” document option is represented in WordprocessingML using the w:ignoreMixedContent element. Our press release application’s “Elegant” stylesheet, pr2word.xsl, turns the option on by generating a w:ignoreMixedContent element in the result document, just like this one:
<w:docPr>
  <!-- ... -->
  <w:ignoreMixedContent/>
  <!-- ... -->
</w:docPr>
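To check from outside Word whether a saved WordprocessingML file has this option set, a small Python sketch such as the following may help. It assumes the Word 2003 WordprocessingML namespace and that w:docPr is a direct child of the root w:wordDocument element. It only tests for the presence of the element and does not inspect any attributes on it. The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML (Word 2003) namespace, bound to the usual "w" prefix.
NS = {"w": "http://schemas.microsoft.com/office/word/2003/wordml"}


def has_ignore_mixed_content(word_xml):
    """True if w:docPr contains a w:ignoreMixedContent element."""
    root = ET.fromstring(word_xml)
    return root.find("w:docPr/w:ignoreMixedContent", NS) is not None


# Hypothetical, abbreviated documents for illustration only.
WITH_OPTION = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:docPr>
    <w:ignoreMixedContent/>
  </w:docPr>
  <w:body/>
</w:wordDocument>"""

WITHOUT_OPTION = WITH_OPTION.replace("<w:ignoreMixedContent/>", "")

print(has_ignore_mixed_content(WITH_OPTION))
print(has_ignore_mixed_content(WITHOUT_OPTION))
```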