Whitespace

How parsers treat whitespace is one of the most commonly misunderstood areas of XML processing. There are four basic rules you need to remember:

  1. All whitespace in element content is reported.

  2. Whitespace in attribute values is normalized.

  3. Whitespace in the prolog and epilog and within tags but outside attribute values is not reported.

  4. All non-escaped line breaks (carriage returns, line feeds, carriage return-line feed pairs, and, in XML 1.1, NEL and line separator) are converted to line feeds.
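
The rules above can be checked with a short script. The following is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module (a non-validating, expat-based parser) and a made-up note document that is not part of this chapter's examples. It uses carriage return-line feed pairs and extra spaces inside the tags to exercise rules 1 through 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: CR LF line endings, a line break inside an
# attribute value, and extra spaces inside the start- and end-tags.
doc = b'<note  source="line one\r\n  line two" >\r\n  <to>Ada</to>\r\n</note >'

root = ET.fromstring(doc)

# Rule 3: whitespace inside tags but outside attribute values is not reported.
tag_name = root.tag
attribute_names = list(root.attrib)

# Rules 2 and 4: the CR LF pair in the attribute value first becomes a
# line feed, and attribute value normalization then turns it into a space.
source = root.get("source")

# Rules 1 and 4: whitespace in element content is reported, with each
# CR LF pair converted to a single line feed.
leading_text = root.text        # text between <note ...> and <to>
trailing_text = root[0].tail    # text between </to> and </note>

print(repr(tag_name), attribute_names)
print(repr(source))             # 'line one   line two'
print(repr(leading_text), repr(trailing_text))
```

Because no DTD declares the source attribute, the parser treats it as CDATA: each whitespace character becomes one space, and runs of spaces are not collapsed.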

Consider Example 18-2.

Example 18-2. Various kinds of whitespace
<?xml version="1.0"?>
   
<!DOCTYPE person  SYSTEM "person.dtd ">
   
<person  source="Alan Turing: the Enigma, 
                  Andrew Hodges, 1983">
  <name>
    <first>Alan</first>
    <last>Turing</last>
  </name>
  <profession  id="p1" 
               value="computer  scientist "
               source="" />
  <profession  id="p2"
               value="mathematician"/>
  <profession  id="p3"
               value="cryptographer"/>
</person>

When a parser reads this document, it will report all the whitespace in the element content to the client application. This includes boundary whitespace like that between the <name> and <first> start-tags and the </last> and </name> end-tags. If the DTD says that the name element cannot contain mixed content, the whitespace is considered to be whitespace in element content, also called ignorable whitespace. However, the parser still reports it. The client application receiving the content from the parser may choose to ignore boundary whitespace, whether it’s ignorable or not, interpreting ...
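
To see boundary whitespace being reported, here is a minimal sketch that assumes Python's standard xml.sax module and feeds it only the name element from Example 18-2, without the DTD. The TextCollector class and its events list are illustrative names, not part of any library. Adjacent character chunks are merged because a SAX parser is free to split text across several characters() calls.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records tags and text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Merge with the previous chunk if the parser split the text.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + content)
        else:
            self.events.append(("text", content))


# The name element from Example 18-2, with its indentation preserved.
doc = b"<name>\n    <first>Alan</first>\n    <last>Turing</last>\n  </name>"

handler = TextCollector()
xml.sax.parseString(doc, handler)

# Boundary whitespace: text events that consist only of whitespace.
boundary = [text for kind, text in handler.events
            if kind == "text" and text.strip() == ""]

for event in handler.events:
    print(event)
```

The whitespace-only entries collected in boundary are exactly what a client application would have to filter out if it decided to ignore boundary whitespace; the parser itself does not drop them.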
