4. Using Predefined Simple Datatypes

Lexical and Value Spaces

W3C XML Schema introduced a decoupling between the data, as it can be read from the instance documents (the “lexical space”), and the value, as interpreted according to the datatype (the “value space”).

Before we can enter into the definition of these two spaces, we must examine the processing model and the transformations endured by a value written in a XML document before it is validated. Element and attribute content proceeds through the following steps during processing:

Serialization space: The series of bytes that is actually stored in a document (either as the value of an attribute or as a text node) may be seen as belonging to a first space, which we may call the “serialization space.”
Parsed space: The XML 1.0 Recommendation makes it clear that the serialization space is not directly meaningful to applications, and a first transformation is performed on the value by conforming XML parsers before the value reaches an application: characters are converted into Unicode, and ends of lines (for text nodes and attributes) and whitespaces (only for attributes) are normalized. The result of this transformation is what reaches the applications—including schema processors—and belongs to what we may call the “parsed space.”
Lexical space: Before doing any validation, W3C XML Schema performs a second round of whitespace processing on this value reported by the XML parser. This depends on the value’s datatype and may either ignore, normalize, or collapse the whitespaces. The value after this whitespace processing belongs to the “lexical space” defined in the W3C XML Schema Recommendation.
Value space: W3C XML Schema considers an item from the lexical space to be a representation of an abstract value whose meaning or semantic is defined by its datatype and can be a piece of text, and also a number, a date, or qualified name. The ensemble of abstract values is defined as the “value space.”

Each datatype has its own lexical and value spaces and its own rules to associate a lexical representation with a value; for many datatypes, a single value can have multiple lexical representations (for instance, the <xs:float> value “3.14116” can also be written equivalently as “03.14116,” “3.141160,” or “.314116E1”). This distinction is important since the basic operations performed on the values (such as equality testing or sorting) are done on the value space. “3.14116” is considered to be equal to “03.14116” when the type is xs:float and is different when the type is xs:string . The same applies to sort orders: some datatypes have a full order relation (every pair of values can be compared), others have no order relation at all, and the remaining types have a partial order relation (values cannot always be compared).

Tip

Although future versions of APIs might send these values to the applications, the transformations between parsed, lexical, and value spaces are currently done for the sake of the validation only, and don’t impact the values sent by a validating parser.

Whitespace Processing

The handling of special characters (tab, linefeeds, carriage returns and spaces, which are often used only to “pretty print” XML documents) has always been very controversial. W3C XML Schema has imposed a two-step generic algorithm, which is applied to most of the predefined datatypes (actually, on all of them except two, xs:string and xs:normalizedString).

Whitespace replacement: This is the first step of whitespace processing applied to the parsed value. During whitespace replacement, all occurrences of any whitespace—#x9 (tab), #xA (linefeed), and #xD (carriage return)—are replaced with a space (#x20). The number of characters is not changed by this step, which is applied to all the predefined datatypes (except for xs:string , since no whitespace replacement is performed on the parsed value for this).
Whitespace collapse: The second step removes the leading and trailing spaces, and replaces all contiguous occurrences of spaces by a single space character. This is applied on all the predefined datatypes (except for xs:string , since no whitespace replacement is performed on the parsed value for this, and for xs:normalizedString , in which whitespaces are only normalized).

Tip

This notion of “normalized string” does not match the XPath function normalize-space( ), which corresponds with what W3C XML Schema calls whitespace collapsing. It is also different from the DOM normalize() method, which is a merge of adjacent text objects.

String Datatypes

This section discusses datatypes derived from the xs:string primitive datatype as well as other datatypes that have a similar behavior (namely, xs:hexBinary , xs:base64Binary , xs:anyURI , xs:QName , and xs:NOTATION ). These types are not expected to carry any quantifiable value (W3C XML Schema doesn’t even expect to be able to sort them) and their value space is identical to their lexical space except when explicitly described otherwise. One should note that even though they are grouped in this section because they have a similar behavior, these primitive datatypes are considered quite different by the Recommendation.

The datatypes covered in this section are shown in Figure 4-2.

Figure 4-2. Strings and similar datatypes

The two exceptions in whitespace processing ( xs:string and xs:normalizedString ) are string datatypes. One of the main differences between these types is the applied whitespace processing. To stress this difference, we will classify these types by their whitespace processing.

No Whitespace Replacement

xs:string

This string datatype is the only predefined datatype for which no whitespace replacement is performed. As we will see in the next chapter, the whitespace replacement performed on user-defined datatypes derived from this type can be defined without restriction. On the other hand, a user datatype cannot be defined as having no whitespace replacement if it is derived from any predefined datatype other than xs:string .

As expected, a string is a set of characters matching the definition given by XML 1.0, namely, “legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.”

The value of the following element:

<title lang="en">
   Being a Dog Is 
   a Full-Time Job
</title>

is the full string:

Being a Dog Is 
a Full-Time Job

with all its tabs, and CR/LF if the title element is a type xs:string .

Normalized Strings

xs:normalizedString

The normalized string is the only predefined datatype in which whitespace replacement is performed without collapsing.

The lexical space of xs:normalizedString is the same as the lexical space of xs:string from which it is derived—except that since any occurrence of #x9 (tab), #xA (linefeed), and #xD (carriage return) are replaced by a #x20 (space), these three characters cannot be found in its lexical and value spaces.

The value of the same element:

<title lang="en">
   Being a Dog Is 
   a Full-Time Job
</title>

is now the string:

    Being a Dog Is     a Full-Time Job

in which all the whitespaces have been replaced by spaces if the title element is a type xs:normalizedString .

Tip

There is no additional constraint on normalized strings. Any value that is a valid as a xs:string is also valid xs:normalizedString but its tabs, linefeed and CR will be replaces by spaces. The difference is the whitespace processing that is applied when the lexical value is calculated.

Collapsed Strings

Whitespace collapsing is performed after whitespace replacement by trimming the leading and trailing spaces and replacing all the contiguous occurrences of spaces with a single space. All the predefined datatypes (except, as we have seen, xs:string and xs:normalizedString ) are whitespace collapsed.

We will classify tokens, binary formats, URIs, qualified names, notations, and all their derived types under this category. Although these datatypes share a number of properties, we must stress again that this categorization is done for the purpose of explanation and does not directly appear in the Recommendation.

Tokenss

xs:token

xs:token is xs:normalizedString on which the whitespaces have been collapsed. Since whitespaces are accepted in the lexical space of xs:token , this type is better described as a " tokenized” string than as a “token”!

The same element:

<title lang="en">
   Being a Dog Is 
   a Full-Time Job
</title>

is still a valid xs:token , and its value is now the string:

Being a Dog Is a Full-Time Job

in which all the whitespaces have been replaced by spaces, any trailing spaces are removed, and contiguous sequences of spaces are replaced by single spaces.

Tip

As is the case with xs:normalizedString , there is no constraint on xs:token , and any value that is a valid xs:string is also a valid xs:token . The difference is the whitespace processing that is applied when the lexical value is calculated. This is not true of derived datatypes that have additional constraints on their lexical and value space. The restriction on the lexical spaces of xs:normalizedString is, therefore, a restriction by projection of their parsed space (different values of their parsed space are transformed into a single value of their lexical space), and not a restriction by invalidating values of their lexical space, as is the case for all the other predefined datatypes.

The predefined datatypes derived from xs:token are xs:language , xs:NMTOKEN , and xs:Name .

xs:language

This was created to accept all the language codes standardized by RFC 1766. Some valid values for this datatype are en, en-US, fr, or fr-FR.

xs:NMTOKEN

This corresponds to the XML 1.0 “Nmtoken” (Name token) production, which is a single token (a set of characters without spaces) composed of characters allowed in XML name. Some valid values for this datatype are "Snoopy", "CMS", "1950-10-04", or "0836217462". Invalid values include "brought classical music to the Peanuts strip" (spaces are forbidden) or "bold,brash" (commas are forbidden).

xs:Name

This is similar to xs:NMTOKEN with the additional restriction that the values must start with a letter or the characters “:” or “-”. This datatype conforms to the XML 1.0 definition of a “Name.” Some valid values for this datatype are Snoopy, CMS, or -1950-10-04-10:00. Invalid values include 0836217462 ( xs:Name cannot start with a number) or bold,brash (commas are forbidden). This datatype should not be used for names that may be “qualified” by a namespace prefix, since we will see another datatype ( xs:QName ) that has a specific semantic for these values.The datatype xs:NCName is derived from xs:Name .

xs:NCName

This is the "noncolonized name” defined by Namespaces in XML1.0, i.e., a xs:Name without any colons (“:”). As such, this datatype is probably the predefined datatype that is closest to the notion of a “name” in most of the programming languages, even though some characters such as “-” or “.” may still be a problem in many cases. Some valid values for this datatype are Snoopy, CMS, -1950-10-04-10-00, or 1950-10-04. Invalid values include -1950-10-04:10-00 or bold:brash (colons are forbidden). xs:ID , xs:IDREF , and xs:ENTITY are derived from xs:NCName .

xs:ID

This is derived from xs:NCName . The one constraint added to its value space is that there must not be any duplicate values in a document. In other words, the values of attributes or simple type elements having this datatype can be used as unique identifiers, and this datatype emulates the XML 1.0 ID attribute type. We will see this feature in more detail in Chapter 9.

xs:IDREF

This is derived from xs:NCName . The constraint added to its value space is it must match an ID defined in the same document. I will explain this feature in more detail in Chapter 9.

xs:ENTITY

Also provided for compatibility with XML 1.0 DTDs, this is derived from xs:NCName and must match an unparsed entity defined in a DTD.

Tip

XML 1.0 gives the following definition of unparsed entities: “an unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.” In practice, this mechanism has seldom been used, as the general usage is to define links to the resources that could be defined as unparsed entities.

Qualified names

xs:QName

Following Namespaces in XML 1.0, xs:QName supports the use of namespace-prefixed names. A namespace prefix xs:QName treats a shortcut to identify a URI. Each xs:QName effectively contains a set of tuples {namespace name, local part}, in which the namespace name is the URI associated to the prefix through a namespace declaration. Even though the lexical space of xs:QName is very close to the lexical space of xs:Name (the only constraint on the lexical space is that there is a maximum of one colon allowed in an xs:QName , which cannot be the first character), the value spaces of these datatypes are completely different (a scalar for xs:Name and a tuple for xs:QName ) and xs:QName is defined as a primitive datatype. The constraint added by this datatype over an xs:Name is the prefix must be defined as a namespace prefix in the scope of the element in which this datatype is used.

W3C XML Schema itself has already given us some examples of QNames. When we write <xs:attribute name="lang" type="xs:language"/>, the type attribute is an xs:QName and its value is the tuple:

{"http://www.w3.org/2001/XMLSchema", "language"}

because the URI:

"http://www.w3.org/2001/XMLSchema"

was assigned to the prefix "xs:“. If there is no namespace declaration for this prefix, the type attribute is considered invalid.

The prefix of an xs:QName is optional. We are also able to write:

<xs:element ref="book" maxOccurs="unbounded"/>

in which the ref attribute is also a xs:QName and its value the tuple:

{NULL, "book"}

because we haven’t defined any default namespace. xs:QName does support default namespaces; if a default namespace is defined in the scope of this element, the value of its URI is used for this tuple.

URIs

xs:anyURI

This is another string datatype in which lexical and value spaces are different. This datatype tries to compensate for the differences of format between XML and URIs as specified in the RFCs 2396 and 2732. These RFCs are not very friendly toward non-ASCII characters and require many character escapings that are not necessary in XML. The W3C XML Schema Recommendation doesn’t describe the transformation to perform, noting only that it is similar to what is described for XLink link locators.

As an example of this transformation, the href attribute of an XHTML link written as:

<a href="http://dmoz.org/World/Français/">
  Word/Français
</a>

would be converted to the value:

http://dmoz.org/World/Fran%c3%a7ais/

in the value space.

The xs:anyURI datatype doesn’t pay any attention to xml:base attributes that may have been defined in the document.

Notations

xs:NOTATION: This is probably the most obscure of these string datatypes. This datatype was created to implement the XML 1.0 notations. It cannot be used directly in a schema; it must be used through user-defined derived datatypes. We will see more of it in the next chapter.

Binary string-encoded datatypes

XML 1.0 is unable to hold binary content, which must be string-encoded before it can be included in a XML document. W3C XML Schema has defined two primary datatypes to support two encodings that are commonly used (hexBinary and base64). These encodings may be used to include any binary content, including text formats whose content may be incompatible with the XML markup. Other binary text encodings may also be used (such as uuXXcode, Quote Printable, BinHex, aencode, or base85, to name a few), but their value would not be recognized by W3C XML Schema.

xs:hexBinary

This defines a simple way to code binary content as a character string by translating the value of each binary octet into two hexadecimal digits. This encoding is different from the encoding method called BinHex (introduced by Apple, described by RFC 1741, and includes a mechanism to compress repetitive characters).

A UTF-8 XML header such as:

<?xml version="1.0" encoding="UTF-8"?>

that is encoded as xs:hexBinary would be:

3f3c6d78206c657673726f693d6e3122302e20226e656f636964676e223d54552d4622383e3f

xs:base64Binary

This matches the encoding known as “base64” and is described in RFC 2045. It maps groups of 6 bits into an array of 64 printable characters.

The same header encoded as xs:base64Binary would be:

PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCg==

The W3C XML Schema Recommendation missed the fact that RFC 2045 requests a line break every 76 characters. This should be clarified in an errata. The consequence of these line breaks being thought of as optional by W3C XML Schema, is that the lexical and value spaces of xs:base64Binary cannot be considered identical.

Numeric Datatypes

The numeric datatypes are built on top of four primitive datatypes: xs:decimal for all the decimal types (including the integer datatypes, considered decimals without a fractional part), xs:double and xs:float for single and double precision floats, and xs:boolean for Booleans. Whitespaces are collapsed for all these datatypes.

The datatypes covered in this section are shown in Figure 4-3.

Figure 4-3. Numeric datatypes

Decimal Types

All decimal types are derived from the xs:decimal primary type and constitute a set of predefined types that address the most common usages.

xs:decimal

This datatype represents the decimal numbers. The number of digits can be arbitrarily long (the datatype doesn’t impose any restriction), but obviously, since a XML document has an arbitrary but finite length, the number of digits of the lexical representation of a xs:decimal value needs to be finite. Although the number of digits is not limited, we will see in the next chapter how the author of a schema can derive user-defined datatypes with a limited number of digits if needed.

Leading and trailing zeros are not significant and may be trimmed. The decimal separator is always a dot (“.”); a leading sign (“+” or “-”) may be specified and any characters other than the 10 digits (including whitespaces) are forbidden. Scientific notation (“E+2”) is also forbidden and has been reserved to the float datatypes only.

Valid values for xs:decimal include:

123.456

+1234.456

-1234.456

-.456

-456

The following values are invalid:

1 234.456 (spaces are forbidden)

1234.456E+2 (scientific notation (“E+2”) is forbidden)

+ 1234.456 (spaces are forbidden)

+1,234.456 (delimiters between thousands are forbidden)

xs:integer is the only datatype directly derived from xs:decimal .

xs:integer

This integer datatype is a subset of xs:decimal , representing numbers which don’t have any fractional digits in its lexical or value spaces. The characters that are accepted are reduced to 10 digits and an optional leading sign. Like its base datatype, xs:integer doesn’t impose any limitation on the number of digits, and leading zeros are not significant.

Valid values for xs:integer include:

123456

+00000012

-1

-456

The following values are invalid:

1234 (spaces are forbidden)

1. (the decimal separator is forbidden)

+1,234 (delimiters between thousands are forbidden).

xs:integer has given birth to three derived datatypes: xs:nonPositiveInteger and xs:nonNegativeInteger (which have still an unlimited length) and xs:long (to fit in a 64-bit word).

xs:nonPositiveInteger and xs:negativeInteger

The W3C XML Schema Working Group thought that it would be more clear that the value “0” was included if they used litotes as names, and used xs:nonPositiveInteger if the integers are negative or null. xs:negativeInteger is derived from xs:nonPositiveInteger to represent the integers that are strictly negative. These two datatypes allow integers of arbitrary length.

xs:nonNegativeInteger and xs:positiveInteger

Similarly, xs:nonNegativeInteger is the integers that are positive or equal to zero and xs:positiveInteger is derived from this type. The “unsigned” family branch ( xs:unsignedLong , xs:unsignedInt , xs:unsignedShort , and xs:unsignedByte ) is also derived from xs:nonNegativeInteger .

xs:long , xs:int , xs:short , and xs:byte .

The datatypes we have seen up to now have an unconstrained length. This approach isn’t very microprocessor-friendly. This subfamily represents signed integers that can fit into 8, 16, 32, and 64-bit words. xs:long is defined as all of the integers between -9223372036854775808 and 9223372036854775807, i.e., the values that can be stored in a 64-bit word. The same process is applied again to derive xs:int with a range between -2147483648 and 2147483647 (32 bits), to derive xs:short with a range between -32768 and 32767 (16 bits), and to derive xs:byte with a range between -128 and 127 (8 bits).

xs:unsignedLong , xs:unsignedInt , xs:unsignedShort , and xs:unsignedByte .

The last of the predefined integer datatypes is the subfamily of unsigned (i.e., positive) integers that can fit into 8, 16, 32, and 64-bit words. xs:unsignedLong is defined as the integers in a range between 0 and 18446744073709551615, i.e., the values that can be stored in a 64-bit word. The same process is applied again to derive xs:unsignedInt with a range between 0 and 4294967295 (32 bits), to derive xs:unsignedShort with a range between 0 and 65535 (16 bits), and to derive xs:unsignedByte with a range between 0 and 255 (8 bits).

Float Datatypes

xs:float and xs:double

xs:float and xs:double are both primitive datatypes and represent IEEE simple (32 bits) and double (64 bits) precision floating-point types. These store the values in the form of mantissa and an exponent of a power of 2 (m x 2^e), allowing a large scale of numbers in a storage that has a fixed length. Fortunately, the lexical space doesn’t require that we use powers of 2 (in fact, it doesn’t accept powers of 2), but instead lets us use a traditional scientific notation with integer powers of 10. Since the value spaces (powers of 2) don’t exactly match the values from the lexical space (powers of 10), the recommendation specifies that the closest value is taken. The consequence of this approximate matching is that float datatypes are the domain of approximation; most of the float values can’t be considered exact, and are approximate.

These datatypes accept several “special” values: positive zero (0), negative zero (-0) (which is greater than positive 0 but less than any negative value), infinity (INF) (which is greater than any value), negative infinity (-INF) (which is less than any float, and “not a number” (NaN).

Valid values for xs:float and xs:double include:

123.456

+1234.456

-1.2344e56

-.45E-6

INF

-INF

NaN

The following values are invalid:

1234.4E 56 (spaces are forbidden)

1E+2.5 (the power of 10 must be an integer)

+INF (positive infinity doesn’t expect a sign)

NAN (capitalization matters in special values)

`xs:boolean`

xs:boolean: This is a primitive datatype that can take the values true and false (or 1 and 0).

Date and Time Datatypes

The datatypes covered in this section are shown in Figure 4-4.

Figure 4-4. Date and time datatypes

The Realm of ISO 8601

The W3C Recommendation, “XML Schema Part 2: Datatypes,” provides new confirmation of how difficult it is to fix time.

The support for date and time datatypes relies entirely on a subset of the ISO 8601 standard, which is the only format supported by W3C XML Schema. The purpose of ISO 8601 is to eliminate the risk of confusion between the various date and time formats used in different countries. In other words, W3C XML Schema does not support these local date and time formats, and imposes the usage of ISO 8601 for any datatype that has the semantic of a date or time. While this is a good thing for interchange formats, this is more questionable when XML is used to define user interfaces, since we will see that ISO 8601 is not very user friendly. The variations using the names of the months or different orders between year, month, and day are not the only victims of this decision: ISO 8601 imposes the usage of the Gregorian (Christian) calendar to the exclusion of calendars used by other cultures or religions.

ISO 8601 describes several formats to define date, times, periods, and recurring dates, with different levels of precision and indetermination. After many discussions, W3C XML Schema selected a subset of these formats and created a primitive datatype for each format that is supported.

The indeterminacy allowed in some of these formats adds a lot of difficulty, especially when comparisons or arithmetic are involved. For instance, it is possible to define a point in time without specifying the time zone, which is then considered undetermined. This undetermined time zone is identical all over the document (and between the schema and the instance documents) and it’s not an issue to compare two datetimes without a time zone. The problem arises when you need to compare two points in time, one with a time zone and the other without. The result of this comparison will be undetermined if these values are too close, since one of them may be between -13 hours and +12 hours of Coordinated Universal Time (UTC). Thus, the support of these datetime datatypes introduces a notion of “partial order relation.”

Another caveat with ISO 8601 is that time zones are only supported through the time difference from UTC, which ignores the notion of summer time. For instance, if an application working in British Summer Time (BST) wants to specify the time zone—and we have seen that this is necessary to be able to compare datetimes—the application needs to know if a date is in summer (the time zone will be one hour after UTC) or in winter (the time zone would then be UTC). ISO 8601 ignores the “named time zones” using the summer saving times (such as PST, BST, or WET) that we use in our day-to-day life; ignoring the time zones can be seen as a somewhat dangerous shortcut to specify that a datetime is on your “local time,” whatever it is.

Datatypes

Point in time: xs:dateTime

The xs:dateTime datatype defines a “specific instant of time.” This is a subset of what ISO 8601 calls a “moment of time.” Its lexical value follows the format “CCYY-MM-DDThh:mm:ss,” in which all the fields must be present and may optionally be preceded by a sign and leading figures, if needed, and followed by fractional digits for the seconds and a time zone. The time zone may be specified using the letter “Z,” which identifies UTC, or by the difference of time with UTC.

Tip

The value space of xs:dateTime is considered to be the moment of time itself. The time zone that defines the value (when there is one) is considered meaningless, which is a problem for some applications that complain that even though 2002-01-18T12:00:00+00:00 and 2002-01-18T11:00:00-01:00 refer to the same “moment of time,” they carry different time zone information, which should make its way into the value space.

Valid values for xs:dateTime include:

2001-10-26T21:32:52

2001-10-26T21:32:52+02:00

2001-10-26T19:32:52Z

2001-10-26T19:32:52+00:00

-2001-10-26T21:32:52

2001-10-26T21:32:52.12679

The following values are invalid:

2001-10-26 (all the parts must be specified)

2001-10-26T21:32 (all the parts must be specified)

2001-10-26T25:32:52+02:00 (the hours part (25) is out of range)

01-10-26T21:32 (all the parts must be specified)

In the valid examples given above, three of them have identical value spaces:

2001-10-26T21:32:52+02:00

2001-10-26T19:32:52Z

2001-10-26T19:32:52+00:00

The first one (2001-10-26T21:32:52), which doesn’t include a time zone specification, is considered to have an indeterminate value between 2001-10-26T21:32:52-14:00 and 2001-10-26T21:32:52+14:00. With the usage of summer saving time, this range is subject to national regulations and may change. The range was between -13:00 and +12:00 when the Recommendation was published, but the Working Group has kept a margin to accommodate possible changes in the regulations.

Despite the indeterminacy of the time zone when none is specified, the W3C XML Schema Recommendation considers that the values of datetimes without time zones implicitly refer to the same undetermined time zone and can be compared between them. While this is fine for “local” applications that operate in a single time zone, this is a source of potential confusion and errors for world-wide applications or even for applications that calculate a duration between moments belonging to different time saving seasons within a single time zone.

Periods of time: xs:date , xs:gYearMonth and xs:gYear .

The lexical space of xs:date datatype is identical to the date part of xs:dateTime . Like xs:dateTime , it includes a time zone that should always be specified to be able to compare two dates without ambiguity. As defined per W3C XML Schema, a date is a period one day in its time zone, “independent of how many hours this day has.” The consequence of this definition is that two dates defined in a different time zone cannot be equal except if they designate the same interval (2001-10-26+12:00 and 2001-10-25-12:00, for instance). Another consequence is that, like with xs:dateTime , the order relation between a date with a time zone and a date without a time zone is partial.

Valid values for xs:date include:

2001-10-26

2001-10-26+02:00

2001-10-26Z

2001-10-26+00:00

-2001-10-26

-20000-04-01

The following values are invalid:

2001-10 (all the parts must be specified)

2001-10-32 (the days part (32) is out of range)

2001-13-26+02:00 (the month part (13) is out of range)

01-10-26 (the century part is missing)

xs:date represents a day identified by a Gregorian calendar date (and could have been called "gYearMonthDay“). xs:gYearMonth (“g” for Gregorian) is a Gregorian calendar month and xs:gYear is a Gregorian calendar year. These three datatypes are fixed periods of time and optional time zones may be specified for each of them. The only differences between them really are their length (1 day, 1 month, and 1 year) and their format (i.e., their lexical spaces).

The format of xs:gYearMonth is the format of xs:date without the day part. Valid values for xs:gYearMonth include:

2001-10

2001-10+02:00

2001-10Z

2001-10+00:00

-2001-10

-20000-04

The following values are invalid:

2001 (the month part is missing)

2001-13 (the month part is out of range)

2001-13-26+02:00 (the month part is out of range)

01-10 (the century part is missing)

The format of xs:gYear is the format of xs:gYearMonth without the month part. Valid values for xs:gYear include:

2001

2001+02:00

2001Z

2001+00:00

-2001

-20000

The following values are invalid:

01 (the century part is missing)

2001-13 (the month part is out of range)

This support of time periods is very restrictive: these periods can only match the Gregorian calendar day, month, or year, and cannot have an arbitrary length or start time.

Recurring point in time: xs:time

The lexical space of xs:time is identical to the time part of xs:dateTime . The semantic of xs:time represents a point in time that recurs every day; the meaning of 01:20:15 is “the point in time recurring each day at 01:20:15 am.” Like xs:date and xs:dateTime , xs:time accepts an optional time zone definition. The same issue arises when comparing times with and without time zones.

Note

Despite the fact that: 01:20:15 is commonly used to represent a duration of 1 hour, 20 minutes, and 15 seconds, a different format has been chosen to represent a duration.

Valid values for xs:time include:

21:32:52

21:32:52+02:00

19:32:52Z

19:32:52+00:00

21:32:52.12679

The following values are invalid:

21:32 (all the parts must be specified)

25:25:10 (the hour part is out of range)

-10:00:00 (the hour part is out of range)

1:20:10 (all the digits must be supplied)

This support of a recurring point in time is also very limited: the recursion period must be a Gregorian calendar day and cannot be arbitrary.

Recurring period of time: xs:gDay , xs:gMonth , and xs:gMonthDay .

We have already seen points in times and periods, as well as recurring points in time. This wouldn’t be complete without a description of recurring periods. W3C XML Schema supports three predefined recurring periods corresponding to Gregorian calendar months (recurring every year) and days (recurring each month or year). The support of recurring periods is restricted both in terms of recursion (the recursion period can only be a Gregorian calendar year or month) and period (the start time can only be a Gregorian calendar day or month, and the duration can only be a Gregorian calendar month or day).

xs:gDay is a period of a Gregorian calendar day recurring each Gregorian calendar month. The lexical representation of xs:gDay is ---DD with an optional time zone specification. Valid values for xs:gDay include:

---01

---01Z

---01+02:00

---01-04:00

---15

---31

The following values are invalid:

--30- (the format must be "---DD“)

---35 (the day is out of range)

---5 (all the digits must be supplied)

15 (missing the leading "---“)

The rules of arithmetic between dates and durations apply in this case, and days are “pinned” in the range for each month. In our example, --31, the selected dates will be January 31st, February 28th (or 29th), March 31st, April 30th, etc.

xs:gMonthDay is a period of a Gregorian calendar day recurring each Gregorian calendar year. The lexical representation of xs:gMonthDay is --MM-DD with an optional time zone specification. Valid values for xs:gMonthDay include:

--05-01

--11-01Z

--11-01+02:00

--11-01-04:00

--11-15

--02-29

The following values are invalid:

-01-30- (the format must be --MM-DD)

--01-35 (the day part is out of range)

--1-5 (one part is missing)

01-15 (the leading -- is missing)

xs:gMonth is a period of a Gregorian calendar month recurring each Gregorian calendar year. The lexical representation of xs:gMonth defined in the Recommendation is --MM-- with an optional time zone specification. The W3C XML Schema Working Group has acknowledged that this was an error and that the format --MM defined by ISO 8061 should be used instead. It has not been decided yet if the format described in the Recommendation will be forbidden or only deprecated, but it is advised to use the format --MM (assuming that the tools you are using already support it). Valid values for xs:gMonth include:

--05

--11Z

--11+02:00

--11-04:00

--02

The following values are invalid:

-01- (the format must be --MM)

--13 (the month is out of range)

--1 (both digits must be provided)

01 (the leading -- is missing)

xs:duration

Naive programmers who think that the concept of duration is simple should read the Recommendation, which states: xs:duration is defined as a six-dimensional space!” Mathematicians would object that this is not absolutely true since most of the axes of these dimension are parallel, but the fact is that when these programmers say that a development will last one month and 3 days, they define a duration that is comprised of between 31 and 34 days. The attempt of W3C XML Schema to deal with these issues on top of ISO 8601 has introduced a degree of indeterminacy in the comparisons between durations.

The lexical space of xs:duration is the format defined by ISO 8601 under the form PnYnMnDTnHnMnS, in which the capital letters are delimiters that can be omitted when the corresponding member is not used. An important difference with the format used for xs:dateTime is none of these members are mandatory and none of them are restricted to a range. This gives flexibility to choose the units that will be used and to combine several of them—for instance, P1Y2MT123S (1 year, 2 months, and 123 seconds). This flexibility has a price; such a duration is not completely defined: a year may have 365 or 366 days, and a period of two months lasts between 59 and 62 days. Durations cannot always be compared and the order between durations is partial. We will see, in the next chapter, that user-defined datatypes can be derived from xs:duration , which can restrict the components used to express durations and insure that these indeterminations do not happen.

Since the value of a duration is fixed as soon as you give it a starting point, the schema Working Group has identified four datetimes:

1696-09-01T00:00:00Z

1697-02-01T00:00:00Z

1903-03-01T00:00:00Z

1903-07-01T00:00:00Z

These cause the greatest deviations when durations mixing day, month, and other components are added. The Working Group has determined that the comparison of durations is undefined if—and only if—the result of the comparison is different when each of these dates is used as a starting point.

Valid values for xs:duration include:

PT1004199059S

PT130S

PT2M10S

P1DT2S

-P1Y

P1Y2M3DT5H20M30.123S

The following values are invalid:

1Y (the leading P is missing)

P1S (the T separator is missing)

P-1Y (all parts must be positive)

P1M2Y (the parts order is significant and Y must precede M)

P1Y-1M (all parts must be positive)

List Types

These datatypes are lists of whitespace-separated items. The type of these items (called the item type) is defined during the derivation process (which we will see in the next chapter) and list datatypes can be derived from any simple type. Three predefined datatypes are lists ( xs:NMTOKENS , xs:IDREFS , and xs:ENTITIES ). For all the list datatypes, the items must be separated by one or more whitespaces.

xs:NMTOKENS: This is a whitespace-separated list of xs:NMTOKEN . Each item of the list must be in the lexical space of xs:NMTOKEN .
xs:IDREFS: This is a whitespace-separated list of xs:IDREF . Each item of the list must be in the lexical space of xs:IDREF and must reference an existing xs:ID in the same document.
xs:ENTITIES: This is a whitespace-separated list of xs:ENTITY . Each item of the list must be in the lexical space of xs:ENTITY and must match an unparsed entity defined in a DTD.

What About anySimpleType?

We have now covered all the predefined datatypes except one, which is an atypical type: anySimpleType. This datatype is a kind of wildcard, which means, as expected, that any value is accepted and doesn’t add any constraint on the lexical space.

anySimpleType has two other characteristics that make it unique among simple types: users’ simple types cannot be derived from it and its properties, and its canonical form is not defined in the Recommendation! These characteristics make it a type that should be avoided, except when the rules of a derivation (which we will see in the next chapter) require its usage.

Back to Our Library

If we look back with a critical eye at our library, we see we used the following simple datatypes:

<xs:element name="name" type="xs:string"/>
      
<xs:element name="qualification" type="xs:string"/>
      
<xs:element name="born" type="xs:date"/>
      
<xs:element name="dead" type="xs:date"/>
      
<xs:element name="isbn" type="xs:string"/>
      
<xs:attribute name="id" type="xs:ID"/>
      
<xs:attribute name="available" type="xs:boolean"/>
      
<xs:attribute name="lang" type="xs:language"/>

We are lucky that the elements born and dead are ISO 8601 dates. The ISBN number is composed of numeric digits and a final character which can be either a digit or the letter “x"-and is therefore represented as a string. We also did a good job with the datatypes for the id, available and lang attributes, but the choice of xs:string for the elements name and qualification is more controversial. They appear in the instance document as:

<name>
  Charles M Schulz
</name>
                      .../...

<qualification>
  bold, brash and tomboyish
</qualification>

This formatting suggests that whitespaces are probably not significant and should be collapsed. This can be done by choosing the datatype xs:token instead of xs:string ; the same applies to the title element, which is a simple content derived from xs:string that would be better derived from xs:token . This change will not have any impact on the validation with our schema, but the document is more precisely described and future derivations would be more easily built on xs:token than on xs:string . The other datatype that could have been chosen better is isbn, which can be represented as xs:NMTOKEN. The new schema would then be:

<?xml version="1.0"?> 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:token"/>
  <xs:element name="qualification" type="xs:token"/>
  <xs:element name="born" type="xs:date"/>
  <xs:element name="dead" type="xs:date"/>
  <xs:element name="isbn" type="xs:NMTOKEN"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="xs:language"/>
  <xs:element name="title">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:token">
          <xs:attribute ref="lang"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="author">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="dead" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="isbn"/>
        <xs:element ref="title"/>
        <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/> 
        <xs:element ref="character" minOccurs="0"
          maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
      <xs:attribute ref="available"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="character">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="qualification"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

XML Schema by Eric van der Vlist

Chapter 4. Using Predefined Simple Datatypes

Lexical and Value Spaces

Tip

Whitespace Processing

Tip

String Datatypes

No Whitespace Replacement

Normalized Strings

Tip

Collapsed Strings

Tokenss

Tip

Tip

Qualified names

URIs

Notations

Binary string-encoded datatypes

Numeric Datatypes

Decimal Types

Float Datatypes

`xs:boolean`

Date and Time Datatypes

The Realm of ISO 8601

Datatypes

Tip

Note

List Types

What About anySimpleType?

Back to Our Library

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly