Thursday, February 26, 2009

XML Schema whiteSpace and the token Data Type

In developing some new XML Schema documents, we came across an unexpected relationship between the XML Schema token datatype and the rules for processing whitespace.

In section 3.3.2 of the XML Schema Part 2: Datatypes Second Edition document, token is defined as follows:

[Definition:] token represents tokenized strings. The •value space• of token is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces. The •lexical space• of token is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces. The •base type• of token is normalizedString.

Based on this paragraph, we assumed any validator (XSV, for example) would report an instance error if the value of a token element contained a carriage return, line feed, or tab character. We also assumed a validator would report an instance error if the value of a token element contained any leading or trailing spaces or any internal sequences of two or more spaces. These assumptions turned out to be incorrect, as the following example demonstrates.

Given the following XML Schema declaration:

<xs:element type="xs:token" name="XmlToken" />

Validators consider each of the following instances valid:

<XmlToken>
  Token
</XmlToken>
<XmlToken>   Token   </XmlToken>
<XmlToken>
  Token1        Token2 
  Token3
  Token4 Token5 Token6
</XmlToken>

The reason these are considered valid token values has to do with the whitespace processing rules. Section 4.3.6 (whiteSpace) of the XML Schema Part 2: Datatypes Second Edition document defines a whiteSpace facet that, for a token, normalizes any carriage return (#xD), line feed (#xA) and tab (#x9) characters appearing in the instance before the value is checked against the type:

[Definition:] whiteSpace constrains the •value space• of types •derived• from string such that the various behaviors specified in Attribute Value Normalization in [XML 1.0 (Second Edition)] are realized. The value of whiteSpace must be one of {preserve, replace, collapse}.

preserve
No normalization is done, the value is not changed (this is the behavior required by [XML 1.0 (Second Edition)] for element content)

replace
All occurrences of #x9 (tab), #xA (line feed) and #xD (carriage return) are replaced with #x20 (space) 

collapse
After the processing implied by replace, contiguous sequences of #x20's are collapsed to a single #x20, and leading and trailing #x20's are removed.

whiteSpace is applicable to all •atomic• and •list• datatypes. For all •atomic• datatypes other than string (and types •derived• by •restriction• from it) the value of whiteSpace is collapse and cannot be changed by a schema author; for string the value of whiteSpace is preserve; for any type •derived• by •restriction• from string the value of whiteSpace can be any of the three legal values. For all datatypes •derived• by •list• the value of whiteSpace is collapse and cannot be changed by a schema author.

Since the whiteSpace value for a token is collapse, every tab, line feed, and carriage return character is first replaced with a space character, then contiguous space characters are collapsed into a single space character, and finally all leading and trailing space characters are removed. All of this happens before the value is validated against the token type.
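
To make the order of operations concrete, here is a minimal Python sketch of the replace and collapse rules quoted above, applied to the third XmlToken example. It is only an illustration, not how XSV or any other validator is implemented; the function names ws_replace, ws_collapse, and is_token_value are hypothetical names of our own.

```python
import re


def ws_replace(text: str) -> str:
    """whiteSpace = replace: turn each tab, line feed, and carriage return into a space."""
    return re.sub(r"[\t\n\r]", " ", text)


def ws_collapse(text: str) -> str:
    """whiteSpace = collapse: apply replace, then squeeze runs of spaces and trim the ends."""
    replaced = ws_replace(text)
    squeezed = re.sub(r" {2,}", " ", replaced)
    return squeezed.strip(" ")


def is_token_value(text: str) -> bool:
    """Check the constraints from the token definition (hypothetical helper)."""
    return (
        re.search(r"[\t\n\r]", text) is None  # no tab, line feed, or carriage return
        and not text.startswith(" ")          # no leading space
        and not text.endswith(" ")            # no trailing space
        and "  " not in text                  # no run of two or more spaces
    )


# Character content of the third XmlToken example, as it appears in the instance.
raw = "\n  Token1        Token2 \n  Token3\n  Token4 Token5 Token6\n"
normalized = ws_collapse(raw)

print(is_token_value(raw))         # the raw text breaks the token constraints
print(repr(normalized))
print(is_token_value(normalized))  # the normalized value satisfies them
```

The raw character content fails the token constraints, but the normalized value passes them, and it is the normalized value that gets checked against the type.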

Since all the offending characters are replaced or removed before the value is validated, the content of a token element in an instance document can actually contain carriage return, line feed, and tab characters, even though they are forbidden by section 3.3.2 (token) of the XML Schema Part 2: Datatypes Second Edition document.