Whitespace in xsd:string, xsd:normalizedString, xsd:token

The XML Schema Recommendation is clear about the definition of xs:string, xs:normalizedString and xs:token but not about their use. The spec and most books about XML Schema fail to tell us that these data types are mostly a show of intent.



1. Whitespace in string types

The string data type represents character strings in XML and we are allowed to use the following whitespace characters: tab (#x9), line feed (#xA), carriage return (#xD) and space (#x20).

Note that other space-like characters, such as the non-breaking space (#xA0), are not considered whitespace in this context.

NormalizedString is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters. NormalizedString can only contain the whitespace character for a space, #x20.

Token is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces.
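The two definitions above can be read as simple membership tests. The following Python sketch is only an illustration of those definitions; the function names is_normalized_string and is_token are my own and do not come from any XML library.

```python
def is_normalized_string(value: str) -> bool:
    """True if value contains no tab, line feed or carriage return."""
    return not any(ch in "\t\n\r" for ch in value)


def is_token(value: str) -> bool:
    """True if value also has no leading, trailing or doubled spaces."""
    return (
        is_normalized_string(value)
        and not value.startswith(" ")
        and not value.endswith(" ")
        and "  " not in value
    )


sample = " The\tgreat  whitespace\r\ntest. "
print(is_normalized_string(sample))                            # False
print(is_normalized_string(" The great  whitespace test. "))   # True
print(is_token(" The great  whitespace test. "))               # False
print(is_token("The great whitespace test."))                  # True
```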

2. Validation doesn't care

To our great surprise, elements and attributes declared as xs:normalizedString and xs:token accept all the whitespace characters we stuff into them, and xs:token also accepts leading and trailing spaces and internal sequences of two or more spaces. [1]


2.1 Let a test convince you

I have made an XML Schema schema, spaces.xsd, and an XML instance document, spaces.xml.

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="testdoc">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="alpha" type="xs:string"/>
        <xs:element name="bravo" type="xs:normalizedString"/>
        <xs:element name="charlie" type="xs:token"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

All three elements of the XML instance document, alpha, bravo and charlie, contain the same cocktail of all sorts of whitespace, consecutive spaces and leading and trailing space. The elements are of data type xsd:string, xsd:normalizedString and xsd:token respectively. The document is damned valid; no errors are reported.

<?xml version="1.0"?>
<testdoc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="spaces.xsd">
  <alpha>&#x20;The&#x9;great&#x20;&#x20;whitespace&#xd;&#xa;test.&#x20;</alpha>
  <bravo>&#x20;The&#x9;great&#x20;&#x20;whitespace&#xd;&#xa;test.&#x20;</bravo>
  <charlie>&#x20;The&#x9;great&#x20;&#x20;whitespace&#xd;&#xa;test.&#x20;</charlie>
</testdoc>
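To see which characters a parser actually hands on from such an instance document, here is a small sketch using Python's standard xml.etree.ElementTree. It does not validate against spaces.xsd (the standard library has no XML Schema validator), and the document is trimmed to the three elements; it only shows that the character references arrive as real whitespace characters.

```python
import xml.etree.ElementTree as ET

# Same content as in the test elements, written with character references.
content = "&#x20;The&#x9;great&#x20;&#x20;whitespace&#xd;&#xa;test.&#x20;"
doc = (
    "<testdoc>"
    f"<alpha>{content}</alpha>"
    f"<bravo>{content}</bravo>"
    f"<charlie>{content}</charlie>"
    "</testdoc>"
)

root = ET.fromstring(doc)
for element in root:
    # repr() makes the otherwise invisible whitespace visible.
    print(element.tag, repr(element.text))
# alpha ' The\tgreat  whitespace\r\ntest. '
# (bravo and charlie print the same text)
```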

3. A show of intent

By using xs:normalizedString you are telling software making use of the XML document that it is allowed to replace TAB, CARRIAGE RETURN and LINEFEED characters with space characters.

By using xs:token you are telling software making use of the XML document that it is also allowed to delete leading and trailing whitespace and allowed to reduce all sequences of two or more spaces to one.
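The two permissions described above correspond to what the XML Schema Recommendation calls "replace" and "collapse". A minimal Python sketch of both steps follows; the helper names replace_ws and collapse_ws are hypothetical and chosen only for this illustration.

```python
import re


def replace_ws(value: str) -> str:
    """Replace each tab, line feed and carriage return with a space."""
    return re.sub(r"[\t\n\r]", " ", value)


def collapse_ws(value: str) -> str:
    """Replace, then reduce runs of spaces to one and trim both ends."""
    return re.sub(r" {2,}", " ", replace_ws(value)).strip(" ")


raw = " The\tgreat  whitespace\r\ntest. "
print(repr(replace_ws(raw)))   # ' The great  whitespace  test. '
print(repr(collapse_ws(raw)))  # 'The great whitespace test.'
```

Note that "replace" keeps the length of the string unchanged, which is why the carriage return and line feed pair turns into two spaces.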

4. Leave it to machines

When we type in data it is easy to make mistakes: here and there an extra space or two between words, and leading and trailing whitespace. Tabs, carriage returns and/or line feeds can easily make it into the text.

The problem is not that mistakes occur but that they are extremely difficult to spot and correct by hand. Whitespace characters are hidden in most views. If xs:normalizedString and xs:token returned validation errors, the poor user would have a hell of a time finding those whitespace characters and correcting them.


5. Schema-aware XSLT 2.0

XSLT 2.0 processors come in two kinds: a basic processor and a schema-aware processor. The SAXON processor comes in 8SA (SA = schema aware) and 8B (B = basic) versions at the time of writing. XMLSpy 2007 only exists in the schema-aware version.

If we have an XML document with data types of xs:normalizedString and xs:token declared in a schema, and the input document links to that schema, then the XSLT processor gets the message. The intent signalled by normalizedString and token is picked up and action is taken by the processor.

If xs:normalizedString or xs:token is used, all whitespace characters other than space characters are replaced by space characters. For xs:token, leading and trailing whitespace is also removed, and every sequence of spaces is reduced to a single space. [2]

I have made an XSLT 2.0 stylesheet, spaces.xsl, for a schema-aware processor. If you use the small XML instance document from before as the input file, you will see that the whitespace in the output file is replaced and collapsed in accordance with the data types declared in the schema.

6. Understanding XML Schema

Now that we have a better understanding of what xs:normalizedString and xs:token are all about, it is much easier to comprehend the whiteSpace facet's values of "preserve", "replace" and "collapse" in the XML Schema Recommendation.

It is also a good time to re-read White Space Normalization during Validation. Now it makes sense. NormalizedString and token can only contain those character strings defined at the top of this article. If the wrong whitespace appears, it is replaced or removed (collapsed) during validation so that no errors are reported. But it is not replaced or removed from the document.

NormalizedString and token are not that special when we come to think about it. All xsd datatypes except xsd:string (whiteSpace="preserve") and xsd:normalizedString (whiteSpace="replace") have whiteSpace="collapse" as a value that cannot be changed.

Even xsd:integer or xsd:dateTime validate when the data contains leading and trailing whitespace like tabs and line feeds. The xsd types are designed to be able to contain illegal whitespace characters! If we cannot live with them we must use a tool like a schema-aware XSLT 2.0 processor and tidy things up. [3]
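A rough Python sketch of that behaviour for xsd:integer, assuming a simplified pattern for the integer's lexical form. The names INTEGER, collapse_ws and is_valid_integer are illustrative assumptions, and this is not how any particular validator is implemented. The whitespace is collapsed on a copy of the value before the check, and the original string is left untouched.

```python
import re

# Simplified lexical form of xsd:integer: optional sign followed by digits.
INTEGER = re.compile(r"[+-]?[0-9]+")


def collapse_ws(value: str) -> str:
    """Turn every run of whitespace into one space, then trim both ends."""
    return re.sub(r"[\t\n\r ]+", " ", value).strip(" ")


def is_valid_integer(raw: str) -> bool:
    """Check the collapsed copy; the raw value itself is never modified."""
    return INTEGER.fullmatch(collapse_ws(raw)) is not None


raw = "\t\n 42 \r\n"
print(is_valid_integer(raw))    # True
print(is_valid_integer("4 2"))  # False: whitespace inside the number remains
print(repr(raw))                # '\t\n 42 \r\n'
```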

Footnotes

[1]

xsd and xs are more or less equally common as prefixes for the XML Schema namespace. I use both in this article, not to be inconsistent but to be more search engine friendly. Some of you might search for xs:token, others for xsd:token.

[2]

The XSLT 2.0 schema-aware processor only tidies data up, removing illegal whitespace characters, if we use <xsl:value-of>. If we use the identity template with <xsl:copy>, everything is copied over as it is. See my question and the answers at the XSL-mailing list.

[3]

We have all sorts of methods to clean up data. We can use functions in XSLT 1.0 or in other programming languages. The nice thing is that we can now normalize whitespace automatically with schema-aware software.

Updated 2009-08-06