Jesper Tverskov, March 15, 2011

Understanding xml:space

The xml:space="preserve" attribute is common in some XML documents. But what the attribute means is obscured by the fact that it is often used for no good reason. It could be there in an element in the source code because some developer inserted it as an experiment and forgot to delete it again.

1. XML standard and xml:space

The xml:space attribute is defined in the XML standard (a W3C Recommendation), in the section on white space handling. Like xml:lang, the only other attribute defined in the standard, it is only a signal of intent. The xml:space attribute matters only if the software consuming the XML document recognizes it and acts upon it. This is often not the case.

The xml:space attribute shares another important fact with the xml:lang attribute. If a document is going to be validated with a schema processor, the attribute must be declared in the schema. The same is true for a DTD.

2. The "default" value

The xml:space attribute can only have two values, "default" and "preserve". Since the "default" value in most situations acts as if the attribute is not used, it is seldom used. Here is the exception: the xml:space="preserve" attribute applies to the element where it is declared and to all descendants of that element, that is, to its children and their children. For that reason, xml:space="default" makes sense if we want to overrule what a descendant element has inherited from an ancestor element.
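
The inheritance rule can be sketched in a few lines of Python. This is only an illustration, not how any particular XML tool works: the function name effective_space and the sample element names are made up for the example, and we assume Python's standard xml.etree.ElementTree module, which exposes xml:space under its namespace-qualified name.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root):
    """Return {tag: effective xml:space value} for root and all descendants."""
    result = {}

    def walk(elem, inherited):
        # An explicit attribute wins; otherwise the ancestor's value applies.
        value = elem.get(XML_SPACE, inherited)
        result[elem.tag] = value
        for child in elem:
            walk(child, value)

    walk(root, "default")
    return result

# Hypothetical document: "c" overrules what it would inherit from "a".
doc = ET.fromstring(
    '<a xml:space="preserve"><b><c xml:space="default"><d/></c></b><e/></a>'
)
modes = effective_space(doc)
print(modes)
```

Here "b" and "e" inherit "preserve" from "a", while "c" switches back to "default" and passes that on to "d".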

3. Whitespace only text nodes

The spec says that "the value 'preserve' indicates the intent that applications preserve all the white space". What whitespace? If we read the spec carefully there is no doubt that we are not talking about whitespace in general but only about what is called "whitespace only text nodes".

Whitespace only text nodes are typically indentation between elements, made up of linefeed, carriage return, tab and space characters. A text node is a whitespace only text node only if it contains nothing but such characters. If just one non-whitespace character is present in the text node, it is no longer a whitespace only text node, and the whitespace inside it is ordinary text.

Take a look at the following example:

  <root xml:space="preserve">
      <test> This is     great. </test>
  </root>

If we transform the above document and use the normalize-space() function, the leading and trailing whitespace inside the "test" element will be removed, and the consecutive spaces between "is" and "great" will be reduced to just one space character. "xml:space" is not about whitespace in general.
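
To see the effect without an XSLT processor, here is a rough Python stand-in for normalize-space(). It is a sketch under assumptions: the helper name normalize_space is ours, and Python's str.split() treats a few more characters as whitespace than XPath does, which makes no difference for this document.

```python
import xml.etree.ElementTree as ET

def normalize_space(text):
    """Approximate XPath normalize-space() for plain strings."""
    # split() without arguments drops leading and trailing whitespace and
    # splits on runs of whitespace; join() puts one space between the words.
    return " ".join(text.split())

doc = ET.fromstring(
    '<root xml:space="preserve">\n'
    '    <test> This is     great. </test>\n'
    '</root>'
)
raw = doc.find("test").text
normalized = normalize_space(raw)
print(repr(raw))
print(repr(normalized))
```

The xml:space attribute on "root" is parsed like any other attribute, but it does nothing to stop the string from being normalized.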

But indentation in the form of linefeed, carriage return, tab and space between elements, with no other text around, will be preserved, like the whitespace between the "root" and the "test" element. Even if an XSLT stylesheet says <xsl:strip-space elements="*"/> and <xsl:output indent="no"/>, the whitespace only text nodes in the above document will be preserved if we use an XSLT processor like Saxon.
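
The distinction can be made concrete by listing the text nodes of the document and classifying each one. The following is a minimal sketch, assuming Python's xml.etree.ElementTree; the helper names is_whitespace_only and text_nodes are hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

# The four characters XML counts as white space.
XML_WHITESPACE = " \t\r\n"

def is_whitespace_only(text):
    """True for a non-empty string made up solely of XML whitespace."""
    return bool(text) and all(ch in XML_WHITESPACE for ch in text)

def text_nodes(elem):
    """Yield every text node below elem in document order."""
    if elem.text:
        yield elem.text
    for child in elem:
        yield from text_nodes(child)
        # ElementTree keeps the text that follows a child in child.tail.
        if child.tail:
            yield child.tail

doc = ET.fromstring(
    '<root xml:space="preserve">\n'
    '    <test> This is     great. </test>\n'
    '</root>'
)
report = [(text, is_whitespace_only(text)) for text in text_nodes(doc)]
for text, ws_only in report:
    print(repr(text), ws_only)
```

Only the two indentation nodes around the "test" element qualify. The text inside "test" contains spaces but is not a whitespace only text node.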

4. A poem as example

In the poem below xml:space does not help us. If the "p" element is manipulated with XML tools using functions like normalize-space(), we are likely to end up with a poem in one long line. The whitespace inside the "p" element, before and after the stanza lines, does not form whitespace only text nodes.

  <p xml:space="preserve">
     If I ventured in the slipstream
     Between the viaducts of your dream
     Where immobile steel rims crack
     And the ditch in the back roads stop
     Could you find me?
     Would you kiss-a my eyes?
     To lay me down
     In silence easy
     To be born again
     To be born again
  </p>

In the example above the whitespace characters are next to other text characters. For that reason the whitespace does not form whitespace only text nodes. In the next example xml:space="preserve" works as expected (if supported):

  <p xml:space="preserve">
     <span>If I ventured in the slipstream</span>
     <span>Between the viaducts of your dream</span>
     <span>Where immobile steel rims crack</span>
     <span>And the ditch in the back roads stop</span>
     <span>Could you find me?</span>
     <span>Would you kiss-a my eyes?</span>
     <span>To lay me down</span>
     <span>In silence easy</span>
     <span>To be born again</span>
     <span>To be born again</span>
  </p>

In the above example xml:space="preserve" works as expected and we don't need "br" elements after the "span" elements. The text nodes inside the "p" element and outside the "span" elements are whitespace only text nodes. The whitespace is not mixed with, or next to, other text characters.
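
The difference between the two versions of the poem can be checked by counting the whitespace only text nodes directly inside "p". This is a sketch with a shortened, two-line poem; the function name whitespace_only_count is made up for the example, and Python's xml.etree.ElementTree is assumed.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"

def whitespace_only_count(elem):
    """Count the text nodes directly inside elem that are whitespace only."""
    # The direct text nodes are elem.text plus the tail after each child.
    pieces = [elem.text] + [child.tail for child in elem]
    return sum(1 for p in pieces if p and not p.strip(XML_WHITESPACE))

plain = ET.fromstring(
    '<p xml:space="preserve">\n'
    '   To lay me down\n'
    '   In silence easy\n'
    '</p>'
)
marked = ET.fromstring(
    '<p xml:space="preserve">\n'
    '   <span>To lay me down</span>\n'
    '   <span>In silence easy</span>\n'
    '</p>'
)
plain_count = whitespace_only_count(plain)
marked_count = whitespace_only_count(marked)
print(plain_count, marked_count)
```

Without "span" elements the whole poem is one text node mixing words and whitespace, so there is nothing for xml:space to protect. With "span" elements, each line break sits in its own whitespace only text node.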

5. Mixed content and xml:space

Whitespace only text nodes are almost always safe to add or to delete. You can add indentation and you can take it away. But in mixed content, whitespace only text nodes often matter. Take a look at the following example:

  <p>I <b>love</b> <i>Mozart</i>.</p>

Inside the "p" element we have two spaces. A space between "I" and "<b>" and a space between "</b>" and "<i>". Only the last space is a whitespace only text node. But in this case we can't just remove it as we please. If we do, we remove an important space between two words.

To prevent such a whitespace from being removed we can add xml:space to the "p" element:

  <p xml:space="preserve">I <b>love</b> <i>Mozart</i>.</p>

In XSLT we have another option: <xsl:preserve-space elements="p"/>.
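
What goes wrong in mixed content, and how xml:space helps, can be sketched with a small whitespace stripper that honors the attribute. This is an assumption-laden illustration and not the algorithm of any real processor: strip_whitespace_only and text_after_stripping are hypothetical helpers built on Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
XML_WHITESPACE = " \t\r\n"

def strip_whitespace_only(elem, inherited="default"):
    """Remove whitespace only text nodes unless xml:space says preserve."""
    mode = elem.get(XML_SPACE, inherited)
    if mode != "preserve":
        if elem.text and not elem.text.strip(XML_WHITESPACE):
            elem.text = None
        for child in elem:
            # A child's tail is a text node of elem, so elem's mode applies.
            if child.tail and not child.tail.strip(XML_WHITESPACE):
                child.tail = None
    for child in elem:
        strip_whitespace_only(child, mode)

def text_after_stripping(source):
    """Parse source, strip, and return the concatenated text content."""
    elem = ET.fromstring(source)
    strip_whitespace_only(elem)
    return "".join(elem.itertext())

stripped_text = text_after_stripping(
    '<p>I <b>love</b> <i>Mozart</i>.</p>'
)
preserved_text = text_after_stripping(
    '<p xml:space="preserve">I <b>love</b> <i>Mozart</i>.</p>'
)
print(stripped_text)
print(preserved_text)
```

Without the attribute the space between the "b" and "i" elements disappears and two words run together. The space after "I" survives in both cases because it shares a text node with a non-whitespace character.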

6. xsl:strip-space, xsl:preserve-space

In XSLT we have two elements made for dealing with whitespace only text nodes, xsl:strip-space and xsl:preserve-space. The first is often used like this: <xsl:strip-space elements="*"/>, meaning that whitespace only text nodes must be deleted from all elements. In xsl:preserve-space we can specify a list of exception elements where whitespace only text nodes must be preserved, e.g.: <xsl:preserve-space elements="p td"/>, meaning that whitespace only text nodes must be preserved inside "p" and "td" elements.

It is all about whitespace only text nodes. The xml:space="preserve" attribute is trying to solve the same problem as xsl:preserve-space.
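
The combination of <xsl:strip-space elements="*"/> with a list of exceptions can be imitated in Python to show what the two declarations amount to. This is a simplified sketch, not an XSLT implementation: the function strip_space and the "doc" and "note" elements are invented for the example, xml:space is deliberately ignored here, and the exception applies only to the named element itself, not to its descendants.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"

def strip_space(elem, preserve=frozenset()):
    """Strip whitespace only text nodes everywhere except in named elements."""
    if elem.tag not in preserve:
        if elem.text and not elem.text.strip(XML_WHITESPACE):
            elem.text = None
        for child in elem:
            # The tail after a child is a text node of elem, not of the child.
            if child.tail and not child.tail.strip(XML_WHITESPACE):
                child.tail = None
    for child in elem:
        strip_space(child, preserve)

doc = ET.fromstring(
    '<doc>\n'
    '  <p>I <b>love</b> <i>Mozart</i>.</p>\n'
    '  <note><b>a</b> <b>b</b></note>\n'
    '</doc>'
)
# Roughly: strip-space elements="*" plus preserve-space elements="p".
strip_space(doc, preserve={"p"})
text = "".join(doc.itertext())
print(text)
```

The indentation and the space inside "note" are gone, while the space between the two words inside "p" is kept.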

One XSLT processor, AltovaXML, used in XMLSpy, has not implemented xsl:strip-space and xsl:preserve-space. In XMLSpy, whitespace only text nodes are always stripped, making that processor a bad choice if you work a lot with mixed content. The "xml:space" attribute could be considered an XMLSpy developer's best friend!

7. Is xml:space supported?

A web browser is likely not to support xml:space. It is not even supported in a dedicated XML editor like XMLSpy 2010: there you can pretty-print any XML document, whether or not it uses xml:space. Other XML editors, like Oxygen and Stylus Studio 2010, have implemented xml:space. General programming code manipulating XML is most likely unaware of xml:space. XSLT processors like Saxon and AltovaXML support it.

8. How hard can it be

Just a last example to show how difficult it can be to get xml:space right. In the XML schema behind WordprocessingML, as created by MS Word 2003 when saving to XML, some MS developer has forgotten to delete an xml:space="preserve" attribute on the top element of the schema, which is itself an XML document.

When the schema is opened in XML editors recognizing xml:space, it is impossible to pretty-print the schema. Without pretty-printing, the schema does not make sense to the human eye. The trick is to delete the xml:space attribute before pretty-printing!
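
The same trick can be scripted. The sketch below uses a tiny, made-up stand-in for the schema, since the real one is not reproduced here, and assumes Python 3.9 or later, where xml.etree.ElementTree provides indent(). ElementTree itself pays no attention to xml:space, so removing the attribute only matters for tools that do.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Hypothetical one-line schema fragment carrying a leftover xml:space.
schema = ET.fromstring(
    '<schema xml:space="preserve">'
    '<element name="a"/><element name="b"/>'
    '</schema>'
)

# Delete the attribute from the top element, if it is there.
schema.attrib.pop(XML_SPACE, None)

# Add indentation in place (available from Python 3.9).
ET.indent(schema, space="  ")
pretty = ET.tostring(schema, encoding="unicode")
print(pretty)
```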

Updated: 2011-08-04