Jesper Tverskov, March 15, 2011
The xml:space="preserve" attribute is common in some XML documents, but what the attribute means is obscured by the fact that it is often used for no good reason. It may be present on an element in the source code only because some developer inserted it as an experiment and forgot to delete it again.
The xml:space attribute is defined in the XML standard (W3C Recommendation), in the section on white space handling. Like the only other attribute defined in the standard, xml:lang, it is only a signal of intent. The xml:space attribute only matters if the software making use of the XML document recognizes it and acts upon it. This is often not the case.
The xml:space attribute shares another important fact with the xml:lang attribute: if a document is going to be validated with a schema processor, the attribute must be declared in the schema. The same is true for a DTD.
The xml:space attribute can only have two values, "default" and "preserve". Since the "default" value in most situations acts as if the attribute were not used, it is seldom used. Here is the exception: the xml:space="preserve" attribute applies to the element where it is declared and to all descendants of that element, that is, to its children and their children. For that reason, xml:space="default" makes sense if we want to override what a descendant element has inherited from an ancestor element.
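The inheritance rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the document and its element names ("poem", "line", "note", "em") are hypothetical, and the helper effective_xml_space() is not part of any library. It walks from an element up through its ancestors and reports the nearest xml:space value.

```python
from xml.dom import minidom


def effective_xml_space(element):
    """Return the xml:space value in effect for a minidom element.

    The nearest declaration on the element itself or on an ancestor wins;
    if no declaration is found, the result is "default".
    """
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        if node.hasAttribute("xml:space"):
            return node.getAttribute("xml:space")
        node = node.parentNode
    return "default"


# Hypothetical document: "poem" switches preservation on,
# "note" switches it off again for itself and its descendants.
doc = minidom.parseString(
    '<root><poem xml:space="preserve"><line/>'
    '<note xml:space="default"><em/></note></poem></root>'
)

results = {
    name: effective_xml_space(doc.getElementsByTagName(name)[0])
    for name in ("root", "poem", "line", "note", "em")
}
print(results)
```

Here "line" inherits "preserve" from "poem", while "note" and its child "em" fall back to "default".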
The spec says that "the value 'preserve' indicates the intent that applications preserve all the white space". What whitespace? If we read the spec carefully, there is no doubt that we are not talking about whitespace in general but only about what are called "whitespace-only text nodes".
Whitespace-only text nodes are typically the indentation between elements: linefeed, carriage return, tab and space characters. Such characters only form a whitespace-only text node if they are not mixed with, or adjacent to, other text. If just one non-whitespace character is present in the same text node, the node is no longer a whitespace-only text node.
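To make the distinction concrete, here is a minimal Python sketch that lists the text nodes of a small document and marks which ones are whitespace-only. The sample document and the helper names text_nodes() and is_whitespace_only() are assumptions made for illustration only.

```python
from xml.dom import minidom

# Hypothetical sample: indentation between elements, plus text with extra spaces.
SAMPLE = "<root>\n  <test>  XML  is   great  </test>\n</root>"


def text_nodes(node):
    """Yield every text node below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def is_whitespace_only(text_node):
    # XML white space is limited to space, tab, carriage return and linefeed.
    return text_node.data.strip(" \t\r\n") == ""


doc = minidom.parseString(SAMPLE)
report = [
    (t.parentNode.tagName, t.data, is_whitespace_only(t))
    for t in text_nodes(doc.documentElement)
]
for parent, data, ws_only in report:
    print(parent, repr(data), ws_only)
```

The two text nodes that are children of "root" contain nothing but indentation and are whitespace-only. The text node inside "test" is not, even though it starts and ends with spaces.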
Take a look at the following example:
If we transform the above document and use the normalize-space() function, the leading and trailing whitespace inside the "test" element will be removed, and the consecutive spaces between "is" and "great" will be reduced to just one space character. "xml:space" is not about whitespace in general.
But indentation in the form of linefeed, carriage return, tab and space between elements, with no other text around, will be preserved, like the whitespace between the "root" and the "test" element. Even if we in an XSLT stylesheet say <xsl:strip-space elements="*"/> and <xsl:output indent="no"/>, the whitespace-only text nodes in the above document will be preserved if we use an XSLT processor like Saxon.
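The effect of normalize-space() can be imitated outside XSLT. The following Python sketch is an approximation under assumptions: the sample document is hypothetical, and normalize_space() is a hand-written stand-in for the XPath function, not a library call. It shows that the string is rewritten even though xml:space="preserve" is in effect, because the function works on the text value itself.

```python
import re
from xml.dom import minidom

# Hypothetical sample with xml:space="preserve" declared on the root element.
SAMPLE = '<root xml:space="preserve">\n  <test>  XML  is   great  </test>\n</root>'


def normalize_space(value):
    """Stand-in for XPath normalize-space(): collapse runs, then trim both ends."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(" ")


doc = minidom.parseString(SAMPLE)
test = doc.getElementsByTagName("test")[0]

raw = test.firstChild.data
normalized = normalize_space(raw)

print(repr(raw))
print(repr(normalized))
```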
In the poem below xml:space does not help us. If the "p" element is manipulated with XML tools using functions like normalize-space(), we are likely to end up with a poem in one long line. The whitespace inside the "p" element and outside the stanza lines does not consist of whitespace-only text nodes.
In the example above the whitespace characters are next to other text characters. For that reason the whitespace does not form whitespace-only text nodes. In the next example xml:space="preserve" works as expected (if supported):
In the above example xml:space="preserve" works as expected and we don't need "br" elements after the "span" elements. The text nodes inside the "p" element and outside the "span" elements are whitespace-only text nodes: the whitespace is neither mixed with nor adjacent to other text characters.
Whitespace-only text nodes are almost always safe to add or to delete. You can add indentation and you can take it away. But in mixed content, whitespace-only text nodes often matter. Take a look at the following example:
Inside the "p" element we have two spaces: a space between "I" and "<b>" and a space between "</b>" and "<i>". Only the last space is a whitespace-only text node. But in this case we can't just remove it as we please. If we do, we remove an important space between two words.
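A short Python sketch shows what is at stake. The sentence below is a hypothetical stand-in with the shape just described, and string_value() is an illustrative helper, not a library function. It concatenates the text of an element, optionally dropping whitespace-only text nodes the way a stripping tool would.

```python
from xml.dom import minidom

# Hypothetical mixed content: a space after "I" and a space between </b> and <i>.
SAMPLE = "<p>I <b>love</b> <i>XML</i></p>"


def string_value(node, strip_whitespace_only=False):
    """Concatenate all descendant text, optionally skipping whitespace-only nodes."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if strip_whitespace_only and child.data.strip(" \t\r\n") == "":
                continue  # this is what a whitespace-stripping tool would drop
            parts.append(child.data)
        else:
            parts.append(string_value(child, strip_whitespace_only))
    return "".join(parts)


p = minidom.parseString(SAMPLE).documentElement
kept = string_value(p)
stripped = string_value(p, strip_whitespace_only=True)

print(kept)      # I love XML
print(stripped)  # I loveXML
```

The space after "I" survives stripping because it shares a text node with a non-whitespace character. The space between the two inline elements does not.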
To prevent such whitespace from being removed we can add xml:space to the "p" element:
In XSLT we have another option:
In XSLT we have two elements made for dealing with whitespace-only text nodes, xsl:strip-space and xsl:preserve-space. The first is often used like this: <xsl:strip-space elements="*"/>, meaning that whitespace-only text nodes must be deleted from all elements. In xsl:preserve-space we can specify a list of exception elements where whitespace-only text nodes must be preserved, e.g.: <xsl:preserve-space elements="p td"/>, meaning that whitespace-only text nodes must be preserved inside "p" and "td" elements.
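The combined behaviour can be imitated in Python. This is a rough sketch under assumptions, not how any XSLT processor is actually implemented: strip_space() and its preserve_names parameter are hypothetical, and so is the sample document. It removes whitespace-only text nodes everywhere except inside the listed elements or where xml:space="preserve" is in effect.

```python
from xml.dom import minidom


def strip_space(element, preserve_names=(), inherited="default"):
    """Remove whitespace-only text nodes below `element`, in place.

    A whitespace-only node is kept when its parent element is named in
    `preserve_names` (in the spirit of xsl:preserve-space) or when
    xml:space="preserve" is in effect for that parent.
    """
    if element.hasAttribute("xml:space"):
        inherited = element.getAttribute("xml:space")
    keep = inherited == "preserve" or element.tagName in preserve_names
    for child in list(element.childNodes):  # copy, since we remove while looping
        if child.nodeType == child.TEXT_NODE:
            if not keep and child.data.strip(" \t\r\n") == "":
                element.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_space(child, preserve_names, inherited)


# Hypothetical document: the same inline markup in "p" and in "h1".
SAMPLE = (
    "<doc>\n"
    "  <p>I <b>love</b> <i>XML</i></p>\n"
    "  <h1><b>Big</b> <i>news</i></h1>\n"
    "</doc>"
)

doc = minidom.parseString(SAMPLE)
strip_space(doc.documentElement, preserve_names=("p", "td"))
result = doc.documentElement.toxml()
print(result)
```

The indentation between the block elements is gone, the space between the inline elements in "p" is kept, and the same space in "h1" is lost because "h1" is not in the list of exceptions.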
It is all about whitespace-only text nodes. The xml:space="preserve" attribute is trying to solve the same problem as xsl:preserve-space.
One XSLT processor, AltovaXML used in XMLSpy, has not implemented xsl:preserve-space. In XMLSpy whitespace-only text nodes are always stripped, making that processor a bad choice if you work a lot with mixed content. The "xml:space" attribute could be considered an XMLSpy developer's best friend!
A web browser is likely not to support xml:space. It is not even supported in a dedicated XML editor like XMLSpy 2010: you can pretty-print any XML document, whether or not it carries xml:space="preserve". Other XML editors like Oxygen and Stylus Studio 2010 have implemented xml:space. General programming code manipulating XML is most likely unaware of xml:space. XSLT processors like Saxon and AltovaXML support it.
Just a last example to show how difficult it can be to get xml:space right. In the XML schema behind WordprocessingML as created by MS Word 2003 when saving to XML, some MS developer has forgotten to delete an xml:space="preserve" attribute on the top element of the schema, which is itself XML. When the schema is opened in XML editors recognizing xml:space, it is impossible to pretty-print the schema. Without pretty-printing, the schema does not make sense to the human eye. The trick is to delete the xml:space attribute before pretty-printing!