Jesper Tverskov, March 15, 2006

Tricky whitespace handling in XSLT

The xsl:strip-space and xsl:preserve-space elements are only relevant for whitespace-only text nodes. Some XSLT processors have not even implemented these elements but strip such nodes themselves.

Whitespace consists of one or more space (#x20) characters, carriage returns (#xD), line feeds (#xA), or tabs (#x9). Non-breaking space, "&#160;" or "&nbsp;", is not considered whitespace in this context.

Whitespace can be used for indentation to make XML structures look nice, and we want at least one space character between words to keep them apart. [1]

There are typically two situations where we want to get rid of whitespace:

  1. A text string can contain too many spaces, or "holes", here and there that we want to reduce to a single space between words. We can use the normalize-space() function for that job.
  2. Sometimes we want to get rid of indentation, newlines, etc., inside elements that have only other elements as content. For this we can use the <xsl:strip-space/> element. If exceptions to general stripping are needed, we can use the <xsl:preserve-space/> element.

1. Whitespace in a text string

If we look at the "description" element below, it would be nice to get rid of the insignificant whitespace. We have so-called leading whitespace before "This", we have trailing whitespace after "test.", and we have too many spaces between "This" and "is".

<description> This    is a test. </description>

The xsl:strip-space and xsl:preserve-space elements have nothing to do with such whitespace problems. The two elements are only relevant for whitespace-only text nodes. If just one non-whitespace character is present in a text node, the two XSLT elements are irrelevant.

1.1 normalize-space()

If we want to get rid of insignificant whitespace in the "description" element above, we could use the normalize-space() function:

normalize-space(description)

The normalize-space() function deletes leading and trailing whitespace and replaces every sequence of whitespace characters (spaces, tabs, carriage returns, line feeds) with a single space character.
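As a minimal sketch, the function can be called from xsl:value-of. The stylesheet below assumes that the input document consists of just the "description" element shown above; the output element name "clean" is made up for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Assumption: the document element is the "description" element from above -->
  <xsl:template match="/">
    <!-- "clean" is a hypothetical output element name -->
    <clean>
      <xsl:value-of select="normalize-space(description)"/>
    </clean>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input, the "clean" element should contain "This is a test." with single spaces between the words and nothing before or after.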

1.2 translate() and replace()

If we want to delete, for example, all forms of whitespace, even the spaces between words, we can use the translate() function with an empty string as its third argument:

translate(description, '&#x20;&#x9;&#xD;&#xA;', '')

In XSLT 2.0 we can also use the replace() function, which takes a regular expression as its second argument.
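A minimal sketch of the replace() approach could look like the stylesheet below. It needs an XSLT 2.0 processor, and it assumes once more that the input document is just the "description" element from section 1. The pattern \s+ matches one or more spaces, tabs, carriage returns or line feeds.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>
  <!-- Assumption: the document element is the "description" element -->
  <xsl:template match="/">
    <!-- Replace every run of whitespace with the empty string -->
    <xsl:value-of select="replace(description, '\s+', '')"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input this should give "Thisisatest.". Using ' ' instead of '' as the third argument would collapse each run of whitespace to one space, but, unlike normalize-space(), it would leave one leading and one trailing space in place.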

2. Whitespace-only text nodes

Take a look at the XML file below:

<test>
  <item/>
  <item/>
  <item/>
</test>

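The "test" element contains three "item" elements, but it also contains four newline characters and three tab characters (or runs of spaces) used for indentation. These invisible characters make up four whitespace-only text nodes: one before the first "item" element, one between each pair of "item" elements, and one after the last.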
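One way to make these invisible nodes visible is to count them. The stylesheet below is only a sketch for that purpose; it assumes the XML file above as input and a processor that does not strip whitespace-only text nodes on its own.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- text() selects the text node children of "test" -->
    <xsl:value-of select="count(test/text())"/>
    <xsl:text> text nodes, </xsl:text>
    <!-- * selects the element children of "test" -->
    <xsl:value-of select="count(test/*)"/>
    <xsl:text> elements</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With such a processor the report should say 4 text nodes and 3 elements. A processor that strips whitespace-only text nodes in advance would report 0 text nodes instead.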

3. Why strip whitespace-only text nodes

Often an XML input file contains insignificant whitespace, such as the indentation of pretty-printed element structures. This is nice to look at, but it adds to the size of the document in memory and to the processing time, and often we don't want to transfer such whitespace to the output file.

This is one good reason to add xsl:strip-space as a top-level element in every single XSLT stylesheet. Both xsl:strip-space and xsl:preserve-space have a required attribute, "elements", taking as value a whitespace-separated list of element names. For xsl:strip-space, "*" is the most common attribute value:

<xsl:strip-space elements="*"/>

When using xsl:strip-space we can still use indent="yes" in the xsl:output element, or in the xsl:result-document element in XSLT 2.0, to add indentation to the output file if we prefer that.
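The combination can be sketched as follows. The identity template is our own choice for the illustration, it simply copies the input, so the only visible effect is that the old indentation is dropped and new indentation is added by the serializer.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Drop all whitespace-only text nodes from the input tree -->
  <xsl:strip-space elements="*"/>
  <!-- Let the serializer add fresh indentation to the output -->
  <xsl:output method="xml" indent="yes"/>
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

How the indentation looks exactly (number of spaces, line breaks) is up to the processor.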

4. xsl:strip-space and position()

There is another important reason why it is most often a must to get rid of whitespace-only text nodes.

Let us take one more look at the XML sample from before:

4.1 Input XML

<test>
  <item/>
  <item/>
  <item/>
</test>

As we said earlier, the "test" element contains three "item" elements but also four newline characters and three tab characters (or ordinary space characters) for indentation. If we make an XSLT stylesheet that uses the position() function, we may run into problems, as seen in the XSLT stylesheet and the output XML below.

4.2 XSLT stylesheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <test>
      <xsl:apply-templates/>
    </test>
  </xsl:template>
  <xsl:template match="item">
    <no>
      <xsl:value-of select="position()"/>
    </no>
  </xsl:template>
</xsl:stylesheet>

4.3 Output XML

<test>
  <no>2</no>
  <no>4</no>
  <no>6</no>
</test>

What we want is output saying 1, 2, 3, but the whitespace-only text nodes have also been counted. Similar problems can occur when count(), last() and xsl:number are used. If our XPath expressions are specific, or we explicitly select only elements in our processing, we will not run into these problems.

But if we rely on the default value of the select attribute in xsl:apply-templates, or rely on the built-in templates in XSLT (which we should do as much as possible), the above problem is bound to pop up if we don't strip whitespace-only text nodes.
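As a sketch of the fix, here is the stylesheet from 4.2 with a single xsl:strip-space declaration added and nothing else changed:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Remove whitespace-only text nodes from the input tree before processing -->
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <test>
      <xsl:apply-templates/>
    </test>
  </xsl:template>
  <xsl:template match="item">
    <no>
      <!-- Only the three "item" elements are left in the current node list -->
      <xsl:value-of select="position()"/>
    </no>
  </xsl:template>
</xsl:stylesheet>
```

The three "no" elements should now contain 1, 2 and 3. The other way out, selecting elements explicitly, would be to write <xsl:apply-templates select="test/item"/> in the first template and leave the whitespace-only text nodes where they are.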

5. xsl:strip-space and xsl:preserve-space

Now we know what xsl:strip-space and xsl:preserve-space were made for. In many XSLT stylesheets we use <xsl:strip-space elements="*"/> as a top-level element to get rid of all whitespace-only text nodes.

When needed we also use <xsl:preserve-space elements="…"/> with a whitespace-separated list of names of elements in which whitespace-only text nodes should not be stripped.

6. Whitespace-only in mixed content

In so-called mixed content, whitespace-only text nodes are rare but can be really tricky if they exist. Take a look at this markup sample:

<para>I <strong>love</strong> <name>Mozart</name></para>

The space character between </strong> and <name> is a whitespace-only text node (the only one in the sample), but if we remove it we also remove the space between "love" and "Mozart".

That is why it is very common to use xsl:preserve-space as well when xsl:strip-space is used. We don't want whitespace-only text nodes to be stripped from mixed content. A whitespace-separated list of elements with mixed content could look like this:

<xsl:preserve-space elements="resume headline para"/>
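Put together, a minimal sketch for the "para" sample above could look like this. The output element "p" is only chosen for the illustration, and only "para" is listed because it is the only mixed-content element in the sample.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Strip whitespace-only text nodes everywhere ... -->
  <xsl:strip-space elements="*"/>
  <!-- ... except directly inside "para", which has mixed content -->
  <xsl:preserve-space elements="para"/>
  <xsl:template match="para">
    <!-- "p" is a hypothetical output element; value-of gives the text of "para" -->
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>
```

With a processor that respects the two declarations, the result should read "I love Mozart". Without the xsl:preserve-space line it would read "I loveMozart". The name "para" wins over "*" because a specific name has higher priority than the wildcard, just as in template matching.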

7. XML Editors and XSLT processors

Since it is nice to get rid of whitespace-only text nodes, at least in elements containing elements only, many XSLT processors do it themselves before the XSLT stylesheet takes over. It sounds nice, but it makes xsl:strip-space and xsl:preserve-space irrelevant, and we lose control. [2]

The XSLT processors built into XMLSpy and Stylus Studio do not support xsl:strip-space and xsl:preserve-space. Whitespace-only text nodes are always stripped, even in mixed content where we need them. This cannot be overridden by using xsl:preserve-space. What is stripped before the stylesheet takes over cannot be reintroduced by the stylesheet.

Microsoft's XSLT processors, MSXML and .NET, also strip whitespace-only text nodes, but only by default. When using Microsoft's XSLT processors from code such as Visual Basic or C#, we can turn the default behaviour off with the preserveWhiteSpace property.

When Microsoft's XSLT processors are used in XMLSpy, only the default behaviour, always stripping whitespace-only text nodes, is possible. In Oxygen and in Stylus Studio, the default behaviour of the Microsoft processors can be turned on or off. Off is the default setting in these editors; that is, the processors' default is not the editors' default, just to confuse us!

Most other XSLT processors, like Saxon and Xalan, work as they should: they leave the control to us and give xsl:strip-space and xsl:preserve-space their meaning. With these XSLT processors we must almost always use xsl:strip-space and xsl:preserve-space to control whitespace-only text nodes.

8. xml:space="preserve"

The XSLT processors that strip whitespace-only text nodes in advance, like XMLSpy, Stylus Studio, and the XSLT processors of Microsoft, also strip whitespace-only text nodes in mixed content, where they should have been left alone!

There is no way to undo this in the XSLT stylesheet. The xsl:preserve-space element will not work; the whitespace-only text nodes are gone forever. To keep such whitespace you must change the XML input file. You must detect each and every instance of whitespace that should be preserved and insert an xml:space="preserve" attribute in its parent element or in an ancestor. [3]

Or you must do dubious tricks like inserting a space inside one of the two elements that need to be kept apart, like: "love </strong><name>Mozart". Notice the space between "love" and </strong>.
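The first option, changing the input file, could look like this for the "para" sample from section 6. This is only a sketch of the input change; whether a given processor honors the attribute should be tested with that processor.

```xml
<!-- xml:space="preserve" asks applications to keep all whitespace inside "para" -->
<para xml:space="preserve">I <strong>love</strong> <name>Mozart</name></para>
```

The xml prefix is predefined in XML, so no namespace declaration is needed for the attribute.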

9. Priority of strip and preserve space

If the xml:space attribute is used on the whitespace-only text node's parent element or on one of its ancestors, the value of the nearest xml:space attribute rules the waves. If it is "preserve", the whitespace-only text node is preserved. If it is "default", it depends on the processor or on xsl:strip-space and xsl:preserve-space.

If the XSLT stylesheet imports or includes other XSLT stylesheets, some using xsl:strip-space and some xsl:preserve-space, conflicts are resolved by the same rules as conflicts between template rules: declarations in imported stylesheets have lower precedence, etc. For more details see the XSLT Recommendation:

http://www.w3.org/TR/xslt#strip

10. Whitespace in the XSLT stylesheet

In the XSLT stylesheet itself, whitespace-only text nodes are always stripped unless they are inside an xsl:text element or the xml:space attribute is used.

To get whitespace-only text nodes preserved, the use of xsl:text is the standard method. But it should only be used when necessary. In the following example the xsl:text element is necessary to get a space between firstname and lastname.

<xsl:value-of select="firstname"/>
<xsl:text> </xsl:text>
<xsl:value-of select="lastname"/>
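If we prefer to avoid xsl:text here, one possible alternative is to build the whole string in one XPath expression with concat(). The sketch below assumes a hypothetical "person" element as the parent of firstname and lastname.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <!-- "person" is a hypothetical parent of firstname and lastname -->
  <xsl:template match="person">
    <!-- The space is part of a string literal, so it cannot be stripped -->
    <xsl:value-of select="concat(firstname, ' ', lastname)"/>
  </xsl:template>
</xsl:stylesheet>
```

The space is no longer a text node in the stylesheet at all, so the stripping rules do not apply to it.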

An empty xsl:text element can also be useful at the end of a line, to decide whether the newline character that follows ends up in the output. In the following example each value will be on a new line, because the invisible newline character stands next to the comma and is therefore not part of a whitespace-only text node.

<xsl:for-each select="item">
  <xsl:value-of select="."/>,
</xsl:for-each>

In the following example the output will be one long line. The empty xsl:text element between the comma and the invisible newline character makes the newline character a whitespace-only text node, and it is stripped from the output.

<xsl:for-each select="item">
<xsl:value-of select="."/>, <xsl:text/>
</xsl:for-each>

The above can also happen if output is XHTML using the <br/> element:

<xsl:for-each select="item">
  <xsl:value-of select="."/>, <br/>
</xsl:for-each>

The above will create new lines in the output when seen in the browser but not in the source code. If we also want new lines in the source code we can do this:

<xsl:for-each select="item">
  <xsl:value-of select="."/>, <br/>&#160;
</xsl:for-each>

The non-breaking space means that the invisible newline character is no longer stripped. The invisible newline character now stands next to a non-whitespace character (&#160;), so it is no longer a whitespace-only text node.
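If the extra non-breaking space in the output is not wanted, a possible alternative is to write the newline explicitly inside xsl:text, where whitespace is never stripped. The sketch below assumes that the "item" elements are children of the document element; the "div" wrapper is made up for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Assumption: the "item" elements are children of the document element -->
  <xsl:template match="/*">
    <!-- "div" is a hypothetical wrapper element -->
    <div>
      <xsl:for-each select="item">
        <xsl:value-of select="."/>, <br/>
        <!-- Whitespace inside xsl:text is never stripped: this emits one line feed -->
        <xsl:text>&#xA;</xsl:text>
      </xsl:for-each>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

Each item should then be followed by a comma, a space, a "br" element and a real line break in the source code, without any non-breaking space.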

Footnotes

[1]

Some books with good coverage of whitespace:

  1. XSLT 2.0, Programmer's Reference, 3rd Edition, Michael Kay, 2004, pages 136-143.
  2. XSLT and XPath, On The Edge, Jeni Tennison, 2001, pages 419-423.
  3. XSLT Cookbook, Sal Mangano, 2003, pages 150-155.
  4. Effective XML, Elliotte Rusty Harold, 2004, pages 52-58.

[2]

Watch out for quick and dirty tutorials. W3schools.com mentions neither whitespace-only text nodes nor that many XSLT processors strip such whitespace themselves, putting xsl:strip-space and xsl:preserve-space out of action.

[3]

The xml:space attribute is defined in the XML Recommendation. Only "preserve" and "default" are legal values. The value "preserve" is a signal to applications that whitespace matters and should be left alone. XML editors like Oxygen, Stylus Studio and MS Visual Studio honor the xml:space attribute as they should, but XMLSpy 2006 release 2 does not.

Elements using xml:space="preserve" must not be pretty-printed in an XML editor, and XSLT processors must not strip whitespace-only text nodes inside such elements. It is almost never necessary to use the "default" value. It signals that it is up to the application to do what it normally does with whitespace-only text nodes.

Updated 2009-08-06