Jesper Tverskov, January 19, 2008

Identity transformation for XSLT 2.0

The traditional identity template has several shortcomings. The most important are that the XML declaration and the DTD are not recreated and that default attributes found in the DTD are copied to the output. In XSLT 2.0, using the Saxon extension functions saxon:parse() and saxon:serialize(), it is possible to supplement the identity template with extra templates and instructions that overcome nearly all of these limits and inconveniences.

In this third edition of the article, the stylesheet has reached version 2.1. All versions of the stylesheet since version 1.2 have been tested with all valid and invalid files in the "Extensible Markup Language (XML) Conformance Test Suites" (see later). [1]

The identity template is mentioned in the specs for XSLT 2.0, http://www.w3.org/TR/xslt20/#copying, and XSLT 1.0, http://www.w3.org/TR/xslt#copying. The idea is to recreate everything from input to output. This is useful when additional, more specific templates are added to override the copying for some of the nodes. For an introduction to the power of the identity template, see my article, Identity Template: xsl:copy with recursion.
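As a reminder, here is a minimal sketch of the traditional identity template with one overriding template added. The element name "note" is only an assumption for illustration; it does not come from any particular input document.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Identity template: copy every node and every attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical override: drop every "note" element and its content.
       This more specific pattern has a higher priority than node(). -->
  <xsl:template match="note"/>

</xsl:stylesheet>
```

Everything except the "note" elements is copied to the output, subject to the shortcomings listed below.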

1. Identity template is not complete

  1. The XML declaration is not recreated.
  2. The DOCTYPE declaration is not recreated.
  3. Default attributes in DTD are added to the output.
  4. Whitespace in the prolog is ignored outside comments and PIs.
  5. Leading whitespace is normalized for content in PIs in prolog.
  6. CDATA sections are replaced with their content escaped.
  7. Character entities are replaced.
  8. Whitespace is normalized in attribute values.
  9. The order of attributes is not always the same.
  10. Non-significant whitespace is removed.

1.1 The XML declaration is not recreated

We can use the xsl:output and xsl:result-document elements to recreate the XML declaration, and the unparsed-text() function and Regular Expressions to read it. It gives us a lot of extra work if we want to transform many files with different XML declarations.
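A minimal sketch of reading the XML declaration could look like this. It assumes that document-uri() returns a URI that unparsed-text() can reload and that the file does not start with a byte order mark; the regular expression is my own simplification for illustration and is not taken from the stylesheet later in this article.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Reload the source document as plain text. -->
    <xsl:variable name="raw" select="unparsed-text(document-uri(.))"/>
    <!-- An XML declaration, if present, must be the very first thing in the file. -->
    <xsl:variable name="decl" select="if (matches($raw, '^&lt;\?xml\s')) then concat(substring-before($raw, '?&gt;'), '?&gt;') else ''"/>
    <xsl:text>declaration: </xsl:text>
    <xsl:value-of select="$decl"/>
    <xsl:text>&#10;encoding: </xsl:text>
    <!-- Pick the encoding name out of the declaration, if there is one. -->
    <xsl:analyze-string select="$decl" regex="encoding\s*=\s*[&quot;']([A-Za-z][A-Za-z0-9._\-]*)[&quot;']">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

If the input has no XML declaration or no encoding pseudo-attribute, the corresponding line of the report is simply left empty.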

1.2 The DOCTYPE declaration is not recreated

We can use the xsl:output and xsl:result-document elements to recreate the DTD declaration except for internal subset, but we must use the unparsed-text() function and Regular Expressions to read it. It gives us a lot of extra work if we want to transform many files with different DTD declarations.

1.3 Default attributes in DTD are added to the output

This is an extremely irritating problem when transforming e.g. XHTML to XHTML. We can make extra templates to prevent default attributes from being copied out, but it can be a lot of extra work and you must know the details of the DTDs.
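As a sketch of such extra templates, the stylesheet below suppresses a few attributes that the XHTML 1.0 DTDs declare with default values (shape="rect" on a, rowspan="1" and colspan="1" on td and th). The selection of attributes is only an example, and note the weakness: XSLT cannot see whether the value came from the DTD or was typed by the author, so explicitly written attributes with the same values are removed as well.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                version="2.0">

  <!-- Identity template. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Do not copy attributes that carry their DTD default value.
       This list is illustrative, not complete. -->
  <xsl:template match="h:a/@shape[. = 'rect']
                     | h:td/@rowspan[. = '1'] | h:td/@colspan[. = '1']
                     | h:th/@rowspan[. = '1'] | h:th/@colspan[. = '1']"/>

</xsl:stylesheet>
```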

1.4 Whitespace in the prolog is ignored outside comments and PIs

This problem is a minor one, but it can be tiresome to add linefeeds and indentation to the prolog of the output file just to recreate optimal readability.

1.5 Leading whitespace is normalized for content in PIs in prolog

This problem we can almost always live with. We seldom use indentation and linefeeds between PI name and content. [2]

1.6 CDATA sections are replaced with their content escaped

Not a problem most of the time but we can have reasons for preferring to keep the CDATA sections. I like them so I can remind my students what they look like.

1.7 Character entities are replaced

This is most often what we want, but sometimes it can be irritating. For example, XHTML web designers use &nbsp; (non-breaking space) between words in headlines to prevent line breaks in small browser windows. When &nbsp; is replaced by a literal non-breaking space character, which looks like an ordinary space, you can no longer check that you have remembered to use non-breaking spaces just by looking at the source code.

1.8 Whitespace is normalized in attribute values

Tabs and linefeeds are replaced with spaces. We can live with that. I am not going to fight for getting such whitespace back.

1.9 The order of attributes is not always the same

We can live with that. For Saxon 9B I have only noticed this in elements also using one of the special xml attributes like xml:space.

1.10 Non-significant whitespace is removed

For example, if an attribute looks like this: id = "asdf", the whitespace between the attribute name and "=" and between "=" and the first quote is non-significant. Likewise, the whitespace between "x" and ">" in "</x >" is non-significant. The removal of non-significant whitespace is probably what we always want.

2. XSLT 2.0 helps us out

In XSLT 2.0 we can load an XML document as unparsed text and use Regular Expressions to find and replace. This makes it possible to supplement the identity template with extra templates and instructions that all together could be called "the better identity template" for XSLT 2.0.

2.1 XSLT 2.0 processors

At the time of this writing we only have three XSLT 2.0 processors: Saxon, AltovaXML and Gestalt. Only Saxon and AltovaXML are in widespread use. AltovaXML is probably not used much outside XMLSpy, one of the leading XML editors. [3] Gestalt, as far as I know, is mostly an academic exercise. [4]

2.2 Not pure XSLT 2.0

In my solution for a better identity transformation for XSLT 2.0, I use two Saxon extension functions, saxon:parse() and saxon:serialize(). It makes my solution not pure XSLT 2.0. Functions like saxon:parse() and saxon:serialize() should have been part of the spec to make things easy, even if it turns out that it is possible to do the same with advanced user-defined functions in pure XSLT 2.0. [5]

An alternative solution could be to use temporary files reloaded with the collection() function (works in Saxon and AltovaXML) or with the unparsed-text() function (works in Saxon). But according to Michael Kay, the maker of Saxon, these functions can read files created by the same transformation only because he has not had the time to prevent them from doing so. [6]

For now I stick to my solution making use of Saxon's extension functions saxon:parse() and saxon:serialize().

3. Better XSLT 2.0 identity template

The better or true identity template, identity-template.xsl, is an XSLT 2.0 stylesheet with several instructions and templates in addition to the identity template mentioned in the spec.

  1. We start loading the XML document as unparsed text to get hold of the prolog with Regular Expressions.
  2. If XML declaration and DTD are detected we replace them with PIs. Disguising the DTD as PI has an extra benefit: default attributes in the DTD are not copied to the output.
  3. The "&#" is replaced with a restricted word to protect character entities. This feature is optional.
  4. The rest of the ampersands are replaced with a restricted word. Ampersands could be the beginning of entities declared in the now hidden DTD.
  5. To preserve linefeeds and indentation in the prolog, we wrap the part of the prolog before the top-element in a new element. The part of the prolog below the top-element is also wrapped in a new element.
  6. To preserve leading whitespace (linefeeds) inside PIs in the prolog, a restricted word is inserted to protect the whitespace.
  7. In the old top-element the CDATA sections are replaced by PIs.
  8. We create a new top-element with three children: our new element for the "prolog-before", the old top-element and our new element for "prolog-after".
  9. We then convert our new unparsed text document back to XML using saxon:parse() and we feed it to the traditional identity template.
  10. We use additional templates to delete our temporary elements except for their content. We also convert our temporary PIs back to XML declaration and DTD.
  11. Before we use saxon:serialize() we replace the last of our restricted replacement words back.
  12. We use xsl:result-document to get encoding right.
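The core of the round trip (steps 1, 2, 9, 10 and 11) can be sketched in a much reduced form as follows. This is only an illustration under stated assumptions, not the real stylesheet: it handles the XML declaration only, it assumes an input document without a DOCTYPE declaration, it assumes a Saxon version that provides saxon:parse() and saxon:serialize() with a named xsl:output format as second argument, and the restricted word 'zzxmldeclzz' is hypothetical.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://saxon.sf.net/"
                version="2.0" exclude-result-prefixes="#all">

  <!-- The final result is a string that we assemble ourselves. -->
  <xsl:output method="text"/>
  <!-- Named format used by saxon:serialize(); no new XML declaration is added. -->
  <xsl:output name="plain" method="xml" omit-xml-declaration="yes" indent="no"/>

  <!-- Hypothetical restricted word, not one of the words used by the real stylesheet. -->
  <xsl:param name="hide-xmldecl" select="'zzxmldeclzz'"/>

  <xsl:template match="/">
    <!-- Step 1: load the input as unparsed text. -->
    <xsl:variable name="raw" select="unparsed-text(document-uri(.))"/>
    <!-- Step 2: disguise the XML declaration as an ordinary PI. -->
    <xsl:variable name="hidden" select="replace($raw, '^&lt;\?xml(\s)', concat('&lt;?', $hide-xmldecl, '$1'))"/>
    <!-- Step 9: back to XML, then through the identity template. -->
    <xsl:variable name="copied">
      <xsl:apply-templates select="saxon:parse($hidden)/node()" mode="copy"/>
    </xsl:variable>
    <!-- Steps 10 and 11: serialize and turn the PI back into an XML declaration. -->
    <xsl:value-of select="replace(saxon:serialize($copied, 'plain'), concat('^&lt;\?', $hide-xmldecl), '&lt;?xml')"/>
  </xsl:template>

  <!-- The traditional identity template, here in its own mode. -->
  <xsl:template match="@*|node()" mode="copy">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="copy"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The sketch ignores whitespace in the prolog and always writes the result in the default output encoding, whatever the recreated declaration says. The full stylesheet deals with both issues.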

We end up with a true identity transformation, 100% except for non-significant whitespace, normalization of attribute values and the order of attributes in some situations. We can of course add additional templates with more exact matching when we want to change something in the output.

4. Additional templates

It is meaningless just to make an identity transformation. It is easier to copy the file! An identity transformation is only interesting if we choose to change something in the output. The ability to add extra templates to handle exceptions must be intact in a solution for a better identity transformation for XSLT 2.0.

In my solution we have full access to the XML declaration, the DTD declaration, all elements, attributes, PIs and comments. But some of our replacement words give us restrictions when it comes to text nodes and attribute values.

Five of our 14 replacement words are replaced back before the transformation takes place, but nine replacement words are still there during the transformation and are only replaced back just before serialization.

4.1 DTD declaration

Three of the replacement words related to the DTD declaration are still active during the transformation but they are not a problem. In a normal identity transformation the DTD is not even recreated:

4.2 CDATA sections

The same goes for CDATA sections. In a traditional identity transformation they are replaced with their content escaped:  

4.3 Leading whitespace in PIs in prolog

Is also not a problem:

4.4 Ampersand and &#

The only real but very minor problems are:   

Ampersands can only be replaced back once the DTD has been recreated, because there could be ENTITY declarations in the DTD. Hiding the beginning of character entities is an option you can turn off if you don't need it. If we want to protect character entities, they can only be replaced back just before serialization.

If we make additional templates we can still detect and modify content in elements and attribute values, but if we want to do something with ampersands or "&#" we must use the replacement words.

5. The biggest challenges

Making a better identity template for XSLT 2.0 is not easy considering that it must be able to handle any XML input document no matter how crazy.

5.1 Restricted words for replacements

In version 2.0 of the stylesheet, I am using 14 restricted words for replacements. The stylesheet tests whether any of them occur in the input XML document, giving you a chance to change the parameters if necessary.

Some of the replacement words could have been made using xsl:character-map and user-defined Unicode characters, but to make the solution easier to understand, and to make it easier to pass in your own restricted words in a uniform way as parameters if necessary, all replacement "words" are words.

5.2 Locating the top-element

It sounds easy, but it proved to be one of the most difficult tasks. I load the input document as XML with the document() function, making it easy to find the name of the top-element with "name(document(document-uri(.))/*)". But to locate the top-element in the unparsed text, we need to replace everything in the prolog that looks like the top-element. We could have false top-elements in comments, in PIs, and in ENTITY declarations inside the internal subset of a DTD declaration.

How do we locate the end tag of the top-element? We first remove everything that belongs to the prolog (when the document is loaded as unparsed text, some of it can come after the end tag of the top-element). We then know that the end tag of the top-element must be the very last tag in what is left of the document. To locate it, I split the document with the tokenize() function, using the end tag of the top-element as the splitter. I then assemble the document again with the string-join() function and add a restricted word to the very last end tag, so I can find it again later.
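The split-and-join idea can be sketched on a small literal string. The marker word 'zzendzz' and the sample text are assumptions for illustration, and the sketch is a simplification of what the stylesheet does: tokenize() treats the splitter as a regular expression, so an element name containing characters such as "." would need escaping, and an empty top-element (no end tag at all) must be handled separately.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Hypothetical restricted word used to mark the last end tag. -->
  <xsl:param name="marker" select="'zzendzz'"/>

  <!-- Runs against any input document; the text to process is a literal. -->
  <xsl:template match="/">
    <xsl:variable name="text" select="'&lt;x&gt;&lt;x&gt;a&lt;/x&gt;b&lt;/x&gt;&lt;!-- tail --&gt;'"/>
    <xsl:variable name="end-tag" select="'&lt;/x'"/>
    <!-- Split at every occurrence of the end tag. -->
    <xsl:variable name="parts" select="tokenize($text, $end-tag)"/>
    <!-- Join all parts but the last with the end tag, then put the marker
         in front of the final end tag and append the last part. -->
    <xsl:value-of select="concat(string-join($parts[position() lt last()], $end-tag), $marker, $end-tag, $parts[last()])"/>
  </xsl:template>

</xsl:stylesheet>
```

For the sample text the nested end tag is left alone and only the last one is marked: the output is the string <x><x>a</x>bzzendzz</x> followed by the trailing comment.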

5.3 Locating DTD declaration

The problem with a DTD declaration is not as much that we can have what looks like DTDs inside comments, inside Processing Instructions and inside CDATA sections, but that the DTD can have an internal subset containing comments, PIs and ENTITY declarations that can contain any piece of markup looking like comments, PIs, and DTD declarations.

The identity transformation will ignore the DTD declaration unless we hide it as a PI, but a PI cannot contain other PIs, not even the end of a PI, "?>". When we replace the DTD with a PI we must also replace the PIs inside the DTD with a restricted word and replace all question marks inside the DTD with a restricted word.
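A minimal sketch of hiding a DTD declaration as a PI, and getting it back again, is shown below. The two restricted words and the sample DOCTYPE are assumptions for illustration, and the sketch simplifies matters by replacing every question mark in one go instead of treating PIs inside the internal subset separately.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Hypothetical restricted words. -->
  <xsl:param name="dtd-word" select="'zzdtdzz'"/>
  <xsl:param name="qmark-word" select="'zzqmarkzz'"/>

  <!-- Runs against any input document; the DTD to hide is a literal. -->
  <xsl:template match="/">
    <xsl:variable name="dtd" select="'&lt;!DOCTYPE x [&lt;!ENTITY q &quot;why?&quot;&gt;]&gt;'"/>
    <!-- Hide: replace every "?" and wrap the whole declaration in a PI. -->
    <xsl:variable name="as-pi" select="concat('&lt;?', $dtd-word, ' ', replace($dtd, '\?', $qmark-word), '?&gt;')"/>
    <!-- Restore: strip the PI wrapper and put the question marks back.
         The first "?&gt;" is the end of the PI because no "?" is left inside. -->
    <xsl:variable name="restored" select="replace(substring-before(substring-after($as-pi, concat('&lt;?', $dtd-word, ' ')), '?&gt;'), $qmark-word, '?')"/>
    <xsl:value-of select="$as-pi, $restored, $restored eq $dtd" separator="&#10;"/>
  </xsl:template>

</xsl:stylesheet>
```

The three output lines are the DTD disguised as a PI, the restored DTD, and "true" to confirm that the restored text equals the original.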

The biggest problem is locating the end of the DTD, which is just a ">" or "]>" or "]…>". It is not easy to find when a DTD can contain comments, PIs, and all sorts of declarations that can contain ">" or "]>" or "]…>" all over the place and that all end with a ">".

We must use xsl:analyze-string to take XML declaration, comments and PIs out of the prolog. It doesn't matter if false comments and PIs are found in ENTITY declarations. When everything except what is left of the DTD declaration is removed from the prolog, we know that the very last ">" found in the leftovers must be the end of the DTD. We use tokenize() with ">" as splitter and string-join() function as mentioned earlier, to locate the last ">", and we add a restricted word to it to make it easy to find again.
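The following sketch shows the idea on a small literal prolog. The restricted word and the sample prolog are assumptions for illustration. The regular expression for comments and PIs is the same one the stylesheet uses elsewhere, but the rest is simplified.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Hypothetical restricted word marking the end of the DTD. -->
  <xsl:param name="end-marker" select="'zzenddtdzz'"/>

  <!-- Runs against any input document; the prolog to process is a literal. -->
  <xsl:template match="/">
    <xsl:variable name="prolog" select="'&lt;!-- a &gt; b --&gt;&lt;!DOCTYPE x [&lt;!ENTITY e &quot;]&gt;&quot;&gt;]&gt;&lt;?pi data?&gt;'"/>
    <!-- Take comments and PIs out of the prolog; keep only what is left. -->
    <xsl:variable name="leftovers">
      <xsl:analyze-string select="$prolog" regex="&lt;\?.*?\?&gt;|&lt;!--.*?--&gt;" flags="s">
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- The very last "&gt;" in the leftovers must be the end of the DTD. -->
    <xsl:variable name="parts" select="tokenize($leftovers, '&gt;')"/>
    <xsl:value-of select="concat(string-join($parts[position() lt last()], '&gt;'), $end-marker, '&gt;', $parts[last()])"/>
  </xsl:template>

</xsl:stylesheet>
```

In the sample, the ">" inside the comment and the "]>" inside the ENTITY value do not mislead the search: the marker ends up immediately before the final ">" of the DOCTYPE declaration.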

6. Stylesheet for the better identity template

The stylesheet has reached version 2.1 and has been tested with the Extensible Markup Language (XML) Conformance Test Suites [7]. The stylesheet transforms all valid and invalid test files, making it possible to claim 100% conformance with the test suites (see report-identity-template.html). In our case this only means that the transformations took place.

The XML Conformance Test Suites have been a great help but they are not nasty enough to test the proper use of Regular Expressions used for parsing, when talking about far out XML files that probably don't even exist. I have made an additional test, "The Mother Of All Far Out XML Test Files", making it difficult to locate the top-element and the end of the DTD declaration, http://www.xmlplease.com/killer-test.xml.

All test files of the XML Conformance Test Suites and the output files created by the stylesheet have been compared in XMLSpy's "Compare directories" tool. No differences were found except for whitespace normalization in attribute values. In other tests, the only additional differences found have been in non-significant whitespace, like "</x >", and in the order of attributes in some situations.

<?xml version="1.0" encoding="UTF-8"?>
<!--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

THE BETTER IDENTITY TEMPLATE FOR XSLT 2.0

Author: Jesper Tverskov
www.xmlplease.com/identity-template.xsl

Version 2.1, 2008-01-19
Version 1.0, 2007-12-15
See end of stylesheet for details about versions.

Using saxon:parse() and saxon:serialize()

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
-->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:saxon="http://saxon.sf.net/" exclude-result-prefixes="#all">

<!-- Default-encoding parameter should only be changed if the correct encoding can not be detected in the XML declaration. -->
<xsl:param name="default-encoding" select="'utf-8'"/><!-- "'utf-8'" is default but is overruled by encoding in XML declaration. -->

<!-- The default for the "protect-character-entities-test" is "'yes'". Any other value means "'no'". -->
<xsl:param name="protect-character-entities-test" select="'yes'"/>

<!-- The solution makes use of 14 restricted "words". You get a warning if one of the restricted words is used in the input document or if the restricted words are not unique. You can change the restricted words below or pass them in as parameters when needed. Note that words must be double quoted or they are interpreted as XPath expressions. Use only a-z to play it safe. -->

<!-- protect-character-entities: If you use &#160; in input you also get it like that in output. Only an option, see above. -->
<xsl:param name="protect-character-entities" select="'qplvfvnhbmngfd'"/>

<!-- protect-entities: This is necessary because we hide the DTD where entities could have been declared. -->
<xsl:param name="protect-entities" select="'ahflasdkffstyor'"/>

<!-- cdata-start: Used when replacing CDATA sections with PI to protect them. -->
<xsl:param name="cdata-start" select="'bvzmntrewrbmn'"/>

<!-- cdata-end: Used when replacing CDATA sections with PI to protect them. -->
<xsl:param name="cdata-end" select="'urqpwoehreregfyuio'"/>

<!-- protect-whitespace: Used to protect leading whitespace in PIs in prolog -->
<xsl:param name="protect-whitespace" select="'rmergfgfhhghgh'"/>

<!-- hide-false-topelements: Before we can locate the start tag of the top-element we must hide what looks like top-elements in the prolog. -->
<xsl:param name="hide-false-topelements" select="'jfsagkytybrtttf'"/>

<!-- hide-xmldecl-as-pi: Necessary to prevent identity template from ignoring it. -->
<xsl:param name="hide-xmldecl-as-pi" select="'potyuitdsgfkadjsbmnl'"/>

<!-- hide-dtddecl-as-pi: Necessary to prevent identity template from ignoring dtd. -->
<xsl:param name="hide-dtddecl-as-pi" select="'mnmnmjjhjhghgh'"/>

<!-- gt-in-dtd: In order to find the last gt in a DTD, that is the end of the DTD, I find all the gt in the DTD first and mark them up as such. -->
<xsl:param name="gt-in-dtd" select="'lapqalqpapqlamalzmalza'"/>

<!-- end-of-dtd: When we have located the end of the DTD after a lot of hard work, we replace it with this parameter value so it is easier to find again. -->
<xsl:param name="end-of-dtd" select="'wquiiiiiiiivbvbbm'"/>

<!-- pi-as-comment: we need it when replacing PIs inside internal subsets of DTD hidden as PI. A PI cannot contain PI. -->
<xsl:param name="pi-as-comment" select="'wqibibibibuxuxxuym'"/>

<!-- questionmark-in-dtd: DTD is replaced with PI. A PI cannot contain "?>". We must replace the question mark in ENTITY declarations, etc. -->
<xsl:param name="questionmark-in-dtd" select="'poiipirwzxxzxz'"/>

<!-- end-of-topelement: When we have located the end tag of the top-element, we add a restricted word to it so it is easier to find again. -->
<xsl:param name="end-of-topelement" select="'abtrtrttretewrtertwertwetc'"/>

<!-- questionmark-in-cdata: When replacing CDATA section with PI, it is also a good idea to replace "?" inside CDATA sections because a PI can not contain "?>". -->
<xsl:param name="questionmark-in-cdata" select="'ayrqoeuwiryoewiuqygabc'"/>

<!-- Name of the top-element -->
<xsl:variable name="top-element" select="name(document(document-uri(.))/*)"/>

<!-- To get encoding right in last xsl:result-document. -->
<xsl:variable name="test_encoding" select="unparsed-text(document-uri(.))"/>
<xsl:variable name="xmldeclaration" select="if (starts-with($test_encoding, '&lt;?xml')) then substring-before($test_encoding, '?&gt;') else ''"/>
<xsl:variable name="encoding-test">
  <xsl:analyze-string select="replace($xmldeclaration, '\s', '')" regex="encoding=(&quot;|&apos;).*?(&quot;|&apos;)">
    <xsl:matching-substring>
      <xsl:analyze-string select="." regex="encoding=(&quot;|&apos;)|(&quot;|&apos;)">
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:variable>
<xsl:variable name="encoding" select="if ($encoding-test ne '') then $encoding-test else $default-encoding"/>

<!-- Load the input XML document as unparsed-text -->
<xsl:variable name="unparsed" select="unparsed-text(document-uri(.))"/>

<!-- test if top-element is empty. We remove everything in the prolog that could look like the top-element to make it easier to locate the top-element. -->
<xsl:variable name="top-tag">
  <xsl:analyze-string select="$unparsed" regex="&lt;\?.*?\?&gt;|&lt;!--.*?--&gt;" flags="s"><!-- See note [8]. -->
    <xsl:non-matching-substring>
      <xsl:analyze-string select="." regex="&lt;!ELEMENT.*?&gt;" flags="s">
        <xsl:non-matching-substring>
          <xsl:analyze-string select="." regex="&lt;!ATTLIST.*?&gt;" flags="s">
            <xsl:non-matching-substring>
              <xsl:analyze-string select="." regex="&lt;!NOTATION.*?&gt;" flags="s">
                <xsl:non-matching-substring>
                    <xsl:analyze-string select="." regex="&lt;!ENTITY.*?[&quot;|&apos;]\s*&gt;" flags="s">
                    <xsl:non-matching-substring>
                          <xsl:analyze-string select="." regex="&lt;!DOCTYPE.*?&gt;" flags="s">
                            <xsl:non-matching-substring>
                              <xsl:value-of select="."/>
                            </xsl:non-matching-substring>
                          </xsl:analyze-string>
                    </xsl:non-matching-substring>
                  </xsl:analyze-string>
                </xsl:non-matching-substring>
              </xsl:analyze-string>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>
<xsl:variable name="top-tag-2" select="concat('&lt;',$top-element, substring-before(substring-after($top-tag, concat('&lt;',$top-element)), '&gt;'), '&gt;')"/>
<xsl:variable name="is-top-tag-empty" select="if (contains($top-tag-2, '/&gt;')) then 'yes' else 'no'"/>

<!-- We test if the protected words for replacements exist in the input document. Will also work for supplied parameter values. We make use of this variable in the first template. -->
<xsl:variable name="restricted-word-test">
  <xsl:choose>
    <xsl:when test="contains($unparsed, $protect-whitespace)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$protect-whitespace"/>, for the "protect-whitespace" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $cdata-start)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$cdata-start"/>, for the "cdata-start" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $cdata-end)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$cdata-end"/>, for the "cdata-end" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $protect-entities)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$protect-entities"/>, for the "protect-entities" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $protect-character-entities)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$protect-character-entities"/>, for the "protect-character-entities" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $hide-false-topelements)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$hide-false-topelements"/>, for the "hide-false-topelements" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $hide-xmldecl-as-pi)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$hide-xmldecl-as-pi"/>, for the "hide-xmldecl-as-pi" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $hide-dtddecl-as-pi)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$hide-dtddecl-as-pi"/>, for the "hide-dtddecl-as-pi" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $questionmark-in-dtd)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$questionmark-in-dtd"/>, for the "questionmark-in-dtd" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $gt-in-dtd)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$gt-in-dtd"/>, for the "gt-in-dtd" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $pi-as-comment)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$pi-as-comment"/>, for the "pi-as-comment" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $end-of-dtd)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$end-of-dtd"/>, for the "end-of-dtd" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $end-of-topelement)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$end-of-topelement"/>, for the "end-of-topelement" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
    <xsl:when test="contains($unparsed, $questionmark-in-cdata)">
      <xsl:message terminate="yes">The protected word, <xsl:value-of select="$questionmark-in-cdata"/>, for the "questionmark-in-cdata" parameter was found in the input document. Change the parameter.</xsl:message>
    </xsl:when>
<!-- Here we test if the 14 restricted words used for replacements are unique. -->
    <xsl:when test="count(distinct-values(($protect-entities, $cdata-start, $cdata-end, $protect-whitespace, $protect-character-entities, $hide-false-topelements, $hide-xmldecl-as-pi, $hide-dtddecl-as-pi, $gt-in-dtd, $end-of-dtd, $pi-as-comment, $questionmark-in-dtd, $questionmark-in-cdata, $end-of-topelement))) ne 14">
      <xsl:message terminate="yes">Two or more of the protected words used for replacements are the same. They must be unique. </xsl:message>
    </xsl:when>
  </xsl:choose>
</xsl:variable>

<!-- We replace everything in the prolog that looks like the top-element. -->
<xsl:variable name="unparsed-clean">
  <xsl:analyze-string select="$unparsed" regex="&lt;\?.*?\?&gt;|&lt;!--.*?--&gt;" flags="s">
    <xsl:matching-substring>
      <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
<!-- Probably not necessary but we don't want to take chances -->
      <xsl:analyze-string select="." regex="&lt;!ELEMENT.*?&gt;" flags="s">
        <xsl:matching-substring>
          <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
<!-- Probably not necessary but we don't want to take chances -->
          <xsl:analyze-string select="." regex="&lt;!ATTLIST.*?&gt;" flags="s">
            <xsl:matching-substring>
              <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
<!-- Probably not necessary but we don't want to take chances -->
              <xsl:analyze-string select="." regex="&lt;!NOTATION.*?&gt;" flags="s">
                <xsl:matching-substring>
                  <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
<!-- Here we can have all sorts of markup fragments -->
                  <xsl:analyze-string select="." regex="&lt;!ENTITY.*?[&quot;|&apos;].*[&quot;|&apos;]\s*&gt;">
                    <xsl:matching-substring>
                          <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
                    </xsl:matching-substring>
                      <xsl:non-matching-substring>
                          <xsl:analyze-string select="." regex="&lt;!DOCTYPE.*?&gt;" flags="s">
                            <xsl:matching-substring>
                              <xsl:value-of select="replace(replace(., concat('&lt;', $top-element), concat('&lt;', $hide-false-topelements)), concat('&lt;/', $top-element), concat('&lt;/', $hide-false-topelements))"/>
                            </xsl:matching-substring>
                            <xsl:non-matching-substring>
                              <xsl:value-of select="."/>
                            </xsl:non-matching-substring>
                          </xsl:analyze-string>
                      </xsl:non-matching-substring>
                  </xsl:analyze-string>
                </xsl:non-matching-substring>
              </xsl:analyze-string>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<!-- Prolog before top-element -->
<xsl:variable name="unparsed-before" select="substring-before($unparsed-clean, concat('&lt;', $top-element))"/>

<!-- Top-element including the part of the prolog under the top-element -->
<xsl:variable name="unparsed-after" select="concat(concat('&lt;', $top-element), substring-after($unparsed-clean, concat('&lt;', $top-element)))"/>
<xsl:variable name="unparsed-after-2">
<!-- We protect CDATA sections replacing them with PI -->
  <xsl:analyze-string select="$unparsed-after" regex="&lt;!\[CDATA\[.*?\]\]&gt;" flags="s">
    <xsl:matching-substring>
      <xsl:analyze-string select="." regex="&lt;!\[CDATA\[">
        <xsl:matching-substring>
          <xsl:value-of select="concat('&lt;?', $cdata-start, $protect-whitespace)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:analyze-string select="." regex="\]\]&gt;">
            <xsl:matching-substring>
              <xsl:value-of select="concat($cdata-end, '?&gt;')"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
<!-- We protect "?" by replacing them with a restricted word. -->
              <xsl:value-of select="replace(., '\?', $questionmark-in-cdata)"/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<!-- Here we find the end of the top-element and add a word to make it easier to find it again. -->
<xsl:variable name="unparsed2seq" select="tokenize($unparsed-after-2, concat('&lt;/', $top-element))"/>
<xsl:variable name="seq2unparsed-except-last">
  <xsl:for-each select="$unparsed2seq[position() ne last()]">
    <xsl:choose>
      <xsl:when test="position() eq last()">
        <xsl:value-of select="concat(., concat($end-of-topelement, '&lt;/', $top-element))"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="concat(., concat('&lt;/', $top-element))"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</xsl:variable>
<xsl:variable name="seq2unparsed" select="concat($seq2unparsed-except-last, $unparsed2seq[position() eq last()])"/>
<!-- Prolog under the top-element -->
<xsl:variable name="unparsed-last">
  <xsl:choose>
<!-- This is necessary for the very rare case of the top-element looking e.g. like this <x ... /> -->
    <xsl:when test="$is-top-tag-empty eq 'yes'">
      <xsl:value-of select="concat('&lt;prolog-after&gt;', substring-after($unparsed-after, '/&gt;'), '&lt;/prolog-after&gt;')"/>
    </xsl:when>
    <xsl:otherwise>
<!-- The regex makes sure that things come out right if the top-element's end-tag looks like this: </x >. We don't care if insignificant whitespace is deleted. -->
      <xsl:analyze-string select="$seq2unparsed" regex="(^(.*{$end-of-topelement}&lt;/{$top-element}\s*&gt;))" flags="s">
        <xsl:non-matching-substring>
          <xsl:value-of select="concat('&lt;prolog-after&gt;', ., '&lt;/prolog-after&gt;')"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- The top-element only -->
<xsl:variable name="unparsed-between">
  <xsl:choose>
<!-- This is necessary if the top-element looks like this <x ... /> -->
    <xsl:when test="$is-top-tag-empty eq 'yes'">
      <xsl:value-of select="$top-tag-2"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="concat(substring-before($seq2unparsed, $end-of-topelement), concat('&lt;/', $top-element, '&gt;'))"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>

<!-- replace XML declaration with PI -->
<xsl:variable name="xml2pi" select="if (starts-with($unparsed-before, '&lt;?xml')) then concat(replace(substring-before($unparsed-before, '?&gt;'), '&lt;\?xml', concat('&lt;?', $hide-xmldecl-as-pi)), '?&gt;', substring-after($unparsed-before, '?&gt;')) else $unparsed-before"/>

<!-- Replace PI inside DOCTYPE with comment and replace DOCTYPE declaration with PI -->
<xsl:variable name="doctype2pi">
<!-- The regex finds the XML declaration hidden as a PI. The non-matching substring is the rest of the prolog before the top-element: comments, PIs, the DTD, and the whitespace between them. -->
  <xsl:analyze-string select="$xml2pi" regex="&lt;\?{$hide-xmldecl-as-pi}.*?\?&gt;" flags="s">
    <xsl:matching-substring>
      <xsl:value-of select="."/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
<!-- Finds any comment and PI in the prolog and changes PIs to comments that we can find again using a restricted word. -->
      <xsl:analyze-string select="." regex="&lt;\?.*?\?&gt;|&lt;!--.*?--&gt;" flags="s">
        <xsl:matching-substring>
<!-- "\i\c*" is probably overkill, we know already that the input document is well-formed. -->
          <xsl:analyze-string select="." regex="&lt;\?\i\c*">
            <xsl:matching-substring>
              <xsl:value-of select="concat(replace(., '&lt;\?', concat('&lt;!--', $pi-as-comment)), ' ', $protect-whitespace)"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:value-of select="replace(., '\?&gt;', concat($pi-as-comment, '--&gt;'))"/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
<!-- We find false DOCTYPE inside ENTITY declarations -->
          <xsl:analyze-string select="." regex="[&apos;|&quot;].*&lt;!DOCTYPE" flags="s">
            <xsl:matching-substring>
              <xsl:value-of select="."/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
<!-- We find the beginning of a real DOCTYPE declaration -->
              <xsl:analyze-string select="." regex="&lt;!DOCTYPE">
                <xsl:matching-substring>
<!-- We change DOCTYPE declaration to a PI. We don't care about the end of the declaration here. -->
                  <xsl:value-of select="concat('&lt;?', $hide-dtddecl-as-pi)"/>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
<!-- We replace ? with a restricted word to prevent trouble. The DOCTYPE disguised as a PI must not contain "?>". We also replace all ">" inside the DOCTYPE declaration with a restricted word. -->
                  <xsl:analyze-string select="." regex="\?|&gt;" flags="s">
                    <xsl:matching-substring>
                      <xsl:value-of select="replace(replace(., '\?', $questionmark-in-dtd), '&gt;', $gt-in-dtd)"/>
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                      <xsl:value-of select="."/>
                    </xsl:non-matching-substring>
                  </xsl:analyze-string>
                </xsl:non-matching-substring>
              </xsl:analyze-string>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<!-- Create prolog-before element -->
<xsl:variable name="prolog-before">
  <xsl:value-of select="concat('&lt;prolog-before&gt;', $doctype2pi, '&lt;/prolog-before&gt;')"/>
</xsl:variable>

<!-- Concat the substrings back to full document. -->
<xsl:variable name="new-unparsed" select="concat($prolog-before, $unparsed-between, $unparsed-last)"/>

<xsl:variable name="new-unparsed-2">
<!-- $protect-whitespace is inserted before leading whitespace in all PIs to make things easy. In Saxon we only need to do it in prolog. -->
  <xsl:analyze-string select="$new-unparsed" regex="&lt;\?.*?\?&gt;" flags="s">
    <xsl:matching-substring>
<!-- "\i\c*" is probably overkill, we know already that the input document is well-formed. -->
      <xsl:analyze-string select="." regex="&lt;\?\i\c*">
        <xsl:matching-substring>
          <xsl:value-of select="concat(., ' ', $protect-whitespace)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<!-- create new top-element: document -->
<xsl:variable name="doc-text">
<!-- Here we find all named entities -->
  <xsl:analyze-string select="$new-unparsed-2" regex="&amp;[^#]">
    <xsl:matching-substring>
<!-- We protect entities declared in a DTD -->
      <xsl:value-of select="replace(., '&amp;', $protect-entities)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
<!-- Here we deal with character entities. Protecting them is optional. -->
      <xsl:value-of select="if ($protect-character-entities-test eq 'yes') then replace(., '&amp;#', $protect-character-entities) else ."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>
<xsl:variable name="doc-text-2" select="concat('&lt;document&gt;', $doc-text, '&lt;/document&gt;')"/>

<!-- When the loaded unparsed-text is serialized at the very end the character-map turns it back into XML -->
<xsl:character-map name="a">
  <xsl:output-character character="&gt;" string="&gt;"/>
  <xsl:output-character character="&lt;" string="&lt;"/>
  <xsl:output-character character="&amp;" string="&amp;"/>
  <xsl:output-character character="&#34;" string="&#34;"/>
  <xsl:output-character character="&#38;" string="&#38;"/><!-- I am not sure that this one is really needed. &#38; is the same character as &amp;, so it probably repeats the mapping above. -->
</xsl:character-map>

<xsl:output method="xml" omit-xml-declaration="yes" use-character-maps="a"/>
<!-- This first template is mostly used to prepare for the identity template below. -->
<xsl:template match="/">
<!-- If true the xsl:message inside the variable terminates the transformation. -->
  <xsl:if test="$restricted-word-test ne ''">
    <xsl:value-of select="$restricted-word-test"/>
  </xsl:if>

<!-- In order to find the end of the DTD we have earlier replaced all &gt; with $gt-in-dtd. We know that the last one is the right one. I have chosen to split the document using $gt-in-dtd as splitter. I then assemble the sequences again and replace $gt-in-dtd with &gt;, except for the last one, where I add a restricted word so it is easy to locate the end of the DTD later. -->
  <xsl:variable name="doc2seq" select="tokenize($doc-text-2, $gt-in-dtd)"/>
  <xsl:variable name="seq2doc-except-last">
    <xsl:for-each select="$doc2seq[position() ne last()]">
      <xsl:choose>
        <xsl:when test="position() eq last()">
          <xsl:value-of select="concat(., $end-of-dtd, '?&gt;')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat(., '&gt;')"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:variable>
  <xsl:variable name="seq2doc" select="concat($seq2doc-except-last, $doc2seq[position() eq last()])"/>

  <xsl:variable name="that-is-it">
    <xsl:analyze-string select="$seq2doc" regex="&lt;\?{$hide-dtddecl-as-pi}.*?{$end-of-dtd}\?&gt;" flags="s">
      <xsl:matching-substring>
<!-- This is necessary or we get an extra space at the beginning of content in PIs in DTD! -->
        <xsl:analyze-string select="." regex="&lt;!--{$pi-as-comment}\w+\s">
          <xsl:matching-substring>
            <xsl:value-of select="replace(., '\s', '')"/>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="replace(., $end-of-dtd, '')"/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="replace(replace(replace(., $end-of-topelement, ''), concat('&lt;!--', $pi-as-comment), '&lt;?'), concat($pi-as-comment, '--&gt;'), '?&gt;')"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:variable>

<!-- The transformation starts here and is stored in the variable "parsed". -->
  <xsl:variable name="parsed">
    <xsl:apply-templates mode="identity" select="saxon:parse(replace($that-is-it, $hide-false-topelements, $top-element))"/>
  </xsl:variable>

<!-- This result document can be used for testing content of variables, etc. -->
  <xsl:result-document href="identity-template-test.xml" use-when="false()">
    <xsl:value-of select="$that-is-it"/>
  </xsl:result-document>

<!-- Finally the result of the transformation contained in the variable "parsed" is serialised using a lot of replaces but also xsl:character-map. Note that the xsl:result-document is only used to get encoding right. -->
  <xsl:result-document encoding="{$encoding}">

<!--  1. Question marks in DTD are replaced back.
  2. PI disguised as comment start is replaced back.
  3. PI disguised as comment end is replaced back.
  4. Question marks in CDATA are replaced back.
  5. ]]&gt; disguised as word is replaced back. [9]
  6. &lt;![CDATA[ disguised as word is replaced back.
  7. &amp; disguised as word is replaced back.
  8. &amp;# disguised as word is replaced back.
  9. $protect-whitespace in PI is deleted. [10]
-->

    <xsl:value-of select="replace(replace(replace(replace(replace(replace(replace(replace(replace(saxon:serialize($parsed, ''), concat('\s?', $protect-whitespace), ''), concat('&lt;\?', $cdata-start), '&lt;![CDATA['), concat($cdata-end, '\s?\??&gt;'), ']]&gt;'), $protect-character-entities, '&amp;#'), $protect-entities, '&amp;'), concat('&lt;!--', $pi-as-comment), '&lt;?'), concat($pi-as-comment, '--&gt;'), '?&gt;'), $questionmark-in-cdata, '?'), $questionmark-in-dtd, '?')"/> [11]
  </xsl:result-document>
</xsl:template>

<!-- Traditional identity template -->
<xsl:template match="@*|node()" mode="identity">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" mode="identity"/>
  </xsl:copy>
</xsl:template>

<!-- XML declaration -->
<xsl:template match="processing-instruction()[name() eq $hide-xmldecl-as-pi]" mode="identity">
  <xsl:text>&lt;?xml</xsl:text>
<!-- Add code to manipulate xml declaration -->
  <xsl:value-of select="."/>
  <xsl:text>?&gt;</xsl:text>
</xsl:template>

<!-- DOCTYPE declaration -->
<xsl:template match="processing-instruction()[name() eq $hide-dtddecl-as-pi]" mode="identity">
  <xsl:text>&lt;!DOCTYPE </xsl:text>
<!-- Add code to manipulate DOCTYPE -->
  <xsl:value-of select="."/>
  <xsl:text>&gt;</xsl:text>
</xsl:template>

<!-- We can add all the templates we want to overrule identity copying for a specific node. Remember to use mode="identity". You must use the proper replacement "words", see above, if you want to get to things like entities in the additional templates. -->

<!-- Delete temporary document element except content -->
<xsl:template match="/document" mode="identity">
  <xsl:apply-templates mode="identity"/>
</xsl:template>

<!-- Delete temporary prolog-before element except content -->
<xsl:template match="/document/prolog-before" mode="identity">
  <xsl:apply-templates mode="identity"/>
</xsl:template>

<!-- Delete temporary prolog-after element except content -->
<xsl:template match="/document/prolog-after" mode="identity">
  <xsl:apply-templates mode="identity"/>
</xsl:template>
</xsl:stylesheet>

<!--

ABOUT VERSIONS
* Version 2.1, 2008-01-19
The stylesheet now also handles my http://www.xmlplease.com/killer-test.xml, a far-out mess of CDATA sections, PIs, comments and DTD declarations with internal subsets nested into one another. In version 2.0, one regular expression used in three different places did not work as expected, but this problem is now also history.

* Version 2.0, 2008-01-14
Changed a few details, making it possible to delete one template and making the solution simpler. A few corrections of misspelled words in comments. All templates except the first now have mode="identity". All earlier versions used mode="last".

* Version 1.3, 2008-01-12
Fixed a bug about CDATA sections declared as entity in internal subset of DTD.

* Version 1.2, 2008-01-12
Has been tested against all valid and invalid test files in the XML Conformance Test Suites (2 files were excluded for not being well-formed in Xerces). All files are transformed. This does not mean that the results of the transformations are as we want them to be. Input and output files have been compared with the "Compare directories" tool in XMLSpy, showing that the files are exactly the same except for whitespace normalization of attribute values (a handful of files). In other tests the only additional differences found have been in non-significant whitespace and, in some situations, in the order of attributes.

* Version 1.1, 2007-12-19
Could probably do a good job more than 99% of the time, but it had grave problems with extreme DTD subsets containing PIs, etc., and a bug broke the handling of CDATA sections.

* Version 1.0, 2007-12-15
Could probably do a good job more than 99% of the time for most users, but entities declared in the DTD were not handled.

-->

Footnotes

[1]

The third edition of the article, published 2008-01-19, only updates the stylesheet and adds links to tests. The second edition, published 2008-01-14, was rewritten, inspired by discussions on the XSL mailing list. The first edition was published 2007-12-15.

[2]

In our solution for a better identity template we need to hide the XML declaration and the DTD declaration as PIs. To get linefeeds and indentation right when we replace them back, our solution must overcome this "small" problem.

[3]

Using the identity template with AltovaXML removes all linefeeds and indentation, and even important whitespace-only text nodes between markup in mixed content. Altova should provide a way to turn off the stripping of whitespace-only text nodes.

[4]

When XSLT 2.0 and XPath 2.0 became standards (recommendations) in late January 2007, Microsoft announced that they would make an XSLT 2.0 processor of their own, http://blogs.msdn.com/xmlteam/archive/2007/01/29/xslt-2-0.aspx. This is probably no longer the case as can be seen here: http://blogs.msdn.com/xmlteam/archive/2007/11/16/chris-lovett-interview.aspx:

"As for XSLT 2.0 - we’ve heard from customers and understand the improvements in XSLT 2.0 over XSLT 1.0, but right now we’re in the middle of a big strategic investment in LINQ and EDM for the future of the data programming platform which we think will create major improvements in programming against all types of data. But we are always re-evaluating our technology investments so if your readers want to ramp up their volume on XSLT 2.0 please ask them to drop us a line with their comments."

[5]

In XSLT 2.0 we got many new functions, like distinct-values(), max() and min(), that are relatively easy to make ourselves even in XSLT 1.0.
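
As a rough illustration of that point, a maximum can be found in XSLT 1.0 by sorting the nodes as numbers and keeping only the first one. This is only a sketch: the template name max-of and the input element names (numbers, n) are hypothetical and not taken from the stylesheet above.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical input: <numbers><n>3</n><n>17</n><n>5</n></numbers> -->
  <xsl:template match="/">
    <xsl:call-template name="max-of">
      <xsl:with-param name="nodes" select="/numbers/n"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Sort descending as numbers and output only the first item. -->
  <xsl:template name="max-of">
    <xsl:param name="nodes"/>
    <xsl:for-each select="$nodes">
      <xsl:sort select="." data-type="number" order="descending"/>
      <xsl:if test="position() = 1">
        <xsl:value-of select="."/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input in the comment this outputs 17. In XSLT 2.0 the whole named template collapses to max(/numbers/n).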

[6]

Michael Kay in http://www.biglist.com/lists/xsl-list/archives/200712/msg00315.html. This is how it is formulated in the XSLT 2.0 Recommendation at the end of section 19.1: "It is a recoverable dynamic error for a stylesheet to write to an external resource and read from the same resource during a single transformation, …"

[7]

It has been a great experience to use the "Extensible Markup Language (XML) Conformance Test Suites". When I started using it, a handful of files would not transform. Most of the files are all too easy, even irrelevant, but some of them are very relevant and tricky. It is not easy to locate the top-element with regular expressions in this one, p43pass1.xml, without knowing the file.

[8]

Note the "s" flag to put the "." into dot-all mode also matching newline characters. Also note the question mark after the asterisk meaning "as soon as you get a match that is it and go to the next match". If we don't use the "?" we end up finding the beginning of the first comment and the end of the last comment, et cetera.
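
The effect of the "s" flag and the reluctant "?" can be tried in isolation with a small stylesheet like the following. It is a minimal sketch, not part of the stylesheet above: the sample string and the template name main are assumptions, and the stylesheet is meant to be run with main as the initial template.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample: two comments with an element between them;
       the first comment contains a newline. -->
  <xsl:variable name="sample"
    select="'&lt;!--a&#10;b--&gt;&lt;x/&gt;&lt;!--c--&gt;'"/>

  <xsl:template name="main">
    <!-- ".*?" stops at the first comment end; flags="s" lets "." match the newline. -->
    <xsl:analyze-string select="$sample" regex="&lt;!--.*?--&gt;" flags="s">
      <xsl:matching-substring>
        <!-- Each match is wrapped in brackets to make the boundaries visible. -->
        <xsl:value-of select="concat('[', ., ']')"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

With the regex as written, each of the two comments is a separate match. Changing ".*?" to ".*" gives a single match that runs from the start of the first comment to the end of the last one, swallowing the element in between. Removing flags="s" means the first comment is not matched at all, because "." then stops at the newline.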

[9]

"]]>" can both be the end of a CDATA section or some markup that must not be inside a CDATA section. That is why we use "\s?\?&gt;" inside replace(concat($cdata-end, '\s?\?&gt;'), ']]&gt;').

[10]

In PIs there must be a space between name and content. But we have also replaced CDATA sections with PIs, and in CDATA sections there is no space between CDATA section delimiter and content. That is why we need "\s?" in replace(concat('\s?', $protect-whitespace), '').
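
The following sketch shows why the optional "\s?" is enough for both cases. The restricted words PROTECTWS and CDATASTART are hypothetical stand-ins for the values the real stylesheet defines, and the serialized string is simplified: it holds one real PI, where a space precedes the word, and one CDATA section disguised as a PI, where it does not.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical restricted word; the real stylesheet defines its own. -->
  <xsl:variable name="protect-whitespace" select="'PROTECTWS'"/>

  <!-- A real PI (space before the word) followed by a CDATA section
       disguised as a PI (no space before the word). -->
  <xsl:variable name="serialized"
    select="'&lt;?pi PROTECTWS  data?&gt;&lt;?CDATASTARTPROTECTWStext?&gt;'"/>

  <xsl:template name="main">
    <!-- Delete the word together with the single space in front of it, if any. -->
    <xsl:value-of select="replace($serialized, concat('\s?', $protect-whitespace), '')"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with main as the initial template, this outputs `<?pi  data?><?CDATASTARTtext?>`. The two original spaces before "data" survive, and nothing but the word itself is removed from the disguised CDATA section.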

[11]

The function saxon:serialize() must have two arguments, the second being the name of the xsl:output element. Using the zero-length string as name makes it choose the xsl:output element not having a name attribute.
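
A minimal sketch of the difference between the named and the unnamed xsl:output element follows. It assumes a Saxon version that provides saxon:parse() and saxon:serialize() in the form used in this article, and it relies on the behaviour described in this footnote for the zero-length string. The output name "indented" and the element names in the result are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Unnamed output definition: the one chosen by saxon:serialize($tree, ''). -->
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Named output definition (hypothetical name): chosen by saxon:serialize($tree, 'indented'). -->
  <xsl:output name="indented" method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template name="main">
    <!-- Build a small tree from a string, then serialize it twice. -->
    <xsl:variable name="tree" select="saxon:parse('&lt;a&gt;&lt;b/&gt;&lt;/a&gt;')"/>
    <result>
      <unnamed><xsl:value-of select="saxon:serialize($tree, '')"/></unnamed>
      <named><xsl:value-of select="saxon:serialize($tree, 'indented')"/></named>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

Because no character map is used here, the markup inside the two strings is escaped in the final output. In the stylesheet of this article the character map "a" on the unnamed xsl:output is what turns the serialized string back into real markup.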

Updated 2009-08-06