Jesper Tverskov, September 27, 2012 [1]

Benefits of polyglot XHTML5

HTML5 has so many syntax options that at least some web designers and developers prefer to use a consistent subset. In HTML5 almost anything from both old XHTML and HTML is allowed in the same valid document. With no self-imposed restrictions, HTML5 markup tends to attract dirt, and dirt attracts more dirt. In this tutorial we don't look at all the nice new features in HTML5. We focus on what basic subset to use.

1. Quick and dirty HTML5

HTML5 as it is, with no additional restrictions on top of the HTML5 schema, is the "quick and most often dirty" choice. HTML parsing is available in any browser. Markup is not case-sensitive, and both mal-formed and well-formed markup are allowed. No errors are big enough to stop browsers from trying to show an HTML webpage. Since the mimetype is "text/html" by default, you don't have to think about it. Several other tempting shortcuts are allowed in HTML5, e.g.:

  1. No need to declare namespaces.
  2. No need to quote attribute values if they don't contain spaces.
  3. No need to use end tags for certain elements.

The downside even of valid HTML5 is that, since so many syntax options are allowed, it is difficult to keep all sorts of syntax schemes from ending up in your pages. Even when valid, they can quickly become inconsistent and confusing. Such pages tend to become inefficient to make, maintain and reuse.

1.1 Too much is allowed

The following example shows how many alternatives you have in HTML5 when making a boolean attribute like "checked", "disabled", etc. In XHTML only the first two alternatives are allowed. In XHTML5 (HTML5 parsed with an XML parser), the last two versions are, unfortunately, also allowed. [2]

checked="checked"

checked='checked'

checked=checked

checked

checked=""

checked=''
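
Which of the six forms an XML parser will accept can be probed with a few lines of code. The following Python sketch is not part of the original article: it wraps each form in a hypothetical "input" element and hands it to the XML parser in Python's standard library. The names `is_well_formed` and `xml_acceptance` are assumptions made for illustration, and a well-formedness check says nothing about validity, only about what an XML parser tolerates.

```python
import xml.etree.ElementTree as ET

# The six ways of writing a boolean attribute listed above.
FORMS = [
    'checked="checked"',
    "checked='checked'",
    "checked=checked",
    "checked",
    'checked=""',
    "checked=''",
]


def is_well_formed(fragment: str) -> bool:
    """Return True if the XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def xml_acceptance(forms=FORMS) -> dict:
    """Map each attribute form to whether it survives XML parsing."""
    return {
        form: is_well_formed(f'<input type="checkbox" {form}/>')
        for form in forms
    }


if __name__ == "__main__":
    for form, accepted in xml_acceptance().items():
        print(f"{form:20} {'well-formed' if accepted else 'rejected'}")
```

Under these assumptions the four quoted forms should pass, while the unquoted `checked=checked` and the bare `checked` should be rejected, because XML requires every attribute to have a quoted value.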

The following markup is confusing but valid HTML5. There is no case sensitivity. Single tag elements like "br" and "img" can be terminated with a slash, but they don't have to be. Elements like "p" and "li" don't need to be closed with end-tags. You can mix all the alternatives as you please in the same document.

  <!DoCtYpe hTml>
  <HTml>
     <head>
        <tITle>Not tidy but valid!</title>
     </HEAD>
     <boDY>
        <p>Here you go.
        <P><bR/><BR>Single tag elements can have a terminating slash.</p>
        <ol>
           <LI class=green>Butter
           <li CLASs="red">Bread</LI>
           <lI claSS=blue>Milk
        </OL>
     </BodY>
  </htML>
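
The case-insensitivity of HTML parsing can be observed with the tokenizer in Python's standard library. This is a minimal sketch, not a browser-grade HTML5 parser: `html.parser.HTMLParser` only reports tags and does not build a DOM or supply omitted end tags. The class name `TagCollector`, the helper `collect` and the shortened sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags and attributes as the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already folded to lower case.
        self.tags.append(tag)
        self.attributes.extend(attrs)


def collect(markup: str) -> TagCollector:
    """Feed markup to a fresh collector and return it."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector


if __name__ == "__main__":
    sample = '<P><bR/><BR>Text</p><LI class=green>Butter<li CLASs="red">Bread</LI>'
    result = collect(sample)
    print(result.tags)
    print(result.attributes)
```

However the names are cased in the source, the tokenizer hands them over in lower case, and the quoted and unquoted attribute values come out the same way. An XML parser would make no such allowances.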

Even consistent and nice mal-formed HTML5 can be difficult to read. Most of us will have to make the following markup well-formed in our heads before we understand it. The example is from the HTML5 spec, 2011-09-10:

  <table>
     <thead>
        <tr><th>ID<th>Measurement<th>Average<th>Maximum
     <tbody>
        <tr><td><th scope=rowgroup>Cats<td><td>
        <tr><td>93<th scope=row>Legs<td>3.5<td>4
        <tr><td>10<th scope=row>Tails<td>1<td>1
     <tbody>
        <tr><td><th scope=rowgroup>English speakers<td><td>
        <tr><td>32<th scope=row>Legs<td>2.67<td>4
        <tr><td>35<th scope=row>Tails<td>0.33<td>1
  </table>

The situation worsens dramatically in HTML5 if you don't validate your webpages. Your HTML5 is likely to become so inconsistent and confusing that you never really know how heavily you rely on the browser's ability to fix problems. That is not a nice foundation for quality webpages. The day you do need help, it is difficult to find where the problem is. Errors are no longer a question of right or wrong markup, but of how well browsers can fix markup errors.

1.2 Nice to be well-formed

Another downside of HTML5 is that your markup is allowed to be well-formed but will most likely end up mal-formed, because browsers don't care; they just fix it for you. The DOM is based on well-formed markup. Webpages that follow the rules of XML are nice because they will be well-formed (in fact, they must be), and can be handled with XML tools.

You can easily do without XML in web design and development, but a good knowledge of XML is a must for any serious work in many areas of software development: publishing, storage, transport layer, web services, configuration files, office file formats. If you don't use XML for webpages, you simply miss a chance to learn XML.

2. The benefits of XHTML

XHTML5 is the sibling of HTML5. "HTML5" is the common brand name for both. XHTML5 has all the strictness and consistency HTML5 is lacking, but it must be served with the "application/xhtml+xml" mimetype, which is not supported by old browsers including IE8. [3] You must also tell the webserver to serve the XHTML as XML. If you use XHTML5, you introduce additional complexity to your project because you still need to serve HTML5 to the few browsers not understanding "application/xhtml+xml".

2.1 Well-formedness

If you like the idea of quality webpages, you must validate them to see if the markup is used exactly as specified by the standard. But the browsers don't care if markup is valid or not, making it easy to forget to validate. That is why it is such a benefit that XHTML, being XML, must live up to the rules of well-formedness. This level of restrictions is actually checked also by the browsers. If they find a well-formedness error, they must show an error message. [4]

Draconian error handling, showing an error message if a web page is not well-formed, is great if you like the idea of a quality webpage. With the "application/xhtml+xml" mimetype, well-formedness is not optional like validation but a must. Well-formedness is a good first step toward quality markup: it makes additional quality steps and levels of restriction, such as validation and accessibility, much easier and more natural to implement.
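
Any conforming XML parser behaves in the same Draconian way, so the effect can be reproduced outside the browser. The following Python sketch is an illustration added here, not the author's tooling: it uses the standard library's XML parser and reports where parsing stopped. The function name `first_wellformedness_error` is hypothetical.

```python
import xml.etree.ElementTree as ET


def first_wellformedness_error(markup: str):
    """Return (line, column) of the first well-formedness error, or None.

    The XML parser stops at the first error it meets, which is the
    Draconian behaviour described above.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return error.position
    return None


if __name__ == "__main__":
    print(first_wellformedness_error("<p>one<br/>two</p>"))  # well-formed
    print(first_wellformedness_error("<p>one<br>two</p>"))   # unclosed br
```

Run against your own files before publishing, a check like this catches the errors that would otherwise surface as a browser error message.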

2.2 "application/xhtml+xml" spells quality

We don't know much about a webpage served with mimetype "text/html". It can be valid; it can be utter junk. We need to inspect the source code and use our eyes and judgement to find out.

If a webpage is served with "application/xhtml+xml", we know without having to inspect the file, that it is well-formed, that it uses tidy and consistent markup. Since the use of the XML mime-type takes some extra effort and consideration, the mimetype is also a strong signal that the webpage is very likely also valid.

The XML mimetype is also a signal, that if the webpage is updated one day or next minute, you are not likely to encounter major surprises in the source code.

It is my guess that search engines one day in the future will give higher priority to web pages served with mimetype "application/xhtml+xml". All else being more or less equal, such a webpage tends to be of higher quality, even in its content, than a quick and dirty HTML page or an equally nice web page that does not risk Draconian error handling.

3. Bad arguments

Before HTML5, and before Microsoft finally started to support the "application/xhtml+xml" mimetype with IE9, very few webpages were actually served with that mimetype. Like an echo from the good old days, we can still hear some mimetype arguments that are no longer relevant. Some of them never were.

3.1 Incremental rendering of XML

A few years ago it was a disadvantage that XML rendering of webpages was not incremental. It could mean that the user had to wait for the first screen to be displayed until the full page was loaded. This is no longer true. All modern browsers, IE9+, Safari, Firefox, Opera, Chrome, use incremental rendering of XHTML. No matter how big a webpage is, both XHTML5 and HTML5 show the first screen almost instantly.

3.2 Speed differences in parsing

Another old argument was that XML rendering had to be faster than HTML because XML is less complicated, with much stricter rendering rules. HTML rendering ought to be slower because the browser must also be able to fix pages, to repair the markup. On the other hand, an HTML page can be made shorter than an XHTML page because you don't need to use end tags for certain elements like "p", etc.

The two arguments more or less cancel one another, and if speed differences remain, they are so small as to be of little practical interest, because both HTML5 and XHTML5 use incremental rendering, showing the first screen almost instantly. If the rest of the page loads a fraction of a fraction of a second later for HTML5 than for XHTML5, or the other way round, most of us can live with it.

3.3 Integration of new applications

Before HTML5, it was often argued that one of the benefits of XHTML served as XML is that it one day could open up for easy integration of SVG, MathML and other new web applications that might pop up in the future. XML has a solid but unpopular namespace mechanism to make markup from different XML applications co-exist in the same document.

To make things easy, HTML5 has simply integrated SVG and MathML into HTML. You don't even need to declare the namespaces for them; they must be known by the HTML parser! On the surface, HTML5's way of integrating new web applications is not pretty, but it works. The method is basically the same as in XML except that the namespaces can be declared both implicitly and explicitly. [5]

Since HTML parsers, being for a specific markup language, have a tradition of developing over time as new markup was introduced by the browsers or by W3C, it is not a big deal that a new web application requires changes to the HTML parser to be integrated. An XML parser, on the other hand, does not know the markup or top-elements of individual applications, only the rules of XML as a meta markup language for making markup languages.

4. XHTML as a subset of HTML5

When you want to limit your use of the HTML5 spec that allows too much, it is nice to come up with a subset that is not just your own. You want a sensible subset, ideally a subset that more or less defines itself. We want a subset that could be the natural choice of a community of web developers. Luckily there is a natural subset of HTML5 that restricts it sufficiently with such authority that we don't need to discuss or decide anything. That subset is called XHTML.

But watch out! XHTML is at least four completely different things:

  1. The original XHTML 1.0 and 1.1 standards.
  2. XHTML as a subset of HTML5.
  3. XHTML5 as a sibling to HTML5.
  4. Polyglot XHTML

Since we want to use all the new features of HTML5, the original XHTML 1.0 and XHTML 1.1 standards are no longer relevant for new webpages. If we want to use XHTML as part of the HTML5 brand, we have the following options.

  1. We can use the XHTML subset of HTML5. Since HTML5 uses the XHTML namespace, all we need to do is to declare it explicitly. For the rest we just use the syntax rules we know from the original XHTML specs: the markup must be well-formed, and we use lower-case element and attribute names. [6]

  2. For browsers understanding the "application/xhtml+xml" mimetype, we can use XHTML5 but we need to know how to set up that mimetype with server-side scripting. We also need to test if browsers understand that mimetype and we need a switch in our server-side scripting to choose between our two sets of webpages. One set for old browsers to be served with "text/html" to get HTML parsing and one set for new browsers to be served with "application/xhtml+xml" to get XML parsing.

  3. That is why polyglot XHTML5 is such a tempting solution. A polyglot XHTML5 webpage is one and the same webpage that can be served with "text/html" and with "application/xhtml+xml" without any changes except the mimetype. Also, to be polyglot, the DOM created by the HTML parser and the XML parser must be exactly the same to make sure that JavaScripts work the same.
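
The requirement that both parsers produce the same DOM can be approximated, very roughly, by comparing the sequence of element names each parser reports. The sketch below is an assumption-laden illustration, not a polyglot checker: it uses the tokenizer and the XML parser from Python's standard library, ignores attributes and text nodes, and cannot detect elements that a browser's HTML parser would insert on its own (such as an implied "tbody"). All function and class names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTags(HTMLParser):
    """Collect element names in the order the HTML tokenizer meets them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_tag_sequence(document: str) -> list:
    parser = StartTags()
    parser.feed(document)
    parser.close()
    return parser.names


def xml_tag_sequence(document: str) -> list:
    root = ET.fromstring(document)  # raises ParseError if not well-formed
    # Strip the "{namespace}" part that ElementTree puts before each name.
    return [element.tag.split("}")[-1] for element in root.iter()]


def same_element_sequence(document: str) -> bool:
    """True if both parsers report the same element names in the same order."""
    try:
        return html_tag_sequence(document) == xml_tag_sequence(document)
    except ET.ParseError:
        return False


PAGE = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    '<head><meta charset="UTF-8"/><title>Test</title></head>'
    '<body><p>Hello<br/>world</p></body></html>'
)

if __name__ == "__main__":
    print(same_element_sequence(PAGE))
    print(same_element_sequence("<P>upper case</P>"))
```

A document that passes this comparison is not thereby polyglot, but one that fails it certainly is not.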

Another nice thing about polyglot XHTML5 is that it has authority to it. It is specified in a little W3C spec, Polyglot Markup: HTML-Compatible XHTML Documents. Also, if you for some reason decide that it is too much to set up server-side scripting and to switch back and forth between the two mimetypes, or if you are too faint-hearted to use Draconian error handling, you can just use "text/html" also for browsers understanding "application/xhtml+xml".

5. Validation

Not most, not a majority, not even many, but at least some web designers and developers like the idea of valid markup. They validate their HTML and XHTML web pages. They believe that at least trying to use a markup language as intended is the best way of learning it, and that valid webpages are the best foundation for team work and efficient web development.

The problem with HTML5 is that, since too much is allowed, we cannot just use a standard HTML5 schema to validate our webpages. We must also use an additional layer of validation to make sure that our HTML5 is the exact subset we want it to be. Until someone provides us with such schemas, we must make them ourselves.

Today we can only validate an HTML5 page as HTML5 or as XHTML5. One of the best online validators is http://www.validator.nu. W3C's HTML5 validator has fewer options for HTML5. Some of the most important options are still missing in both validators:

We need an easy to use option that validates the XHTML subset of HTML5, and we need an easy to use option that validates polyglot XHTML5.

5.1 Valid polyglot XHTML

As a first step, David Carlisle has made a Schematron schema for polyglot XHTML5. Inspired by that schema, I have made an XML Schema 1.1 for polyglot XHTML5. This webpage is an example of polyglot XHTML5. Both schemas take as their point of departure that the document already validates as HTML5. We need schemas that can do the full polyglot XHTML5 validation in one go to make things easy.

For this third edition of the article, I'm happy to report that the CSE HTML Validator now has excellent support for polyglot XHTML5. I have helped a little. The CSE HTML Validator's validation for polyglot XHTML5 is so complete, that I can't think of any more tests to add! That's inspiring.

5.2 Polyglot XHTML in a nutshell

If you are new to polyglot XHTML, it might seem quite a challenge. But if your document is already valid HTML5, well-formed, and uses lower-case for element and attribute names, you are pretty close. The following 10 points almost cover all of polyglot XHTML:

  1. Declare namespaces explicitly.
  2. Use both the lang and the xml:lang attribute.
  3. Use <meta charset="UTF-8"/>.
  4. Use tbody or thead or tfoot in tables.
  5. When col element is used in tables, also use colgroup.
  6. Don't use noscript element.
  7. Don't start pre and textarea elements with newline.
  8. Use innerHTML property instead of document.write().
  9. In script element, wrap JavaScript in out-commented CDATA section.
  10. Many names in SVG and one in MathML use lowerCamelCase.
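
Several of the points above can be checked mechanically once the document parses as XML. The following Python sketch covers only points 1, 2, 4, 5 and 6 and is in no way a complete polyglot validator; it is neither David Carlisle's Schematron schema nor the XML Schema mentioned earlier. The function name `polyglot_problems` and the wording of the messages are assumptions, as is the extra requirement that "lang" and "xml:lang" carry the same value.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def polyglot_problems(document: str) -> list:
    """Return a list of messages for a few polyglot rules that are violated."""
    root = ET.fromstring(document)  # raises ParseError if not well-formed
    problems = []

    # Point 1: the XHTML namespace must be declared explicitly.
    if root.tag != XHTML + "html":
        problems.append("root element is not html in the XHTML namespace")

    # Point 2: both lang and xml:lang (assumed here to carry the same value).
    if root.get("lang") is None or root.get("lang") != root.get(XML_LANG):
        problems.append("lang and xml:lang must both be present and equal")

    # Point 6: no noscript element.
    if root.find(f".//{XHTML}noscript") is not None:
        problems.append("noscript element is used")

    for parent in root.iter():
        for child in parent:
            # Point 4: rows must sit inside tbody, thead or tfoot.
            if parent.tag == XHTML + "table" and child.tag == XHTML + "tr":
                problems.append("tr directly inside table")
            # Point 5: col only inside colgroup.
            if child.tag == XHTML + "col" and parent.tag != XHTML + "colgroup":
                problems.append("col used without colgroup")
    return problems
```

An empty list means only that none of these five rules was violated; the remaining points, and HTML5 validity itself, still need a real validator.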

The following additional rules will be picked up by HTML5 validation if violated:

It is important to remember that browsers don't care about validation. Even if you have a few polyglot details wrong, the differences in HTML DOM and XML DOM are often not big enough to matter from the end-user's point of view. DOM is mostly important if you use JavaScript making use of it.

6. HTML5 has rescued XHTML

The Internet is damned to be mostly "quick and dirty" or far from perfect. The W3C had a vision of something better. XHTML 1.0/1.1 leading to XHTML 2.0 was to be a new beginning. But the good intentions failed completely: the result was too strict for a world not even used to validation, not sufficiently backward-compatible, and offered too few tempting new features.

HTML5 has, so to speak, surrendered to the bad habits of the good old Internet. The vast majority of new web pages will also in the future be far from well-formed and valid. But the good guys can still use valid HTML5, or XHTML as a subset of valid HTML5, or XHTML5 with mimetype "application/xhtml+xml", or even polyglot XHTML5.

Here is my prediction for what webpages will look like, 10 years from now:

  1. Not valid HTML5. The vast majority of webpages will be not valid HTML5. A fraction of these pages will use a subset of not valid HTML that is also more or less well-formed. Avoid this crowd.

  2. Valid but not well-formed HTML5. A small minority will use valid but not well-formed HTML5. Nice option if you don't give a damn about the XML world or if you know too much about it!

  3. XHTML as a subset of valid HTML5. A small minority will use valid HTML5 that is also well-formed XML and XHTML except for the mime type. Nice option if you like the idea that your webpages are also XML.

  4. Polyglot XHTML5 but served only with "text/html". A small minority will use polyglot XHTML5 but serve it with "text/html" even to browsers understanding "application/xhtml+xml". Nice option because you remain in the "HTML world" with pages that are also XML, ready to be served also with XML mimetype.

  5. Polyglot XHTML5 served with "text/html" and "application/xhtml+xml". A small minority will serve polyglot XHTML5 with mimetype "application/xhtml+xml" to the browsers understanding it. Nice option. Both mimetypes but only one file. Your webpages can not be that bad when you risk Draconian error handling.

  6. Non polyglot XHTML5 and HTML5 (two sets of pages). Even fewer will use non-polyglot XHTML5 to the browsers understanding it and HTML5 to the rest. Not smart to have two sets of web pages.

Personally, I will use polyglot XHTML5 and switch mimetype as necessary. That is natural if you use XML every day in many other contexts. If you are not that keen on Draconian error handling, I recommend using polyglot XHTML5 served with mimetype "text/html".

Good people above average will always be a minority. I don't say that not well-formed and not valid HTML junkies are bad people. But why follow the crowd? It is ok to belong to a minority.

7. MIME type switching is easy

If you don't set the mimetype for a webpage using a script, it defaults to "text/html". To set the mimetype to "application/xhtml+xml" you need to be able to test if the browser requesting the webpage supports it. It can be done by the webserver or by a server-side script for the webpage. I use the latter method with C# in ASP.NET, shown below in a simplified version just to give you an impression.

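  // Name of the polyglot page to serve (without file extension).
  string page = "myWebpage";
  // The Accept header may be missing, so fall back to an empty string.
  string accept = Request.ServerVariables["HTTP_ACCEPT"] ?? "";
  if (accept.ToLower().Contains("application/xhtml+xml"))
  {
     // The browser says it accepts XHTML: serve the file for XML parsing.
     Response.ContentType = "application/xhtml+xml";
     Response.Charset = "UTF-8";
     Response.WriteFile(page + ".html");
     Response.End();
  }
  else
  {
     // Otherwise hand over to the same file with the default "text/html".
     Server.Transfer(page + ".html");
  }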

The above simplified C# code looks more or less the same in most programming languages.
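
As one illustration of that claim, here is roughly the same switch as a minimal Python WSGI application. It is a sketch under stated assumptions, not a translation the author has published: the names `choose_content_type`, `application` and `PAGE` are hypothetical, the page is held in memory instead of being read from a file, and the "Vary: Accept" header is an addition that tells caches the response depends on the request's Accept header.

```python
# A tiny polyglot page held in memory; a real site would read a file instead.
PAGE = (
    b'<!DOCTYPE html>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    b'<head><meta charset="UTF-8"/><title>Polyglot</title></head>'
    b'<body><p>One file, two mimetypes.</p></body></html>'
)


def choose_content_type(accept_header) -> str:
    """Pick the mimetype from the Accept header, as the C# code above does."""
    accept = (accept_header or "").lower()
    if "application/xhtml+xml" in accept:
        return "application/xhtml+xml"
    return "text/html"


def application(environ, start_response):
    """WSGI entry point: same bytes for everyone, only the mimetype differs."""
    content_type = choose_content_type(environ.get("HTTP_ACCEPT"))
    headers = [
        ("Content-Type", content_type + "; charset=UTF-8"),
        ("Vary", "Accept"),  # assumption: added so caches keep both variants apart
        ("Content-Length", str(len(PAGE))),
    ]
    start_response("200 OK", headers)
    return [PAGE]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serve on localhost:8000 for a quick manual test.
    make_server("localhost", 8000, application).serve_forever()
```

Like the C# version, this is a plain substring test. It does not honour quality values in the Accept header, so a browser sending "application/xhtml+xml;q=0" would still get the XML mimetype.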

8. Creation, maintenance and reuse

It is easy to make a case for why we should use valid markup. It ought to make it easier to create, maintain and reuse markup in the long run. Whether markup is well-formed or not is, at least on the surface, much less important. But it is difficult to deny that if you don't use MIME type "application/xhtml+xml" to enforce Draconian error handling, your markup has a strong tendency to end up one day not only mal-formed but also not valid.

Let me list a few examples of how I often benefit from well-formed webpages:

All the examples above can be done almost as easily with valid but not well-formed HTML instead of XHTML. But the irony is that if you want to do it with XML tools you need to make the HTML well-formed first.

It is relatively easy to use a tool like HTML Tidy to make mal-formed HTML well-formed automatically. But with many well-formedness errors to correct, doubts creep in about whether the fix is actually true to the original document or whether we end up with a new, slightly different document.

Fixing well-formedness errors in valid HTML poses almost no problems, but if the markup is not valid, there is often more than one solution to a problem. An additional tool (with bugs of its own), no matter how well integrated in your main tool, does add complexity to a project.

9. Disadvantages of polyglot XHTML

The disadvantages listed below are so minor that they hardly can be regarded as such, but judge for yourself.

9.1 It is more difficult

Making not valid HTML is easy. Making valid HTML5 or XHTML5 or polyglot XHTML5 is much more difficult. But we are talking about essential qualifications useful in many contexts. Once learned it ought to be easier and more satisfying doing things right than to repeat the same errors over and over again and rely on browsers to help you out.

9.2 XML editor is necessary

Since your webpages must be well-formed, your editor must have good support for XML. This is not an issue if you are used to XML tools but could be a problem if you have not used XML before.

9.3 Server-side scripting is necessary

You must set up a script to test what mimetypes the browser requesting your webpage supports, and a code switch that can choose the mimetype accordingly. The code for mimetype switching is only easy to implement, if you already use server-side scripting for URL rewriting or in order to generate the webpage, or if you can set it up on the webserver.

9.4 Restrictions in JavaScript

In XML, document.write() is not allowed; use the innerHTML property instead. Also, in the "script" element you cannot use "<" and "&", and you cannot just escape them like in XML (that creates a different DOM in HTML and XML). But you can use external JavaScripts, or you can wrap the internal JavaScript inside a CDATA section in the "script" element and out-comment the start and end delimiters of the CDATA section like this:

  <script>
     //<![CDATA[
        JavaScript goes here
        a < b
     //]]>
  </script>

In HTML parsing, what looks like a CDATA section in the "script" element is just characters. In XML parsing the CDATA section is recognized as such and makes the use of "<" and "&" possible inside the JavaScript. The CDATA section's start and end delimiter are out-commented in the JavaScript to prevent the characters from creating errors in JavaScript. [8]
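
The XML half of this behaviour can be reproduced with the XML parser in Python's standard library. This is an illustrative sketch only; the two sample strings and the function name `script_text_as_xml` are assumptions, and the HTML half (where the CDATA markers remain ordinary characters inside JavaScript comments) is not exercised here.

```python
import xml.etree.ElementTree as ET

# Inline script protected by an out-commented CDATA section.
WITH_CDATA = "<script>\n//<![CDATA[\nif (a < b) { run(); }\n//]]>\n</script>"

# The same script without the CDATA section.
WITHOUT_CDATA = "<script>\nif (a < b) { run(); }\n</script>"


def script_text_as_xml(markup: str):
    """Return the script text an XML parser sees, or None if not well-formed."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(repr(script_text_as_xml(WITH_CDATA)))
    print(repr(script_text_as_xml(WITHOUT_CDATA)))
```

With the CDATA section the parser hands the "<" through untouched and consumes the delimiters, leaving only the two "//" comment markers behind. Without it the document is simply not well-formed.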

10. Advantages of polyglot XHTML

  1. It gives you the best possible start of a quality assurance system for your webpages. Well-formed documents are already so tidy that other levels of restrictions like validation and accessibility are easy and realistic to implement. [9]

  2. Since your markup is likely to be more consistent and tidy than HTML markup, your markup ought to be easier to make, maintain and reuse in the long run.

  3. Well-formedness is an easy to understand baseline for team work. With HTML parsing only, you risk long discussions of how to restrict HTML5 to a decent consistent subset.

  4. You get the benefits of webpages that can easily be manipulated with XML tools. No need to make the markup well-formed first.

  5. By using XML web pages you get a chance to learn XML, important all over the place: publishing, storage, transport layer, web services, configuration files, office file formats.

  6. If you already use a lot of XML it is nice that you can keep on doing that also for webpages instead of the confusion of also having to use a one-off markup language like HTML with its own unique rules.

  7. Using polyglot markup raises your awareness of differences and similarity between HTML5 and XHTML5, between HTML parsing and XML parsing. You are likely to get a better understanding of how webpages work than if you only use HTML parsing.

  8. Since XHTML webpages using Draconian error handling risk not being rendered as a useful webpage if they have syntax errors, they are likely to be of higher quality than HTML webpages that run no such risk, and search engines like Google might one day give them higher priority.

Most web designers and developers will most likely also in the future continue with the BAD PRACTICES of not valid HTML. Many web designers and developers who do validate their webpages will not find the listed additional benefits of polyglot XHTML5 convincing or important enough, especially if they have no interest in XML and are used to doing things very differently.

But it is nice to know that polyglot XHTML5 is an option, if we care about XML, XHTML and quality webpages.

Footnotes

[1]

Third edition. In section "5.1 Valid polyglot XHTML", I have added a paragraph about excellent polyglot validation support in CSE HTML Validator. Second edition published: 2011-09-11. Section about JavaScript corrected with the help of David Carlisle at the XML Developers mailing list. Also sections have been rearranged, and paragraphs added. First edition published: 2011-09-03.

[2]

One thing is that the last two options are allowed in (x)HTML5, but I find it sad that the W3C spec actively promotes all the options.

[3]

Any mimetype ending with "xml" will probably do, but the spec clearly recommends to use "application/xhtml+xml" for XHTML5.

[4]

All major browsers like Safari, Firefox, Opera and Chrome, show an error message if a webpage served with mime-type "application/xhtml+xml" contains well-formedness errors. Instead of showing an error message, IE9 simply renders the page until the error.

[5]

For the sake of making things easy, HTML5 has undermined core concepts in XML. A prefix is no longer a prefix but part of local-name.

[6]

It is nice that HTML5 allows us to use an XML subset similar to XHTML, served with mimetype "text/html". In XHTML 1.0, the spec also allowed us to use "text/html" but only as a temporary hack-like solution that was considered harmful by many web developers. In HTML5, XHTML is a natural subset of HTML5 and in addition we have the option of using XHTML5 for browsers supporting "application/xhtml+xml".

[7]

Webpages that claim to be XHTML using one of the old XHTML DOCTYPES but served with "text/html" often have well-formedness errors. But they are likely to be easier to reuse than webpages using HTML DOCTYPES. A document where the author at least tried to make it well-formed is often easier to fix than a document where well-formedness was not at all in the author's mind.

[8]

Using the out-commented CDATA section method inside the "script" element doesn't create exactly the same DOM when parsed as HTML and XML but will probably be allowed in polyglot XHTML anyway because what is different has no practical implications. Should W3C end up being a little too academic, we simply allow the CDATA section in our schema anyway.

[9]

You can set up exactly the same QA for HTML parsing. But you need to make your own more or less arbitrary restrictions on top of HTML5 validation, and because the browsers don't care about validation, your webpages are more likely to attract dirt than if you use XML parsing.

Updated: 2012-09-27