Understanding HTML5 Validation

One of the things we need to get used to when switching from HTML4/XHTML to HTML5 is the way HTML5 validation works, because it’s drastically different from what we’ve become accustomed to in previous iterations of web markup.

First, it should be noted that the W3C’s HTML5 validation engine is “experimental”, so it’s a work in progress that will likely see many changes over the next year or more. Also, we shouldn’t refer to it as a “validator” anymore; it’s now more accurately referred to as a “conformance checker” (although for simplicity I’ll be using the term “validation” and its derivatives).

Thus, when you validate a page, the following warning is given:

The validator checked your document with an experimental feature: HTML5 Conformance Checker. This feature has been made available for your convenience, but be aware that it may be unreliable, or not perfectly up to date with the latest development of some cutting-edge technologies.

That having been said, let’s compare validation results using the same code for both HTML5 and XHTML. Here’s the code we’re going to validate in HTML5 and XHTML:


<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>HTML5 Validation</title>
<link rel="stylesheet" href="style.css">
<script></script>
</head>

<embed>

Text Snippet #1<br>

<p>

<p>Text Snippet #2</P>

<FOrM>

<input>

</form>

<textarea></textarea>

<a href=index.html target="_blank"><div>& Text Snippet #3</div></a>

When we switch to XHTML, we’ll make one change: we’ll replace the doctype with the XHTML 1.0 Strict doctype.

Just to make something clear: I’m not doing this comparison in order to imply that HTML5 is better or that XHTML is too strict. The purpose of this experiment is to help us understand what direction HTML5 validation has now taken.

HTML5: 0 Errors; XHTML: 23 Errors

The code shown above is (believe it or not) 100% valid HTML5. The only warnings given by the HTML5 validator are those that appear when validating virtually any document (the warning I mentioned above and another warning related to direct input). But there are no reported errors, whether you use Validator.nu or the W3C Markup Validator.

On the other hand, if you take the same code and validate it using XHTML (changing the doctype), the W3C validator will print 23 validation errors.

Validation Comparison

For reference, below you’ll find the code I’m using for XHTML validation. It’s the same as the code example above, except that it uses the XHTML Strict doctype and a different title. Go ahead and copy the code and try validating it:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>XHTML Validation</title>
<link rel="stylesheet" href="style.css">
<script></script>
</head>

<embed>

Text Snippet #1<br>

<p>

<p>Text Snippet #2</P>

<FOrM>

<input>

</form>

<textarea></textarea>

<a href=index.html target="_blank"><div>& Text Snippet #3</div></a>

If you look carefully at the code, you’ll see a whole slew of seemingly atrocious code mistakes. Here’s a list of all the problems that the code has from the standpoint of XHTML validation:

  • No <html> element
  • The <meta> element is not closed
  • The <link> element is not closed
  • The <script> element doesn’t have a type attribute
  • No <body> tag
  • A nonstandard <embed> element is used, and it’s not closed
  • Stray text (i.e. “character data”) with no paragraph or other required parent element
  • A <br> element with no self-closing slash
  • A paragraph element with no closing tag
  • A closing paragraph element in uppercase
  • A form element in mixed case with no action attribute
  • A stray non-closed <input> element that’s not wrapped in a <div> or other required parent element
  • A stray <textarea> element with missing rows and cols attributes
  • An anchor element with an unquoted href attribute
  • A deprecated target attribute on the anchor
  • A block-level element (<div>) nested inside an inline element (<a>)
  • An ampersand that’s not coded as a special entity
  • No closing <body> and <html> tags

As you can see, there are quite a few problems in that document that the XHTML validator flags as errors, while the HTML5 validator has no problem with any of them and gives the user the feel-good green screen that we all know and love. While it would be beneficial to discuss a number of these “errors” that are now acceptable in HTML5, that’s not the purpose of this article, so I’ll leave those for another time.
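To make the contrast concrete, here is a minimal Python sketch (not part of the original experiment) that feeds one line of the sample markup to two standard-library parsers. Neither is a conformance checker: `html.parser` is only a lenient HTML tokenizer, and `xml.etree.ElementTree` only checks XML well-formedness. The helper names `html_events` and `is_well_formed_xml` are made up for this illustration. It is meant only to suggest why a mixed-case closing tag bothers an XML-based toolchain but not an HTML one.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class EventCollector(HTMLParser):
    """Records the tags and text that Python's lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already converted to lowercase.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data))


def html_events(source):
    """Hypothetical helper: tokenize source as HTML and return the events."""
    collector = EventCollector()
    collector.feed(source)
    collector.close()
    return collector.events


def is_well_formed_xml(source):
    """Hypothetical helper: True if source parses as a well-formed XML fragment."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


snippet = "<p>Text Snippet #2</P>"
print(html_events(snippet))         # the HTML tokenizer pairs <p> with </P>
print(is_well_formed_xml(snippet))  # the XML parser treats p and P as different names
```

The same line is accepted by the HTML tokenizer and rejected by the XML parser, which mirrors (in a very simplified way) the gap between the two validation results above.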

What Accounts For These Differences?

The reason there’s such a big difference is simple: HTML validation is now separated from “linting”. A validator should not throw errors for code styling inconsistencies, but should only throw errors for, well, code errors. Thus, developers have been asking for HTML lint tools to help create consistent and maintainable code. At least one such tool is now available, but I’m not completely sure of its quality, so use it at your own discretion.
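As a rough illustration of what a separate lint pass might look for, the Python sketch below flags tag names that are not lowercase, a style issue that the HTML5 checker accepts. It is a regex-based toy and not how any real lint tool is implemented. The function name `lint_tag_case` and the sample input are assumptions made for this example, and a regex like this can be fooled by `<` characters inside attribute values or scripts.

```python
import re

# Matches the name of an opening or closing tag, e.g. "<FOrM" or "</P".
# Doctype declarations and comments start with "<!" and are skipped.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def lint_tag_case(source):
    """Return (line_number, tag_name) pairs for tag names that are not lowercase."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in TAG_NAME.finditer(line):
            name = match.group(1)
            if name != name.lower():
                findings.append((line_number, name))
    return findings


sample = "<p>Text Snippet #2</P>\n<FOrM>\n<input>\n</form>"
for line_number, name in lint_tag_case(sample):
    print(f"line {line_number}: <{name}> is not lowercase")
```

A check like this reports style findings without ever declaring the document invalid, which is the division of labor described above.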

Also, HTML5 is designed to be backwards-compatible, so it accommodates both HTML4 and XHTML coding styles. Jeffrey Zeldman alluded to this feature of HTML5 when he wrote that the oldest web document is almost valid HTML5.

What Does This Mean?

The fact that the validators don’t spit out any errors does not mean the code is good. Developers should still endeavor to adopt consistent coding methods to keep their code clean and organized. Thus, I’m not trying to discourage developers from paying attention to their coding style, but instead helping us recognize that the validator is now concerned only with real markup errors.

So what do you think? What are your thoughts on this direction in HTML5 validation (or conformance checking) compared to HTML4 and XHTML?

31 Responses

  1. Jon L:

    what is linting?

  2. This is great news for developers who had to follow useless rules just to comply with customer requests.
    Now your customer will see an “HTML5 valid” badge on his site, will be happy about this (HTML5 is cool, ya know?) and you will have more time to do the real job.

  3. If every HTML5-able browser renders that crappy code exactly the same way, I can’t say anything against this kind of validation. But for me personally it’s a step in the wrong direction. I mean it should be obvious that this piece of code is really bad to maintain because, for example, you simply don’t know where the tags will end. Okay, you can find it out, but only in this extremely basic example…

  4. I use HTML5 now but I still mostly use XHTML strict syntax… old habits die hard and I think generally a smarter, uniform code structure across the page and website results in more readable and manageable code

  5. Steve Wilcoxon:

    Personally, I think the article is a perfect explanation of many of the failures of HTML5 – it should have been XHTML2 instead. XHTML is much easier to parse (any XML or SGML parser will work) and thus much easier to write code to process (both in the browser and for tools that scrape web sites for info).

    • I guess the problem is that HTML is not a programming language; it’s a markup language. It is responsible for describing data. If it’s strict (as XHTML is), then it suffers from the problem of not being backwards-compatible, and being less accessible.

      In an ideal world, you’d probably be right, and XHTML 2 would be the best choice, but the internet is far from ideal, so a more lenient markup language is really the only viable solution.

      • M. Wilson:

        I know that it is not a syntactical error, technically speaking. But there obviously is an accepted industry standard when it comes to what is well formed tagging. Therefore the only reason why malformed tags are accepted is because browsers aren’t designed to care. But I guess that opens a whole other can of worms that goes beyond this discussion.

        Thanks for the response.

  6. M. Wilson:

    I find this a bit problematic, actually. Here is why. Programming languages (of which HTML is not) have several types of errors; one being a Syntax Error. Why does HTML deviate from practical norms by suggesting that <FOrm> (vs. <FORM> or <form>) is a “stylistic” choice on the part of a coder? What advantage does industry gain from this? I always thought that coding style came into play in how a coder constructed their code, not how they write a method or object when instantiated. And yes, I think that <form> (& most other tags) is very similar to an object–especially since the tags provide meaning to the content in HTML5.
    Someone please edumacate me.

    • But is <FOrm> a “syntax” error? There is nothing in that construct that makes the browser unfamiliar with the form element, so it’s not an error, it’s a style issue.

      Whereas in the case of, say, JavaScript, in order to write (for example) the built-in function “setInterval”, you have to use the proper case. The browser’s scripting engine will not recognize “setinterval” (notice the case difference) because that could be a custom object instead.

      So again, as I mentioned in the article, the HTML5 validator is now not concerned with “linting” your code but is only concerned with actual code errors and things that cause the content or markup to potentially be unrecognizable by the user agent.

      • M. Wilson:

        Sorry, I responded to the wrong post earlier. *blush*

        I suppose, then that this is more of a values and attitudes type of situation, rather than a pure technical one. Call me rigid, but I look at <FOrm> as a mistake that affects the integrity of the code from a productivity standpoint, even though the browser doesn’t care.
        However, as I have thought more about this, I am coming to appreciate the idea that the validation process cares nothing for how I code, and only about what I code. But coming from more rigid environments where how you code was just as important as what you code, I am sure that fully digesting the validation process of HTML5 is going to take me a little longer.
        Thanks for all of the comments, here. I see great value in the various perspectives.

  7. Tony Legrone:

    I understand the case sensitivity not being a big deal in HTML. However, I do think it’s a bad idea to allow un-closed tags to run loose all over the Internet.

    That opens the door to misinterpretations by browsers and confusing maintenance.

  8. Paul:

    Wrong direction, I think. It means any dufus can now write crap markup and pollute the internet with their rubbish.

    • M. Wilson:

      Paul, I think this is already the case. I am not so sure that HTML5 validation will contribute any more to the problem. And if the browser doesn’t care, then 9 times out of 10, neither will the user. Strict coding style becomes, almost solely, self-imposed. Now those of us who close all tags, use lowercase tags, etc. can just know that we are better than everyone else but really have nothing to show for it.
      That last sentence was kind of a joke, btw.

  9. Michael Kelly:

    While it’s possible, perhaps even likely, that assistive technologies will eventually be able to handle “conformant” HTML5, for the time being I’m going to assume that the WCAG 2.0 recommendation that tags be closed still holds (http://www.w3.org/TR/2008/WD-WCAG20-TECHS-20081103/H74).

  10. Having moved to Web programming 12 years ago from the publishing software that invented GML (Guthenberg, Ventura …), XHTML always seemed to me like a dictatorial approach to HTML syntax written by someone with OCD (obseSSive COmPulsIVE DISORDER). Flexibility is far more desirable than uniformity, and I identify more with the browsers that acted liberally than with those that did not.

    If you want to close all your tags and conform to other XHTML standards, then that should be your own responsibility, for which you can use an XHTML syntax checker; it should not be imposed by a standards convention in the browser. Brevity can be just as much a reason for non-standardization (eg removing attribute value quotes). In addition, even XHTML doesn’t resolve some of the most problematic issues of standards, eg the “disabled” attribute, which has no value, or custom attributes (how are these supposed to be referred to in js?).

    Standards are often more important in javascript although no official body has set them – and once again never to be imposed. I never create a js variable without prefixing it with its location or type (ls_ = local string, aint_ = integer argument of a function, gb_ = global boolean, of_ = function in an object library, etc, etc). There are IDEs in which you can define your own rules (Visual Studio has limited ability to do this).

  11. The subject of this discussion reminds me of a very funny incident that happened to me about 20 years ago and has direct bearing on the comments here.

    I worked for a technical writing company that was assigned the contract for translating into Hebrew the manuals of Windows and OS/2 (which had just come on to the market and which now included a mouse). Together with a colleague we encountered a problem right at the outset – there was no word for “click” in Hebrew and we had to describe the mouse button actions. “Press” was no good because it was used for the function keys on the keyboard and might confuse the new users who had never seen a mouse.

    Then we made a fatal error – Israel is like W3C – it has an Institute of the Hebrew Language which tries to impose standards to prevent the language being adulterated with foreign words. So before continuing we contacted the Institute to enquire if a word for “click” had already been imposed upon us and a very serious language professor came on the line. Our suggestion was to use the word “click” in Hebrew, which like most Hebrew words could have 3 root letters (klk). From the other side of the line a furious voice yelled at us: “The telephone scandal will not be repeated!” It took us some while to understand the comment, but then it dawned on us that Hebrew had been adulterated with that terrible foreign word “telephone” and there was even a verb “leTaLFen” – to telephone.

    We burst out into laughter and I am wary to admit that my colleague and I bear the responsibility for adulterating the Hebrew language with “click” for which there is no biblical equivalent.

  12. Jecc:

    I just learned something new: anchor’s target attribute was deprecated in HTML 4.01 (and consequently XHTML) in favour of rel="external".
    It’s no longer deprecated in HTML5. I guess they realized no one knew or paid attention to the deprecation.

  13. Doug P:

    To say that there are 23 errors in the HTML code because the XHTML validator says so is a bit like saying that I misspelled flavor, judgment, and tire because you’re British and you spell these words as flavour, judgement and tyre. HTML is not XHTML and vice-versa. Yes, they’re similar, but they’re not identical.

    Why does one need to close a <link> or <br> element? Someone stated above that this would lead to “misinterpretations by browsers” and another said that “this piece of code is really bad to maintain because i.e. you simply don’t know where the tags will end. Okay you can find it out, but only in this extremely basic example…”

    Neither of those statements is true of correctly formed HTML5 code. It is dead simple to know where a <br> element ends, for example. Same with a <link> or <embed>. These are self-closing elements because, well, because everything they need is enclosed in the element itself. If you can’t figure out where, for example, a <br> element ends, I might suggest that web markup not be your career. I know of no browser that has a problem with that either. The only time browsers have a problem is when people don’t follow the rules. The fact that HTML rules are a little easier and less obsessive does not make them any less valid.

    HTML5 doesn’t require closing tags where the closing can easily be deduced. It does require them elsewhere. To me this is perfectly logical. Require something only when it’s actually required. Don’t require it when it’s not necessary to do so.

    Although HTML5 doesn’t require the script element to have a type declaration, this doesn’t mean you can’t put one in if you want. Not sure why it matters, though. When’s the last time your script type declaration was anything but type="text/javascript"?

    I think that is their point. Why add useless fluff to the page? Get rid of those things that don’t actually do anything and concentrate on improving those that do.

    • For the most part, I agree, I think you made some good points.

      Keep in mind though that HTML5 validation now allows <p> tags to not be closed, and that can be a little unsettling if you see some paragraph tags closed, and others not. Also, you’re allowed to have stray text with nothing wrapping it, which again can be confusing and who knows how it affects SEO or accessibility (if at all).

      Also, while it is true that HTML5 has gotten rid of a lot of fluff, the question is: why did it take so long? It would have been nice if these things had been realized 8 years ago. Of course, a lot of the decisions that have gone into HTML5’s spec are largely because of what direction the web has taken in the past 8-10 years (i.e. the pave the cowpaths principle).

  14. jojomonkey:

    My intro programming professor said it best when he was talking about making sure to use curly braces (in C/C++) for all ‘if’ statements – even if it is a one-liner. ‘Any programmer that doesn’t use a pair of curly braces deserves to be shot’. He was from Texas. I think his lesson applies to closing off HTML tags appropriately.

    • Andy:

      You’ve got it bang on, it’s not a case of ‘does it really need to be closed’, more a case of it’s clear to every developer and every browser what the intention is and surely that is the mark of any good language. After 10 years of developing in html and xhtml and more recently making the switch to Java, I’m disgusted by the thought of having to go back to loosely typed html. I have this very minute spent time debugging rubbish xhtml and explaining to a numpty that if he’d stuck to the xhtml standards and validated his code then the problem we were debugging would never have arisen.
      Sure html/xhtml is not a true programming language but that is irrelevant, this is a question of making a little extra effort to ensure a quality outcome.

  15. This makes me wonder what validation actually is. To me this looks like invalid code: there’s no html element, there’s elements that don’t exist, elements not closed properly … What do you need to do to get invalid HTML?

  16. It’s a step in the wrong direction. The single thing that did the most damage to the quality of code on the Web is IE’s inability to serve application/xhtml+xml – more rigid structure would not only lead people and tools to write better markup, but would also lead to far less head-scratching when people are trying to make DOM manipulation work across inconsistent DOMs (like you’d find when you incorrectly nest elements).

  17. Early in my HTML career, I built a web page for a friend. When I posted it to the server and viewed it in the browser, all I got was the website directory. No matter what I did I could not get the home page to load.

    It took six months before I finally stumbled on the flaw: I had posted Index.html.

    This should not have been a reason for the website to fail. Hurrah for HTML5!

  18. HTML5 is a godawful wrong turn. We need to get to a standard that forces content providers to provide decent code; in this regard HTML5 is an epic fail. The current model means browsers must be larded with code to parse markup written by the American Tourister Gorilla. Therefore everyone must continue to maintain ever more complex browsers capable of the coprophagic act of rendering garbage code.

    I still like XHTML and am disgusted by the loose standards of HTML5.

  19. Gerben:

    You can trick the validator into validation using xhtml5 instead of html5. This can be done by putting the xml prolog before the html. This way you get the strictness of xhtml with the new properties and elements of html5. The error messages are sometimes a bit obscure, because it first tries to parse it as XML, but it’s workable.

    I created a little bookmarklet that puts the xml prolog in front of the html of the current page and adds the required xmlns attribute, and then sends it to the validator, so you don’t have to add the prolog to the source code. It can be found at http://jsbin.com/oditi4/10/edit . Could be improved though.

  20. Missy:

    Just tried HTML lint on some pages I created especially for testing validators.
    It failed to spot deliberate mis-nesting, and didn’t report any errors at all when I set the doctype to HTML.

  21. Alicia:

    The lint tool you mention seems to be no longer available. Do you know of any others?

    • There doesn’t seem to be anything that I can find. I’m not sure what happened with that tool, but if I find something that works well, I’ll post it here.

      But to be honest, don’t worry about it too much. HTML linting is not that important. Validate your pages, and don’t expect perfection. Focus more on JS linting and CSS. It has many more benefits. :)
