> The best feature of SuperHTML is that it is based on the official HTML living specification, but it deviates from it when it makes sense to do so.
I think this is actually a decent design decision. Unlike in networking, where applying Postel's Law (specifically, the "be liberal in what you accept" part) can lead to implementation details becoming a de facto part of the spec, SuperHTML isn't outputting data to another system for further processing - it's giving information (validation errors) to a human for their own use, and if that information leads to humans writing HTML that's a more-robust strict subset of the formal HTML spec, all the better.
> a more-robust strict subset of the formal HTML spec
I still think we’d be better off just using XHTML. There are some practical problems with XHTML5 (e.g. there’s no named-entity support, for some reason), but at least the syntax makes sense.
That was tried 20 years ago and it turns out that humans are not good at writing XML.
XML makes sense if you are authoring HTML in an editor. However, that is not how most HTML is actually produced: it's mostly generated by templating engines. This means you can't validate the XHTML during development, because it's being generated on the fly. You only find out whether it's valid in testing or production, perhaps only for a subset of users in certain situations. With HTML this is OK because there is error recovery. With XHTML you get literal downtime, because in the worst case the entire page shows a WSOD.
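You can see the two failure modes side by side with the browser's own parser (a quick sketch using `DOMParser`; the sample string is made up):

  const broken = "<p>unclosed paragraph <b>and unclosed bold";

  // HTML parsing: fully-defined error recovery, you always get a usable tree.
  const asHtml = new DOMParser().parseFromString(broken, "text/html");
  console.log(asHtml.body.innerHTML);
  // "<p>unclosed paragraph <b>and unclosed bold</b></p>"

  // XML parsing: one well-formedness error and the whole document is rejected.
  // Browsers signal the failure with a parsererror element instead of content.
  const asXml = new DOMParser().parseFromString(broken, "application/xhtml+xml");
  console.log(asXml.getElementsByTagName("parsererror").length > 0); // true

The HTML side silently repairs the input; the XML side hands you nothing renderable, which is exactly the WSOD scenario when it happens in production.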
Yes, XHTML is okay as an internal tool: if for some reason your pipeline depends on parsing your own HTML, then switching to XHTML internally could be a win. Just don't ship XHTML to browsers.
Surely a template engine would be able to produce valid (X)HTML?
Strict XHTML failed on the web because older browsers could not show it at all (since it used a different MIME type), so nobody sane would use it. The problem wasn’t the strictness per se; the problem was how it was introduced, without concern for backwards compatibility.
JavaScript is strict in the sense that any syntax error will terminate execution. This seems to work fine because there is an incentive to make the syntax valid.
If XHTML was introduced in a backwards compatible way but new features (like canvas) only worked in strict mode, I’m sure it would have caught on. The incentives just have to be there.
IE6’s refusal to display any page served with the XHTML MIME type was certainly the main reason nobody deployed real XHTML, but the overstrictness was not far behind. It’s hard enough to justify a complete rewrite of your website’s HTML; even harder when any encoding error or tag imbalance generated by your CMS would display the yellow screen of death rather than a best guess, or even everything up to the error.
If there was an actual benefit to using XHTML, I’m sure CMSes would be updated to support it. It is not as if producing syntactically valid JSON or SVG, for example, is an impossible problem.
As “use strict” in JavaScript shows, it is possible to introduce stricter parsing of an existing format, as long as it is explicit opt-in and existing content is unaffected.
I think the main problem with CMSes supporting XHTML would be that basically every single one uses a template engine that treats HTML as a string of characters.
Is there a templating system that’s easy to use (think Jinja or something Svelte-like), but parses templates as XML instead of just concatenating a bunch of strings?
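Not that I know of a mainstream one, but the underlying idea is easy to sketch with plain browser DOM APIs (`listOf` is a made-up helper, not any real template engine): build a tree and serialize it, so the output is well-formed by construction and interpolated values get escaped for free.

  function listOf(items) {
    const XHTML = "http://www.w3.org/1999/xhtml";
    const ul = document.createElementNS(XHTML, "ul");
    for (const text of items) {
      const li = document.createElementNS(XHTML, "li");
      li.textContent = text; // "<" and "&" are escaped on serialization
      ul.appendChild(li);
    }
    return new XMLSerializer().serializeToString(ul);
  }

  console.log(listOf(["one", "two <b>not bold</b>"]));
  // <ul xmlns="http://www.w3.org/1999/xhtml"><li>one</li>
  //   <li>two &lt;b&gt;not bold&lt;/b&gt;</li></ul>

A Jinja-level system would layer loops and conditionals on top of this, but the point stands: no string concatenation, no way to emit unbalanced tags.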
I think if XHTML had been pushed forward, the second problem would have been swiftly solved: we'd have a lot more systems that treat webpages as XML documents rather than just templated text. And text-based systems could easily validate their XHTML output and report failures quickly, as opposed to now, where you get a broken page and have to go looking for the malformed HTML yourself.
For better or worse XHTML, also known as the XML serialization of HTML, cannot represent all valid HTML documents. HTML and XML are different languages with vastly different rules, and it's fairly moot now to consider replacing them.
Many of the "problems" with HTML are still handled adequately simply by using a spec-compliant parser instead of regular expressions, string functions, or attempting to parse HTML with XML parsers like PHP's `DOMDocument`.
Every major browser engine and every spec-compliant parser interprets any given HTML document in the same prescribed, deterministic way. HTML parsers aren't "loose" or "forgiving" - they simply have fully-defined behavior in the presence of errors.
This turned out to be a good thing because people tend to prefer being able to read _most_ of a document when _some_ errors are present. The "draconian error handling" made software easier to write, but largely deals with errors by pretending they can't exist.
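The classic misnesting case shows this well - the spec's "adoption agency algorithm" mandates a single repair, so every engine agrees (a sketch, runnable in any browser console):

  const doc = new DOMParser().parseFromString(
    "<p><b>bold <i>bold italic</b> just italic</i></p>", "text/html");
  console.log(doc.body.innerHTML);
  // Every spec-compliant parser yields the same repaired tree:
  // <p><b>bold <i>bold italic</i></b><i> just italic</i></p>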
Clearly not the case, as the point of a data language is to free you to pick a programming language to produce it, and the point of a specification is to allow agreement without a specific implementation in a particular language.
That’s exactly what happened. We write JSX which gets compiled down to assembly, excuse me, HTML5 or XHTML or whatever. Fine by me as long as we accept that writing it by hand is not what engineering time should be spent on in the overwhelming majority of cases.
(I’d also like a word with yaml while we’re at it…)
Yes, you nailed it with "the deviations only make it more strict" - that's why I think that it's reasonable.
I believe that, in general, when your implementation deviates from a spec, you should have a good reason to do so, so that you don't end up multiplying incompatible implementations - at the very least you should think about why you're incompatible. I just think that this is a good reason.
> but it deviates from it when it makes sense to do so
I guess it's ok to take liberties when the "official spec" (WHATWG HTML? which version? a W3C snapshot? older redacted W3C spec? or even MDN as the author is saying elsewhere?) has evolved over the course of ten or more years, but the base and where the model deviates should be documented, shouldn't it?
Doing so might also help with the issues found when this was first discussed [1], such as the bogus hgroup deprecation and the possibly related h1-h6 end-element tags. For context, hgroup is not WHATWG-deprecated, even though it was never part of W3C HTML until 2021 and is marked deprecated according to the author's consulted MDN reference (which is not canonical). What has changed (in a backward-incompatible way!) is its content model, due to its role changing: it used to provide a mechanism to prevent HTML outlining in ToC/navigational content, and outlining has since been removed altogether.
See the details in [3]; you can use the referenced SGML DTD grammar for checking against that particular (arguably bogus and no longer deserving of the "HTML 5" moniker) HTML version/snapshot, as well as earlier WHATWG and W3C HTML snapshots/recommendations.
> My favorite example is <li>item<li>, which is unfortunately both valid HTML and also an obvious typo.
If it’s <li>item<li> and then the next line is <li>item</li>, it’s an obvious error, but you certainly need to discriminate in that sort of way. Because if your thing goes complaining when I write <li>item<li>item, I’ll be upset.
(In practice, it would almost always be <li>item and then a next line <li>item; when doing things like horizontal menus, you used to use `display: inline-block` and need to avoid whitespace between list items, so you would find <li>item<li>item in serious code, but now you’d use flex and so whitespace between items gets vanished, so you can write it on multiple lines without it affecting the actual layout.)
> Because if your thing goes complaining when I write <li>item<li>item, I’ll be upset.
It does, sorry. No optional tags, period.
Giving whitespace more semantic value than what it already has is a dangerous game. I already use whitespace to let the user "talk" with the autoformatter, and from my perspective that already spends all my whitespace (and weirdness) budget.
First of all, this is awesome work and it's wonderful to see people taking writing plain HTML seriously, because, as you identified, the existing tooling is completely insufficient.
That having been said, I think the reason people largely don't write plain HTML is because the affordances that HTML makes for writeability (like omitting closing tags on <li>, <p>, etc) are frowned upon. If you're going to make all this effort to understand the HTML in the document, it's kind of a bummer to not make it actually understand some core aspects of HTML, aspects that are especially useful when writing it directly instead of treating HTML as a compile target.
When writing code with others, in existing code bases, I will follow house style. (More faithfully than almost anyone else, because I pay attention to these things, both to see what the style is, and to match it.)
When writing code personally, I’ll omit things like </p>, </li>, </head>, </body>, </html>, <head> (which should never have attributes), and attributeless <html> and <body>.
When writing code with others, if I’m in charge, I’ll frequently retain </p> for others’ sensibilities, but I will still often choose to omit </li> (depending on the type of code, definitely).
I specifically dislike </li> because I’ve seen far more problems caused by its presence than by its absence - incorrect nesting and the like. Also, I reckon it’s good that you end up with no whitespace node between your list items (since any whitespace goes into the previous list item), mostly because of old-style `display: inline-block` and such.
Another instance of this is DNS - which pretty much always gets called "domain name server server" (DNS server).
Though you can at least have an argument there, as the first server could theoretically be the process that's running on the hardware/virtual server (which would be the second server).
Even the original RFCs from 198x only write about domain name servers, though, which is why "DNS server" is still a misnomer of the same kind, even if my attempt at rationalizing it failed - as that's the only thing you've called into question.
Because if "domain name system server" makes sense to you, then "language server protocol server" has to make sense in the same way, as that's literally the same concept.
LSP Server does make sense; the problem is that the title calls it an "HTML LSP", not an "HTML LSP Server". The thing described is a server, not a protocol.
Coming from a different background, LSP to me means “linguistic service provider” (i.e. human language translation). I wasn’t familiar with this LSP [0] at all.
I think this is applying pedantry in the wrong place.
Language Server, you say? Ok, it serves a language. But how? Using LSP? Ok, sounds like an LSPS. The client would be an LSPC. LSP is not the only language an LS can speak, it's overwhelmingly dominant, yes, but the Dart language used to speak its own protocol, and SLIME is also a distinct protocol for what we may reasonably describe as language servers. Similarly, although HTTP(S) is the only hypertext protocol in common use, we don't call it a Hypertext Server, we call it an HTTP server.
I think it's more than alright to elide LSPS as just LSP. At this point that's idiomatic. But my point is that the term is anti-redundant, unlike ATM Machine and other usual suspects in this perennial Internet topic: it elides, rather than duplicates, information.
Yeah, that detail of terminology always bothered me a bit. That said, we do call HTTP servers that, not “Hypertext servers”. Then again, we don’t call an HTTP server “an HTTP”. :)
The specification is so huge and complex. I was interested for a while, but that faded fast when I started to dig into it. Also, the lack of support for running purely in the browser made it a no-go for me, although I have seen a few cool hacks to run it in a webworker.
An LSP server does not have to implement all of the spec to be useful. Just document change notifications and diagnostics are enough to implement a basic linter or spell checker.
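To gesture at how little that takes, here's a rough sketch of a diagnostics-only server in Node using the `vscode-languageserver` package (the toy regex check is mine, not SuperHTML's):

  const {
    createConnection, ProposedFeatures, TextDocuments,
    TextDocumentSyncKind, DiagnosticSeverity,
  } = require("vscode-languageserver/node");
  const { TextDocument } = require("vscode-languageserver-textdocument");

  const connection = createConnection(ProposedFeatures.all);
  const documents = new TextDocuments(TextDocument);

  connection.onInitialize(() => ({
    capabilities: { textDocumentSync: TextDocumentSyncKind.Incremental },
  }));

  // Re-lint on every edit: flag anything that looks like a self-closed element.
  documents.onDidChangeContent(({ document }) => {
    const diagnostics = [];
    for (const m of document.getText().matchAll(/<\w[^<>]*\/>/g)) {
      diagnostics.push({
        severity: DiagnosticSeverity.Error,
        range: {
          start: document.positionAt(m.index),
          end: document.positionAt(m.index + m[0].length),
        },
        message: "html_elements_cant_self_close",
      });
    }
    connection.sendDiagnostics({ uri: document.uri, diagnostics });
  });

  documents.listen(connection);
  connection.listen();

That's the whole server; the editor client does the rest.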
Yep, the LSP mentioned in the blog post (I'm the author) only implements diagnostics and autoformatting for now. With just that you can already provide a lot of utility.
It doesn't, actually: the "server" can (and in many cases does) run in the JS event loop.
Do you want the protocol to specify that language servers are able to run in a browser? Because that's very outside the scope of the protocol, which doesn't constrain the client or server implementations. LSP doesn't define the transport layer between them, just that they should use JSON RPC.
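For illustration, the common stdio transport is nothing more than Content-Length-framed JSON-RPC (a Node sketch, not tied to any particular server):

  const body = JSON.stringify({
    jsonrpc: "2.0",
    id: 1,
    method: "initialize",
    params: { processId: null, rootUri: null, capabilities: {} },
  });
  process.stdout.write(
    `Content-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}`);

Swap stdout for a WebSocket or a postMessage channel and nothing else about the protocol changes.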
LSP and tree-sitter solve different problems and aren't interchangeable, it sounds like you were trying to pound a square peg into a round hole.
LSP doesn't (nor can it) specify anything that would make your life easier to use a language server in the browser. There are editors that provide clients and language servers written in JS, though.
Do you think this can be applied to HTML fragments? I think it's relatively common (or at least not uncommon) to embed HTML in another language as a string. It would be cool to get LSP functionality in those strings. I'm thinking specifically of something like this:
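  // Illustrative only - an HTML fragment embedded in JavaScript as a string:
  const fragment = `
    <li>item1</li>
    <li>item2<li>
  `;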
The example omits the context around those <li>s but you can assume they're inside a <ul>. That's semantically valid html because </li> can be omitted.
But despite being valid and unambiguous for the parser, it can still lead to confusing problems for unsuspecting developers:
  <ul>
    <li>item1</li>
    <li>item2<li>
    <script id="this-script">
      let ul = document.getElementById("this-script").parentElement;
      console.log(ul.tagName); // prints "LI"
      // *confused screams by the developer*
    </script>
  </ul>
Ok, that's a specific rule for <li>, among others for closing tags. That was ignorance on my part; I overlooked the "sometimes" in the "It's valid HTML because the spec allows you to omit closing tags sometimes" comment.
> Is not valid HTML, it's merely valid grammar syntax for a loose parser.
It's an incredible journey writing a spec-compliant HTML parser. One of the things that stands out from the very first steps is that the "loose parser" is kind of a myth.
Parsing HTML is fully-specified. The syntax is full of surprises with their own legacy, but every spec-compliant parser will produce the same result from the same input. HTML is, in a sense, a shorthand notation for a DOM tree - it is not the tree itself.
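A small demo of the "shorthand" point - the parser infers structure that was never typed (browser sketch):

  const doc = new DOMParser().parseFromString("<table><tr><td>cell", "text/html");
  console.log(doc.body.innerHTML);
  // "<table><tbody><tr><td>cell</td></tr></tbody></table>"
  // A tbody appears and every tag is closed: the markup named a tree, not text.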
The term "invalid HTML" also winds up fairly meaningless, as HTML error are mainly there as warnings for HTML validators, but are unnecessary for general parsing and rendering.
And these are things we can't easily say about XML parsers. There are certain errors from which XML processors are allowed to recover, but which ones those are depends on which parser is run.
---
> I do like adding restrictions on confusing patterns with no known legitimate use cases or better alternatives.
HTML was based loosely on SGML, a language designed to encode structure in a way that humans could easily type. Particular care was taken in SGML to allow syntax "minimizations" (omitted tags, for example), so that the effort of encoding the required structure wouldn't put humans off. It was noted in the spec that if people had to type every single tag, they would likely give up. They did.
But SGML also had well-specified content models in the DTD, formalizing features like optional tags, short tags, tags derived from content templates, default attribute values. Any compliant SGML parser could reconstruct the missing syntax by parsing against that DTD.
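For instance, a DTD sketch (made-up content model, but real SGML notation - the two flags after the element name say whether the start and end tags may be omitted):

  <!ELEMENT ul  - -  (li)+ >          <!-- both tags required -->
  <!ELEMENT li  - O  (%flow;)* >      <!-- end tag omissible: parser infers </li> -->
  <!ATTLIST li  type  (disc|square|circle)  disc >  <!-- default attribute value -->

With declarations like these, `<li>item<li>item` is not an error to recover from; it parses to exactly the same structure as the fully tagged form.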
HTML missed out on this, and effectively the DTD was externalized in the browsers. An effort was made to produce a proper SGML DTD for HTML, but it was too late. Perhaps if a widely-available SGML spec and parsers had existed at the time HTML was created, the story would be different.
Needless to say, these patterns are the result of formal systems taking human factors into their designs. XML came later as a way to make software parsers easier, largely abandoning the human factors and use-case of people writing SGML/HTML/XML in a text editor.
SGML is still rather fun to write and many of these minimization features are far more ergonomic than they might seem at first. If you have a parser that properly understands them, they are basically just convenient macros for writing XML.
Yes, thanks for pointing out that "valid" should not be thrown around too easily. And it happens that I made a mistake: the snippet is actually valid, a pattern shared with a small set of other exceptions, exactly as you point out!
Thanks for pointing out key aspects of the story, I had a loose knowledge about it.
XML brought in "empty elements" by adopting SGML's "null end-tag" (NET), part of the SHORTTAG minimization (though it changes the syntax slightly to make it work).
In SGML's reference concrete syntax, the Null End-Tag delimiter is `/`, and lets one omit the end tag.
<p>This is <em/very interesting/.</p>
Here the EM surrounds "very interesting." In XML the "Null End-Tag Start" becomes "/" and the "Null End-Tag End" is ">".
So in XML a `<br />` is syntax sugar over `<br></br>` and identical to it (an element with _no_ child nodes), but in HTML `<br>` is an element which _cannot contain child nodes_ and the `/` was never part of it, required, or better.
As called out in another comment, the trailing `/` on HTML void elements is dangerous because it can lead to attribute value corruption when following an unquoted attribute value. This can happen not only when the HTML is written, but when processed by naive tools which slice and concatenate HTML. It's not _invalid_ HTML, but it's practically a benign syntax error.
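Concretely (the file name is made up):

  <img src=/images/logo.png/>

Because the value is unquoted, the final `/` is consumed as part of it, so the browser requests `/images/logo.png/` - trailing slash and all - while `<img src="/images/logo.png"/>` would be merely redundant.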
These two things have nothing to do with each other though:
- HTML doesn't have XML-style empty elements such as in "<div/>" and simply ignores the slash, treating the construct as a start-element tag
- HTML is "hardcoded" in that the vocabulary understood by the browser is determined out of band, whether for empty ("void") elements or not. An SGML DTD (like the W3C HTML 4.01 DTD, or newer DTDs for HTML 5+), which would be the means to dynamically inform a parser about the parsing rules for a markup language, actually contains declarations for the meta, img, and other elements as having EMPTY content, along with the tag inference and attribute shortform rules also necessary for parsing.
It's mentioned in the GitHub readme: SuperHTML only supports HTML5, regardless of what you put in the doctype.
I will probably keep tracking the WHATWG living spec and never implement any other spec (old HTML versions, XHTML), so if you have a keen interest in anything other than contemporary HTML, SuperHTML might not be the right tool for the job.
That said, I'll add a warning if you put something else in the doctype so that users can know what to expect.
I got this error first with an HTML doctype, so it is a problem in my code, which uses self-closing tags with an HTML doctype. Instead of the very uninformative "html_elements_cant_self_close" message, it should explain that the HTML spec does not have "self-closing tags", and that this code uses a deprecated feature from previous (XHTML) specs.
You're getting a very clear error: your sample is not valid HTML, because, as the error informs you, HTML elements can't self close. Why would it reference some other document type for which this may or may not be valid syntax? Should it also tell you that these are not valid JSON objects or LISP forms?
Now, separately from that, it might be nice if it told you that it doesn't support DOCTYPE == XHTML and will treat the document as HTML anyway, but that should only appear as an error/warning on the DOCTYPE line, it shouldn't confusingly mention XHTML wherever you might have meant some XHTML-ism that is invalid HTML.
Also, XHTML is not "an older version of HTML", it is a completely different standard that is (or was) sometimes used instead of HTML.
"html_elements_cant_self_close" is not a very clear error.
Yes, now it is clear what it means, and it would be great if other users didn't have to start a new HN discussion to receive a clear explanation when it could be given directly in the "html_elements_cant_self_close" tooltip. We can polish the text and decide whether to mention XHTML or not, but the error message should be more informative.
> On void elements, it does not mark the start tag as self-closing but instead is unnecessary and has no effect of any kind. For such void elements, it should be used only with caution — especially since, if directly preceded by an unquoted attribute value, it becomes part of the attribute value rather than being discarded by the parser.