Is there an article with the background on this? Why do we have both the W3C and the WHATWG, and why do the W3C just copy and paste work from WHATWG, if that is indeed what happens?



I don't know of an article, sorry. A brief history from memory would be that during XHTML days the W3C essentially let the HTML spec languish and people weren't moving to XHTML (at best they were moving to XHTML-like HTML).

So the WHATWG came along (mainly organised by the major browser vendors) and started the HTML spec moving again. This became part of what's known as HTML5.

However, the WHATWG doesn't exactly make a "standard"; it makes a "living standard", which is a constantly shifting document that aims to describe where browsers currently are and what they hope to implement. The W3C decided to keep publishing its own HTML specifications and, as the WHATWG does describe what browsers are trying to do, the W3C's spec has to build at least partly on that work. There are differences though. For example, the W3C requires at least two implementations of a feature for it to be included in their spec.

The WHATWG has always opposed the W3C's spec. They see it as confusing to have two "official" specifications.


To put a slightly different spin on the same story as perspective always colours the telling:

W3C decided to deprecate HTML in favour of XHTML. Most of the web quickly moved to XHTML. One individual (an employee at Opera, then Mozilla, finally and currently Google) wrote an oddly influential opinion piece saying the move to XHTML had been somehow harmful and pushed for the major browser vendors to form a rival non-democratic standards body (WHATWG) to the W3C, which forked and completely redefined HTML.

The W3C, which unlike the WHATWG has many voting members from many backgrounds, not all related to browser making, quite understandably was never fully on board with the new WHATWG HTML spec efforts. However, with the level of adoption and support it received (mainly from being the creation of the powerful browser vendors) W3C were eventually pressured into conceding to advocate for HTML. Which they've done by maintaining a copy, rather than blindly directing people to the work by what for all intents and purposes effectively amounts to a rival organisation, and an extremely undemocratic one at that.

As web developers, we should follow the WHATWG and ignore the W3C, because the W3C have lost the political battle for HTML and we need to get our stuff working on browsers, all of whom follow WHATWG. But that's an unfortunately pragmatic approach that shouldn't amount to acceptance.


> Most of the web quickly moved to XHTML.

This simply is not true. The web moved to an XHTML-like dialect of HTML which was still served as text/html, and browsers interpreted it as "HTML soup". Actually serving pages as application/xhtml+xml would have broken the majority of the web, because browsers would validate them and refuse to display a page at all if there was even a single missing close tag.


> This simply is not true. The web moved to an XHTML-like dialect of HTML

You're thinking of XHTML 1.1 or XHTML 2. That "XHTML-dialect" that everyone switched to was called "XHTML 1.0", which allowed serving as either content type.

If you're choosing to nitpick about the fact that most sites published would not have worked if served as application/xhtml+xml, I'd invite you to do a survey of sites currently being served as valid HTML5. It's not even that easy to verify as the Nu validator version in use varies so much depending on where it's hosted (or if it's a local jar), and which iteration of the living standard it conforms to is always ambiguous. Have you tried reading the WHATWG spec diffs?

For devs who might like to adhere to any kind of strict automated verification of spec. conformance, that is now out of the question. With XHTML, even if you were serving non-well-formed XML with a text/html content-type, at least your markup could be trivially checked for conformance by almost any XML parser to see why it's not well-formed. It was actually conceivably viable to put that check into build steps or CI.
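As a rough illustration of the kind of check that could sit in a build step, here is a minimal sketch in Rust. It is a toy stand-in, not an XML parser: a real pipeline would hand the file to an actual XML parser. The function name `check_nesting` is hypothetical, and the sketch only verifies that tags are nested and closed, ignoring entities, CDATA, and `>` inside attribute values or comments.

```rust
/// Toy nesting check (hypothetical helper, not a real XML parser).
/// Returns Err with a short message at the first nesting problem found.
fn check_nesting(markup: &str) -> Result<(), String> {
    let mut open: Vec<String> = Vec::new(); // stack of currently open element names
    let mut rest = markup;

    while let Some(lt) = rest.find('<') {
        let after_lt = &rest[lt + 1..];
        let gt = after_lt
            .find('>')
            .ok_or_else(|| "unterminated tag".to_string())?;
        let tag = &after_lt[..gt];
        rest = &after_lt[gt + 1..];

        if tag.starts_with('!') || tag.starts_with('?') {
            continue; // DOCTYPE, comment, or XML declaration: skipped by this sketch
        }
        if tag.contains('<') {
            // A second '<' before the next '>' means an unescaped '<' in content.
            return Err("stray '<' in content".to_string());
        }
        if let Some(name) = tag.strip_prefix('/') {
            // A closing tag must match the most recently opened element.
            let name = name.trim();
            match open.pop() {
                Some(expected) if expected == name => {}
                Some(expected) => {
                    return Err(format!("expected </{}>, found </{}>", expected, name));
                }
                None => return Err(format!("unexpected </{}>", name)),
            }
        } else if !tag.ends_with('/') {
            // Opening tag: the element name is the text before the first whitespace.
            let name = tag.split_whitespace().next().unwrap_or("");
            if name.is_empty() {
                return Err("empty tag name".to_string());
            }
            open.push(name.to_string());
        }
        // Self-closing tags such as <br /> neither push nor pop.
    }

    match open.pop() {
        Some(name) => Err(format!("missing </{}>", name)),
        None => Ok(()),
    }
}
```

A CI job could call something like this on every generated page and fail the build on `Err`. With this sketch, `<ul><li>one</ul>` is rejected because `</ul>` arrives while `li` is still open.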

Serving application/xhtml+xml was a nice to have, but anyone believing that serving XHTML as text/html had no value completely missed the point. At least now, years later, the mess we're stuck with should make it a little easier to see though.


OK, so by lucideer's quirky definition of "XHTML", the vast majority of the web moved to XHTML. Based on the expansiveness of lucideer's definition, this appears to have encompassed web developers who probably weren't even aware they were writing "XHTML".

By the definition that most of us are using, which is that XHTML is compliant XHTML that could be rendered without error in browsers' XHTML modes, to a first approximation nobody ever did it. Even today XHTML-levels of precision in HTML requires an awful lot of API support and very careful usage; doing it ten years ago was above almost everybody's skill level.


> by lucideer's quirky definition

Which also happens to be the definition in the w3c xhtml 1.0 spec. You can choose to think that's quirky, but please don't attribute it to me.

> definition that most of us are using, which is that XHTML is compliant XHTML that could be rendered without error in browsers' XHTML modes

Which, again, is the definition used in the later w3c xhtml 1.1 & 2 specs, the former of which wasn't widely used, the latter of which was abandoned without being published at all.

If your issue with XHTML was that W3C were moving towards a direction you disagreed with, then you don't have an issue with the version of XHTML that was in popular use.

> XHTML-levels of precision in HTML requires an awful lot of API support and very careful usage

I'm not really sure where this view comes from. HTML validation is a lot more complex and difficult to achieve than XML well formedness, and HTML4/XHTML1 validation were both far simpler than modern HTML5 validation (the Nu validator is inordinately complex in comparison to the older DTD one). Furthermore, dev tools for ensuring XML well-formedness are far more readily available and integrated into most things even today, while HTML5 validation is such an obscure concept today I'm sure many devs don't even know it's a thing.


> Which also happens to be the definition in the w3c xhtml 1.0 spec. You can choose to think that's quirky, please don't attribute it to me.

Except that the number of people who actually implemented valid, well-formed, properly-served XHTML Strict -- of any version -- in compliance with all the relevant specifications is at best vanishingly tiny. XHTML Transitional was tag soup.

Your retort further up about many sites serving invalid HTML5 actually works against you, since HTML5 explicitly has a forgiving parsing model, while XHTML is explicitly "every error is a fatal error". If browsers had enforced the XHTML approach on every document using an XHTML DOCTYPE, we would have seen the death of XHTML much earlier.

This is why people say XHTML was never really adopted -- many people certainly put a "/>" to close their empty elements, and stuck an XML prolog and an XHTML DOCTYPE up at the top, but surveys like the infamous "XHTML 100" showed that next to nobody actually adopted XHTML in a manner compliant with the relevant standards.

And I say this as someone who, way back in the early 00's, was serving valid, well-formed XHTML as application/xhtml+xml. XHTML was a terrible approach, and the W3C process was dragging farther and farther from practicality at every revision (remember XHTML 2.0?).


You're talking about the ease of validation. Everyone else is talking about the ease of writing.


Oh, you mean like AMP, which now every major site supports, and which is even stricter than XHTML?


Turns out, when there's financial incentive to use strict syntax, people will... guess that's all XHTML lacked...


I mean... yeah? "there must be an actual benefit to do something that costs me development time = money". XHTML did not offer this.


> If you're choosing to nitpick about the fact that most sites published would not have worked if served as application/xhtml+xml, I'd invite you to do a survey of sites currently being served as valid HTML5.

Completely different thing. XML processing and all reasoning based on the premise of XML processing are fiction when XHTML is served as text/html. The HTML parsing algorithm and the rest of the processing requirements are not fiction when HTML is invalid.

(Why are we still talking about this in 2018. Sigh.)


Because someone asked and it explains some of the history quite nicely.

FWIW, maybe I'm the 1% but I wrote valid XHTML 1.0 for a while, but also soon gave up :P


I did server side browser sniffing to give IE the version it understood (IIRC it couldn't handle well-formed XHTML served with the proper mime tag, not sure, it's been a while :D) while everything else got proper fully compliant XHTML. I'm pretty sure I used a code snippet from Anne van Kesteren who is also posting here ;)
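For readers who never had to do this, the sniffing described above might have looked roughly like the following sketch. The function name `content_type_for` is hypothetical, and the assumption that any user agent containing "MSIE" must be sent text/html is a simplification made for illustration.

```rust
/// Picks the Content-Type for an XHTML 1.0 page from the request's User-Agent.
/// Assumption: a user agent containing "MSIE" is an Internet Explorer version
/// that cannot handle application/xhtml+xml, so it is sent text/html instead.
fn content_type_for(user_agent: &str) -> &'static str {
    if user_agent.contains("MSIE") {
        "text/html; charset=utf-8"
    } else {
        "application/xhtml+xml; charset=utf-8"
    }
}
```

User-agent matching of this kind is brittle, since other browsers could include "MSIE" in their user agent strings for compatibility. Checking whether the request's Accept header lists application/xhtml+xml would probably be a sturdier signal.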


And the versions of IE in use didn’t support application/xhtml+xml anyway so you would have to switch to text/html based on the user agent string.

It was never clear what the technical benefit of this was supposed to be. I only ever saw one site whose pages served double duty as an API and UI by serving styled XML. It seemed like a challenging approach to pull off well.


I wrote an XSL stylesheet that turned an HTML page into a pretty-printed and syntax highlighted display of its source code.


The shoe web site skechers.com used to do this. With the removal of XSL support from browsers, though, it looks like it's now using some form of JS templating.


The Gentoo website / handbook does this.

Or rather, did this a few years ago when I was last messing around with Gentoo. It seems to be HTML now.


The Handbook was interesting in that it was one of the few sites that actually went with the XML + XSLT = XHTML route. Of course, nobody knows XSLT, and everyone hates XML, so it was dumped in favor of MediaWiki, which everyone still hates, but at least now mostly understands how to use. (although the same people that insisted we use XML+XSLT also insisted we use SMW, which is even worse... I gave up then, but I hear they're trying to undo SMW now.)


Interesting!

What's SMW? I'm not familiar with the term and searching for it isn't being particularly helpful.


Semantic MediaWiki


Oh I remember this period of time and damn this is true. I think about 80% of the pages on the web during a certain time period had that tramp-stamp of XHTML Validated button somewhere on the page.


Only the cool sites :)


And you couldn't use target="_blank" in anchors...


I remember the fierce battle in my mind trying to judge whether I wanted a proper strict xhtml page, or I wanted external links to open in a separate window... This was literally what drove me away from strict xhtml. Everything else I was on board for at the time.


I'm not certain that it's true to say most of the web quickly moved to XHTML. Sure, a number of sites advertised themselves as XHTML, but they were not strictly XHTML compliant. This could be due to third party widgets or other included code, or it could be due to a mistake in template construction. Whatever the issue, fully compliant XHTML wasn't used much in practice outside of hand-crafted pages.

Also Internet Explorer, for example, never implemented XHTML which would have been a deal breaker for many sites.


> they were not strictly XHTML compliant

The vast majority were not strictly XHTML compliant, but whatever the figure was, I'd imagine it wasn't too different to all the many "strictly HTML compliant" sites now (compliant according to which commit?).

The point was they used XHTML, which means they could trivially choose to validate and test their XML well-formedness with built-in tools everyone had ready access to. Exposing your end-users to those conformance checks (i.e. the in-browser strict XML-parser) wasn't the only "value" offered.


> The point was they used XHTML, which means they could trivially choose to validate and test their XML well-formedness with built-in tools everyone had ready access to.

I'm honestly trying to figure out whether this is satire or not.


That seems to ignore the conventional wisdom that according-to-Hoyle XHTML was a DOA standard because it mandated error handling in ways that no browser implemented and most of the authoring community didn't want. Authors don't write well-formed XML, even today.


> it mandated error handling

XHTML 1.0 didn't mandate so-called "draconian error-handling", it just offered it as an optional feature.

XHTML 1.1 (which was released but no one used) and XHTML 2 (which was never finished nor released) did mandate it. I wasn't a big fan of that decision, I don't think it would've worked, but XHTML 1.1 was still very usable while ignoring that one requirement; throwing out the baby with the bath water was a massive overreaction on the WHATWG's part.


XHTML, other than Transitional, which nobody should count as "implementing XHTML", is an XML application. It inherits XML's parsing model. Every error is a fatal error.


>because it mandated error handling in ways that no browser implemented

IIRC Opera implemented XHTML error handling.


IIRC several major browsers implemented XHTML error handling, but only for documents with a Content-Type: application/xhtml+xml header, which was basically nothing because that would then trip up other browsers


Opera had the "draconical" approach, where upon the error you just had that, an error. Firefox, iirc, had a softer approach where you still got the page rendered, but you'd get the error reported too. Anyway, it all depended on the proper MIME type for the XHTML (as it should). However, the whole MIME type and everything associated with it (some elements and APIs are treated differently) is a whole barrel of worms, so XHTML in any of its incarnations was never a good idea.


> "draconical"

That's xml error handling, following rules as written.


"Draconian" error handling is a term of art in HTML.


> not all related to browser making, quite understandably was never fully on board with the new WHATWG HTML spec efforts.

That's the other thing that pisses me off about the WHATWG, is how much they shit all over XML and other interoperability technologies. E.g. their URL standard (because, why not fuck the IETF as well) basically ignores anything non-HTTP for specious reasons.


The only reason there is a URL standard in the WHATWG is that the IETF URL RFC didn't define error handling and this led to interop problems. So there was a need for a URL standard that _would_ define error handling. The IETF refused to produce one (basically said "fuck you, we don't care about your use cases or interop problems" with slightly more polite wording), so the WHATWG ended up doing it...

I'm not saying this is a great situation. I'm not saying the WHATWG couldn't try to do better at considering non-HTTP or non-browser use cases here. But the representatives of those use cases in the IETF told browsers to take a hike. And then browsers did.


The first line of the WHATWG URL spec says that it deprecates and replaces all IETF URL standards.


For its target audience (browsers and web pages) it does.


A number of folks involved in WHATWG work bought into the XML vision initially, but reality has a strong text/html bias and we've been able to adjust views as experience has accumulated.

See https://annevankesteren.nl/2011/02/xml-tired

(Personally, the first time I managed to get funding to work on a Web engine was to make Gecko's XHTML-as-XML support better. At the time, I thought it was so important that I sought funding to get it done...)


What do you mean by non-HTTP? It handles URLs whose scheme is not http(s): just fine...


I forget the specifics, but there are several incompatibilities between the IETF URI and IRI spec and the WHATWG URL spec (see [1], EDIT: as I'm sure you're well aware, given your username). The WHATWG spec amounts to "what four popular web browsers do", explicitly without considering compatibility with the hundreds (thousands?) of other non-web-browser tools that make use of URIs.

What you've defined are effectively not URLs. Very similar, but different. If you wanted to call them "WHATWGRLs" or something I wouldn't care. But they're not URLs, and the WHATWG is choosing to muddy the waters rather than, say, specify an optional legacy compatibility layer on top of the IETF spec. It's one thing to say "in addition to IETF URIs, browsers should also accept these malformed URIs, but should not accept these valid but problematic URIs"… it's quite another to say "URIs aren't that anymore, now they're this".

[1] https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/


I'm not sure I get the distinction. As for curl, it doesn't follow any standard which seems worse, but does at least helpfully demonstrate that the RFCs cannot be implemented by major clients.


> Most of the web quickly moved to XHTML

No, it didn't.

At best a large share of new, greenfield development moved to XHTML, but I'm not convinced it was a majority of even that.


> Most of the web quickly moved to XHTML

Ridiculous and absurd. A small proportion of the web moved to invalid XHTML that rendered as tag soup because it was sent as text/html. Virtually no websites actually served XHTML as XHTML, because: 1. there were and still are no compelling technical benefits of XHTML, 2. it broke Internet Explorer, 3. most webmasters were and still are incompetent and have no clue what Content-Type is.


>W3C decided to deprecate HTML in favour of XHTML. Most of the web quickly moved to XHTML.

In some parallel universes, yes.

Even if so, there's also the fact that XHTML wasn't updated itself with features people needed.


What features?

If you're referring to "features" in the HTML5 spec., like canvas, webgl, geolocation, DOM etc. they were separate specs, which WHATWG lumped into one monolith (though they're mainly JavaScript APIs, and aren't directly related to HTML). They were being worked on separately to XHTML, and still work fine with XHTML to this day.


>they were separate specs, which WHATWG lumped into one monolith

For which I could not care less. Whether there's a big spec for HTML5+JS APIs, or 20 different specs, is a bureaucratic concern, not a concern to the developers or the end users.

W3C might have had them "neatly" separated, but it also hadn't moved them a notch towards completion and release for more than a decade.

I've used and worked for the web before W3C, in its heyday, in its long decline days when we were waiting a decade+ for some progress, and after it became irrelevant. Now it's a way better situation.


> Most of the web quickly moved to XHTML.

Most of the web didn't move to XHTML.

A lot of people who were interested in being standards compliant moved to XHTML 1.0 Transitional, which was the HTML compatibility subset, but they only ever served it and validated it as HTML, not XHTML, because if you served it as XHTML, one single stray < that someone had forgotten to quote somewhere would break the parsing of the whole page.

The piece written by Hixie was influential because it was a wake up call that the direction the standards bodies were going in was pretty much fruitless, and that there could be a much better way to do it which wouldn't involve breaking compatibility with all of the existing content and would give web developers and users features that they actually wanted.

> As web developers, we should follow the WHATWG and ignore the W3C, because the W3C have lost the political battle for HTML and we need to get our stuff working on browsers, all of whom follow WHATWG. But that's an unfortunately pragmatic approach that shouldn't amount to acceptance.

I fail to see how there is anything unfortunate about this. What about rewriting everything in XHTML 2.0 (https://www.w3.org/TR/2010/NOTE-xhtml2-20101216/), and having to be extremely conscious of any possible stray < that could sneak in to a page without being quoted, would have been preferable to:

1. Consistent parsing support for existing content, and content that might have slight problems like stray <, in all browsers

2. Standardization of things that people actually use to build web apps, like XMLHttpRequest and Canvas

3. Consistent handling of encodings between browsers, including encoding sniffing

4. Consistent handling of quirks mode vs. standards mode between browsers

5. Actually having browsers support compatibility with vendor-prefixed versions of features, because some widely used browsers introduced prefixed features that web developers actually started relying upon

And also, have you ever tried getting involved with the WHATWG process? I have, and I find that they are very receptive to intelligent discussion of issues.

What doesn't work well is to insist that you have a problem and that this particular solution must be used to address the problem; because a lot of times, it's easy to come up with some proposed solution but it then turns out that it's either a lot more complex in practice, your proposal doesn't fit in with the rest of the ecosystem well, or the problem can actually be solved just in tooling on top of HTML without having to change the spec at all and then wait for multiple browser vendors to all independently implement it.


> and having to be extremely conscious of any possible stray < that could sneak in to a page without being quoted, would have been preferable to:

Any system that publishes content that would let this kind of thing pass is incredibly insecure, and shouldn't be on the internet. Today it's a stray <. Tomorrow it's a stray <script>

It's no wonder software is where it is today with attitudes like these.


Not if that < had slipped in because it was in a piece of static text in a string somewhere in the source code.

You can apply mandatory quoting to untrusted input all you want, but there are going to be times when you have trusted strings that can still contain stray characters that will make the resulting markup invalid. And in many cases you don't want to have mandatory quoting for all of that, because these strings may have markup you want to include.

And yeah, you can argue that instead of generating content by appending strings, you should be building up a proper type-safe DOM structure that can be serialized. I'll wait while you go boil the ocean of converting every single web application framework that exists now outside of a couple of obscure type-safe functional programming frameworks, and in the meantime I'll be able to browse the real web without every other page giving me validation errors.


To be fair, I only use obscure type-safe functional programming frameworks. That's what I'm employed to do, and this obviously impacts my feelings on the matter. Personally, I think it's irresponsible to use anything that could be this unsafe. This doesn't mean everyone needs to use FP, just that frameworks and libraries should be chosen so as to guarantee safety. There are easy-to-use libraries for all these things in every language.

In no other world of engineering is this attitude okay. If you were a civil engineer and had to hold a license to practice due to the danger your designs could present to society, this attitude would eventually cause you to lose your ability to practice. It's becoming more and more clear that software can have similar levels of impact, and software engineers should practice as such.


I agree with you that we do need to do better about writing more robust software, and type safe languages are a good way to do that.

But what you're saying is like suggesting that, since the metric system is more consistent and more widely used than the English system, I as a bolt distributor should start selling my bolts in metric sizes, despite the fact that the nuts that everyone has are in English sizes.

The browser vendors, at least, are working on implementing their browsers in more type-safe languages (https://github.com/servo/servo), but even still they have to work with the content that is produced by thousands of different languages, frameworks, and tools, and millions of hand written HTML files, templates, and the like. Just turning on strict XML parsing doesn't make that go away, it just makes your browser fail on most websites.


A good first step in enforcing web standards would be if browsers would detect these rule violations, and -- instead of failing -- put a giant banner on the top of the page warning end users that the site may be compromised and could compromise data.

Soon, every business will be clamoring to fix their buggy software, and users will still be able to access the unsafe websites they so desire.


You can be unsafe even with typesafe builders. See

fn build(text_to_show: &str) -> HTML { HTML(Body(H1(text_to_show))) }

What if text_to_show wasn't sanitized? You've got yourself an XSS. And if you do sanitize it (and keep it in a StrSanitized type), what are the chances of accidental XSS?

Really, what should have been done is a "user supplied tag", which automatically displays everything as plain text, like <user-supplied id="ahdjdh37736xhdhd"> Content </user-supplied id="ahdjdh37736xhdhd">


You would generally want the general purpose string type in your language to always be escaped when serializing, and only allow avoiding that if you opt-in explicitly.

So, for instance you'd have an H1::new(contents: TextNode) constructor, and you'd have to build a TextNode; if you build TextNode::new(text: &str), then it would escape it. If you wanted to explicitly pass in raw HTML, then you'd need something like HTMLFragment::from_str(&str), and it would parse and return the fully parsed and appropriately typed fragment object that could then be used to build larger fragments.

There might be some way to unsafely opt out, like HTMLFragment::from_str_raw(&str), that would just give a node that when traversed would just be dumped raw into the output, but that would be warned against and only used if you wanted to avoid the cost of parsing and re-serializing some large, known-safe fragment; it wouldn't be what you would normally use.
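A minimal sketch of the escape-by-default part of this idea, assuming the `TextNode` and `H1` names from the description above. The `serialize` methods and the exact set of escaped characters are assumptions made for illustration, and the parsing `HTMLFragment::from_str` path is left out because it would need a real HTML parser.

```rust
/// Plain text that is always escaped when it is serialized into markup.
struct TextNode(String);

impl TextNode {
    fn new(text: &str) -> Self {
        TextNode(text.to_string())
    }

    /// Escapes the characters that are significant in HTML text and attributes.
    fn serialize(&self) -> String {
        let mut out = String::with_capacity(self.0.len());
        for c in self.0.chars() {
            match c {
                '&' => out.push_str("&amp;"),
                '<' => out.push_str("&lt;"),
                '>' => out.push_str("&gt;"),
                '"' => out.push_str("&quot;"),
                '\'' => out.push_str("&#39;"),
                _ => out.push(c),
            }
        }
        out
    }
}

/// An <h1> element whose only child is a text node.
struct H1(TextNode);

impl H1 {
    /// Takes a TextNode rather than a &str, so raw strings cannot bypass escaping.
    fn new(contents: TextNode) -> Self {
        H1(contents)
    }

    fn serialize(&self) -> String {
        format!("<h1>{}</h1>", self.0.serialize())
    }
}
```

Because `H1::new` only accepts a `TextNode`, a caller has no way to hand it an unescaped string by accident. Passing `TextNode::new("<script>alert(1)</script>")` serializes the angle brackets as `&lt;` and `&gt;` instead of producing a script element.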


Your builder isn't really using types to guarantee safety. You can write untyped programs in a strongly typed language, by just coercing everything to strings, but this isn't what I mean when I say 'type-safety'.


> The WHATWG has always opposed the W3C's spec. They see it as confusing to have two "official" specifications.

The WHATWG has not always opposed the W3C's spec. The WHATWG explicitly agreed to work with the W3C to form an edited, snapshot spec based on the WHATWG spec. That's what HTML5 was supposed to be.

However, the W3C process then hijacked this, by dropping things from the WHATWG spec, adding things back that had been removed because they had never been implemented properly and implementing them wouldn't have been very useful, and so on. The WHATWG objected to this useless divergence.


> W3C process then hijacked this

While I agree the W3C's insistence on maintaining a parallel spec. is a silly idea they should absolutely abandon, I fail to see how any but the most biased perspective could conclude that they are "hijacking" a process of their own. W3C haven't dropped anything from the WHATWG spec.: that's separate and out of their control. They can drop what they like from their copy, it's their copy. Unless you're proposing that the WHATWG should be running the W3C, I'm not sure what you're getting at with the term "hijack". Surely you can't hijack your own thing?


Because there's no point in reconciling the specs if you don't actually reconcile them.

If the W3C spec is a snapshot, possibly of a subset, possibly with some editorial but not functional changes, then reconciling the specs is useful; it gives you what the W3C wants to provide: versioned, frozen specifications that can be used as the basis for other specs, for people to claim "full conformance" with a particular version, and so on.

Or, if the W3C process identifies real issues, then it should work with the WHATWG community to resolve those issues; since the WHATWG spec is being used as the upstream, evolving spec that these snapshots are being made from, it makes the most sense to get the changes into the upstream first, so you don't have to resolve the issues every time or maintain divergence forever.

However, the W3C instead just insisted on writing the spec the way it wanted, without regards to whether it would actually be implemented.

It makes no sense to publish a spec that will never be implemented by any of the projects that actually have real-world implementations, and differs from the spec that the implementers actually use. That just causes confusion.

So yes, they hijacked the process in the sense that the WHATWG agreed to work together with the W3C, but the W3C never really worked in good faith to resolve differences or provide technical arguments for their changes.


> there's no point in reconciling the specs if you don't actually reconcile them

completely agree

> yes, they hijacked the process in the sense that the WHATWG agreed to work together with the W3C, but the W3C never really worked in good faith to resolve differences or provide technical arguments for their changes.

I can't see either party working in good faith. In what way did the WHATWG's "agreement to work together" bear out in terms of a positive contribution to W3C's parallel spec.? We can both agree that spec isn't a great idea, but I'm failing to see how the WHATWG is a positive actor here in any way; they've forced W3C into an impossible position through political bullying and somehow W3C are vilified for hijacking something?


Ian Hickson, the editor of the WHATWG spec at the time, acted as editor of the HTML5 standard at the W3C for a while after the two groups agreed to work together. However, the committee had chairs who could override the editor.

Despite a couple of years of effort working together, the W3C process allowed a lot of people to raise objections that re-litigated a lot of things that had already been decided in the WHATWG process, or just didn't have implementer support, or whatnot. This led to the HTML5 draft specification being stale, as these objections held up migrating the editor's draft (which was the WHATWG specification) to the TR on the W3C site.

So lots of people who still saw the W3C as the "official" source of HTML were brought to an out-dated copy of the standard, because publishing more interim drafts was held up with all kinds of bureaucracy; and the W3C objected to linking to the WHATWG copy to suggest a more up to date version with bug fixes, so there was a fight over this.

The combination of the W3C's heavyweight process making it easy for lots of people to raise objections to slow down the process, and having the ability for those objections to be escalated above the editor, eventually made Hixie give up on editing HTML5 and just go back to editing the living specification.

The thing is, a specification only makes sense if it's actually implemented. Lots of non-implementers raising blocking issues on wishlist features, and then having to take the time to formally resolve all of those issues, does not make for a productive environment; and when the resolutions of those issues are escalated to chairs of the group or higher up in the W3C against the support of the implementers, it really hampers the process of coming up with a productive spec.

By the way, I haven't followed this drama in a few years, but taking a look at what's happening now, it looks like the W3C is essentially just plagiarizing the work of the WHATWG.

Features are generally discussed in the WHATWG, or implemented by browsers and then proposed, and the spec writing goes on there. After the spec is reasonably well worked out, the W3C is copying and editing some of the text into their standard.

Now, the WHATWG spec is under a Creative Commons attribution license, and the W3C does provide a small attribution in the acknowledgements section, so they are not violating copyright.

However, what they are doing essentially amounts to plagiarism as they are presenting themselves as the source for the standard. The introduction to the standard doesn't indicate that the actual work is going on in another group; they invite people to make comments on the W3C's GitHub. This is confusing, it gives people an out-of-date view of the standard, and it seems to be a move to make the W3C seem to still be the relevant authority when it's basically just cloning the standard from the WHATWG, but with enough wording and formatting differences that it could conflict and is hard to tell when it would.

Alternatively, they could fork the standard but do so more in the way that distributions package software: take what's from upstream, and have a separate set of patches that they apply on top that make it clear what the differences are. For instance, those patches might apply their layout, their disclaimers and the like, possibly disable some things that they think are underdeveloped or contentious and likely to change, and otherwise mostly just freeze the text. They could push any patches that they thought were for meaningful differences that they've fixed to the upstream project. They could properly attribute the WHATWG spec as the original source at the very top of the article, and list the editors of the WHATWG spec as the primary editors and the people doing the W3C release as maintainers of that particular fork.

But instead, they are listing as editors people who are basically just doing light paraphrases of the WHATWG spec.


> By the way, I haven't followed this drama in a few years, but taking a look at what's happening now, it looks like the W3C is essentially just plagiarizing the work of the WHATWG.

Ditto, and it's why if anyone asks about HTML and spec. conformance, I don't even mention the W3C, except to dissuade them from paying any attention to them. Their current HTML work is irrelevant and misguided.

My issue here is more with the historical negationism around the relationship between the organisations. The W3C's current HTML is, frankly, wrong-headed. But the context around their current situation is the fact that they've been bullied, cajoled and even somewhat ridiculed reputationally into these quite irrational actions by the WHATWG's very existence. That fact is lost when they're accused of acting negatively toward the WHATWG (e.g. hi-jacking apparent agreements and processes), when the actual background was WHATWG originally hi-jacking the specification of the web's central language.

Your post here is supporting the idea that the W3C's current direction on HTML is irrational. That's fine, I agree. But what they're doing is no worse than what WHATWG did originally with HTML5; the only differentiator is that WHATWG was extremely powerful (being primary implementors) and could use that power to win hearts and minds of pragmatic developers. The W3C have no such power and as such their wrong-headed actions are fruitless. But the equivalence is still worth pointing out.


For anyone confused about how the WHATWG came to write a new HTML spec entirely from scratch, after the W3C blocked the work happening at the W3C, this is a good place to start: http://diveintohtml5.info/past.html#webapps-cdf

I'm going to assume your point that the WHATWG is "extremely powerful" compared to the W3C was meant to be satirical.


> The WHATWG has always opposed the W3C's spec. They see it as confusing to have two "official" specifications.

As the old joke goes... if it hurts, they should stop doing that!

The WHATWG spec is worse than useless to me as a developer. It's impossible to tell what is usable and what is just Google's wishlist (which is about half of it). The MDN has entirely replaced it for me, since they at least do a good job of documenting reality.

WHATWG should quit trying to bully the W3C out of the field, and instead clearly mark the WHATWG "spec" as what it is: a public notepad for browser developers. Leave the business of documenting what browsers actually conform to to the W3C.


> The WHATWG spec is worse than useless to me as a developer. It's impossible to tell what is usable and what is just Google's wishlist (which is about half of it). The MDN has entirely replaced it for me, since they at least do a good job of documenting reality.

The WHATWG living standard is largely where browser vendors (and other interested parties) work out what the web will be. W3C (with their implementation requirement), and, as you note, MDN serve to describe what the Web is. The latter is more useful to developers, but, as you suggest, MDN is doing a better job of it.

OTOH, to get to a place where things have interoperable implementations, a forum for implementors to collaborate on forward-looking specifications is necessary, and that’s what WHATWG does well, and W3C does not (which is why WHATWG exists.)


I agree with everything you said. What rubs me the wrong way about the WHATWG is that they give the perception (and it may be just that) that they are trying not just to serve as that forum for browser makers, but also as the standard reference for web developers (which is the role the W3C HTML specs, save XHTML 2, have historically served), and doing a poor job of the latter.


I don't think the WHATWG is trying to serve as the reference for web developers (though their HTML spec has notable and laudable features for that use); I think they are mostly fine with W3C trying to do that as long as they do it correctly (which requires alignment with what browsers do, otherwise developers will target a non-existent platform.)

I don't know if they (or developers, MDN is probably a more widely used reference than W3C) see a standards body as essential in that role, though, and I don't think it seems W3C really wants to accept being relegated to that role rather than driving the web platform, even though they haven't driven the platform for a long time.


Exactly what is the purpose of a standard reference for web developers that fails to track the documented behavior of browsers?


I'm not sure of the intent of your comment; failure to track what browsers actually do is exactly the problem with the WHATWG "living standard" – it's very much a forward-looking spec at best, and too often a wishlist.


The ideal situation is for the WHATWG document to be a roadmap of what vendors have discussed and tentatively agreed on, and the W3C document to be a periodic snapshot of what's actually been implemented.

That wouldn't make either one of them "bad". The issue here seems to be W3C wanting to push forward things that the vendors haven't agreed on or implemented yet.


I don't think the W3C DOM document has anything the developers haven't agreed to. The problem is that it's an incomplete, intrinsically out of date, and often buggy subset of the WHATWG living standard.

I agree that the W3C value proposition COULD be to publish a snapshot that describes what's actually implemented. That might be a way forward here, but it requires a lot of work to define what "actually implemented" means in a useful way, and to check the test results and update the document (or build an automated way to harvest resources such as https://wpt.fyi/dom ).


Great, so if I build a website based purely on the WHATWG specs, it will work in all browsers, correctly?

No, it won't.

I can take the A4 paper spec and build a printer that takes that paper, and I know paper will comply with it. And the other way around.

You can't build a website just from WHATWG specs, and you can't, excluding the parts about backwards compatible parsing, easily build a new browser from scratch either.

A standard is an a-priori written document that describes the entire API surface, so that people on both sides can develop based on the standard without having to verify with actual implementations.

The WHATWG documents are useless for this purpose.


But how are the W3C specs any better than the WHATWG specs?

As far as I know the W3C take the WHATWG specs, and modify them with some of their own ideas so they're different from what the browsers implement or are planning to implement.

What on earth is the point of that? Why design your own spec that nobody is implementing or planning to implement? What a waste of time!

And back to your point - why is it better than the WHATWG specs?


> Leave the business of documenting what browsers actually conform to to the W3C.

Documenting the prevailing conditions is very much not the purpose of a standard.


That's what the W3C has historically done, with HTML 2.0, 3.2, and 4.0. "Document, clean up, and nudge" is maybe a better description. The WHATWG today seems to take more of a "document, don't clean up, and add our wishlist" approach. (The "don't clean up" mentality is embodied in their "don't break the web" ethos; the "add our wishlist" mentality is a consequence of the "living standard" ethos… the "standard" never becomes reality because it is constantly changing.)


I'm not sure where you got this impression, but it's wrong. https://whatwg.org/working-mode stipulates the requirements on additions. That's quite a bit different from a set of wishes.

And there's a lot of cleanup of legacy APIs happening too. E.g., removal of the isindex tag and deprecation of AppCache.


That is the governing philosophy of the WHATWG, sadly.


That's a bit of a stretch. This is only relevant to legacy APIs and only when all implementations are in agreement, which is quite the rarity.


I think that's the rub - if w3c wants to "document" what a browser conforms to - I imagine it will be a copy-paste from what the browser vendors are doing in their separate meetings of the minds.


Urm, I thought MDN was based on WHATWG? (Not W3C)


MDN includes clear documentation about what is actually implemented in all major browsers (the compatibility tables), so I (as someone who wants my code to work everywhere now, not next year) can tell at a glance what pie-in-the-sky ideas I should ignore.

That's great that it's based on the WHATWG's work – it should be, since it should document what's in Firefox, and Firefox presumably is following their own work with the WHATWG. But the WHATWG shouldn't pretend that they're useful to me in any other way than a preview of what's coming down the pipeline. For that, I need clear documentation of what is, not what will be. W3C HTML specs prior to HTML 5 (with the exception of the abortion that was XHTML 2) have historically served that purpose well. It was easy to make the judgement that, once my target market primarily supported HTML 4, I could use anything in that spec. The WHATWG "spec" throws that idea out the window.

Ideally, with a "living standard", periodically there are snapshots of some form that document what all or most major browsers supported as of some point in time. So I as a developer can say, "well I know most of my target market have updated their browsers since date X, so I can just use anything in this standard snapshot". The W3C I think is trying to do this. They might not be doing a very good job (indeed that is the crux of the WHATWG's objections); like I said, I personally rely on MDN to fill this same role for me. But the WHATWG living standard itself cannot fill this role, short of including MDN-style compatibility tables, or making their own snapshots that are somehow "better" than what the W3C puts out.


FWIW, the HTML Standard (not the DOM Standard) does include CanIUse information in a sidebar, to help with this. I'd like to include this into other WHATWG standards, but it hasn't really happened yet. I'd expect most web developers to use MDN and StackOverflow though, as you say.


I appreciate the attempt to include compatibility tables, but they're nowhere near detailed enough for serious usage. Take the canvas element as an example. The WHATWG spec has one "CanIUse" sidebar for basically each section, if that. But compatibility issues exist at the level of individual methods. E.g. .filter and .resetTransform() both have very low cross-platform support ([1] and [2]), which I can tell at a glance from MDN, both in the sidebar listing them, and the compatibility tables on each page. Whereas the WHATWG spec doesn't even mention that these are experimental ([3] and [4]), and the CanIUse sidebar is totally absent for them.

StackOverflow is not a reference, and the answers for even popular queries are sometimes a decade out of date.

[1] https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...

[2] https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...

[3] https://html.spec.whatwg.org/dev/canvas.html#dom-context-2d-...

[4] https://html.spec.whatwg.org/dev/canvas.html#transformations



> Differences between the W3C HTML 5.2 spec and the WHATWG Living Spec: https://www.w3.org/wiki/HTML/W3C-WHATWG-Differences

FWIW, I'm pretty certain that is incomplete. It may well be the case that that is the set of deliberate changes from the WHATWG spec (at some revision), but we've had cases before where changes from the WHATWG spec have been copied only partially, with the result that the W3C spec, as published as a Recommendation (i.e., with two interoperable implementations), has been impossible to implement as written.


http://diveinto.html5doctor.com/past.html

A very entertaining read, IMO.


The W3C was the original standardization organization for the web.

They wanted to create standards that allow easy implementation by others, and were willing to make some tradeoffs with backwards compatibility for that (see XHTML).

The browser vendors obviously oppose this, and want standards that just formalize what they already implement. As a result, the browser vendors created their own standards committee, which standardizes whatever the browsers already do (if existing). This is the WHATWG.

As a result, the web standards situation has gone to insanity. The WHATWG URL spec contains 4 pages of pseudocode and algorithm definitions for how Chrome parses URLs, and how you should as well, and the W3C, still being relied on by the other actors on the web that aren’t the 4 largest browsers, has to copy the WHATWG spec as the base for their own specs, because browsers will ignore whatever the W3C says anyway.

But remember that the WHATWG proposed to the W3C that the W3C should copy the WHATWG specs as the base for their own specs: https://en.wikipedia.org/wiki/WHATWG#cite_note-9


To offer another perspective.

I'd say the W3C wasted a huge amount of time pursuing quests of purity (XHTML) over actually making the web better for users. I see the value in what they were trying to do, but it wasn't letting people do the things they wanted to do on the web.

As browsers started just implementing features outside of standards in completely disparate ways because everyone was desperate for them (leading to plugin hell, apps rather than websites, etc...), WHATWG was created to try and ensure that the web remained a single thing and not a mess of things that would only work in one browser.

The web platform sprang forward massively as a result of this, with browsers implementing much more consistently and with new features tending to be implemented in compatible ways, with real progress being made.

This led to the W3C specs becoming totally redundant and the only way for W3C to keep up was to lamely try to copy from WHATWG into a "spec" at random intervals and claim it was something people could work towards, when in reality it offers no real advantages over working to the living standard, because no browser offers better coverage of that spec than any other random point of the living spec.

People want features. We saw what happens with a very slow moving, rigid standard: plugins. Flash was popular because at the time you simply couldn't do good video, animation, games on the web platform. Likewise, mobile phones shifted to apps because websites couldn't do notifications or use location information. You can bemoan a living spec, but you get one anyway, because people will work around the web if they can't do what they want. If you want to use a subset of that living spec, use it, but at least keeping it together and agreeing on roughly how to do these things is better than plugins or abandoning the web entirely.


> This led to the W3C specs becoming totally redundant and the only way for W3C to keep up was to lamely try to copy from WHATWG into a "spec" at random intervals and claim it was something people could work towards

So why don't they just disband the W3C? It sounds like it's not needed any more if WHATWG are doing the work?


The W3C actually does do some good work in other working groups; the CSS working groups seem to be working smoothly.

The W3C also oversees a lot of other standardization processes that aren't directly related to web browsers, like RDF. There are people who find this useful.

I think a lot of it is a power play. The W3C wants to be relevant, and the most relevant things in the web world are HTML, DOM, and CSS (there's also ECMAScript, but that already has a different standards body that owns it).

There are a lot of other standards that use HTML, CSS, and the DOM, such as ePub. The W3C wants to be the normative reference for these core web standards; many times, one standard will have to refer to the other, so the W3C wants to be the one that defines the "official" HTML standard.

But the W3C's process and policies are just terrible. They let people take over standards who have no intent on working with those most impacted by the standards, the people who develop the browsers that billions of people use daily to access tons of diverse content. So instead of just providing a lightly edited snapshot, possibly with some WIP features removed, of what the WHATWG produces, they start going in and meddling and making changes with insufficient justification so you have two forked standards providing a lot of confusion for everyone.


> The W3C actually does do some good work in other working groups; the CSS working groups seem to be working smoothly.

The W3C is also doing work in HTML at least too. If browser makers would actually participate as editors (like they do in CSS) then it would work just as well as the CSS working groups and others.


> If browser makers would actually participate as editors

Microsoft tried that, investing in easier to use GitHub tooling to allow a wide range of people to submit pull requests to update/fix bugs in the W3C HTML standard. "If you build the field of dreams, they will come...." Nope. "They" had all gone to WHATWG ballpark, and all the W3C editors do is cherrypick (that's the actual word in the HTML 5.2 Recommendation) WHATWG's specs. It made a LOT more sense to just join WHATWG for HTML (and DOM).

> > The W3C actually does do some good work in other working groups

Right, W3C as a whole does a lot of good work. CSS is a good example, Web Payments, Web Authentication, Web Assembly come to mind as groups where a broad group really does come together and build consensus on how to solve hard problems. The HTML and DOM communities, however, have moved to WHATWG for reasons that happened long ago and apparently can't be un-done, even if a company with Microsoft's resources tries.


>Microsoft tried that, investing in easier to use GitHub tooling to allow a wide range of people to submit pull requests to update/fix bugs in the W3C HTML standard. "If you build the field of dreams, they will come...." Nope. "They" had all gone to WHATWG ballpark, and all the W3C editors do is cherrypick (that's the actual word in the HTML 5.2 Recommendation) WHATWG's specs. It made a LOT more sense to just join WHATWG for HTML (and DOM).

It won't work if only one browser maker participates. If only Microsoft were participating and implementing things in the CSS WG, then nothing would really get done over there either.

If all the browser makers would have editors in the w3c html spec (like they do in many other w3c specs) and agree to implement stuff there, then that would also work.


How would one convince the others to re-invest in W3C HTML and DOM? Microsoft's rationale a few years ago was that WHATWG wasn't a real standards organization with a patent policy, dispute resolution system, etc., and that created various legal and business concerns.

It turned out to be much easier to add a legal framework to WHATWG than to convince the HTML and DOM standards community to move back to W3C. Basically, people work on specs (and code) together in the places where there is a critical mass of expertise and energy being productively engaged. The key variable is the people, not the organization.

I don't understand the dynamics of how these critical masses of expertise coalesce, break up, and move around. I have learned that it's much more efficient to go with the flow than try to redirect it.


What good work is it doing in HTML?

In HTML, as far as I can tell, it appears to be copying features from the WHATWG standard, paraphrasing them, and including them in their standard, with only a small notice on the acknowledgements page that the HTML standard contains parts derived from the WHATWG standard.

The browser makers did participate. They participated in the W3C working groups up until they were shot down when trying to propose to work on features that users actually wanted and would be backwards compatible rather than backwards-incompatible XHTML 2.0.

The browser makers then proceeded to do their work on rich web applications, with features like canvas and XMLHttpRequest, as well as actually putting together a spec for how to consistently parse HTML that would be compatible with real content, in the WHATWG.

When it was clear that the WHATWG standard was the one that actually mattered because it was what was actually implemented, the W3C invited them back in to start working on the standard together. That's what HTML5 was; the W3C agreed that they would start from the WHATWG standard, that they could have the same editor (Ian Hickson), and they wound down the XHTML 2.0 group.

However, various people involved in the W3C process proceeded to use bureaucratic moves to raise formal objections to things that had been changed, and escalated the issues above the editor. Eventually, he got fed up and left the process, and most of the browser vendors proceeded to continue working through the WHATWG. Microsoft was the last holdout, but eventually they too left the W3C process and moved over to the WHATWG as well.

So, the browser vendors have tried to work directly with the W3C on the HTML spec twice, once before the WHATWG split off and once as part of the attempted reconciliation. Both times, they were stymied by other people involved in the process who were more interested in purity and process than actually providing a forum for working out a good specification for real world implementation.


>What good work is it doing in HTML?

From my experience, it has generally done a better job of explaining things developers would want to know, especially in terms of accessibility and internationalisation.

The XHTML stuff was a long time back, and at that time, it warranted having a WHATWG. Now the W3C is no longer insisting on XHTML (and hasn't for many years).

>When it was clear that the WHATWG standard was the one that actually mattered because it was what was actually implemented, the W3C invited them back in to start working on the standard together. That's what HTML5 was; the W3C agreed that they would start from the WHATWG standard, that they could have the same editor (Ian Hickson), and they wound down the XHTML 2.0 group.

This is the crux of it. The WHATWG is really useful for browser vendors because they can do essentially whatever they want in it without anyone having the power to formally object to it (unlike the W3C). Now the WHATWG editors (and thus browser makers) can say that they will listen to community feedback, but that's pretty much a benign dictatorship over the most important spec of the web.


I mean, reading that GitHub thread, it seems to me that everyone involved is saying that's exactly what should happen, and the people invested in the W3C are trying to force through new work just to justify the organisation's continued existence (and presumably, if I were being cynical, their paycheck).

The bone of "you could set specs based on snapshots of the living standard" was thrown to them, but the reality is no one cares enough about it to actually do that well, so it's just being done in a bad way that will hurt everyone.

That thread reads, to me, as "we tried being nice, but now you are causing problems, just stop please".


W3C do a huge amount more than HTML and DOM. All of CSS is coordinated through there for example.


> I'd say the W3C wasted a huge amount of time pursuing quests of purity (XHTML) over actually making the web better for users. I see the value in what they were trying to do, but it wasn't letting people do the things they wanted to do on the web.

Okay, let’s make a deal:

We both write a crawler that can fully reliably parse websites.

I implement XHTML1.1, you implement HTML5. We both get 1 month time.

What do you think is going to happen?

XHTML was a worthy goal – with it, we wouldn’t have a need to run headless Chrome for tests. We could parse the web, and actually use the data. OpenGraph tags would never have been necessary. We wouldn’t need to throw DNNs at rendered output of a browser just to parse data.

HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.
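To make the parsing argument above concrete, here is a minimal sketch using only Python's standard library. It assumes a strict XML parser as a stand-in for an XHTML consumer and `html.parser` as a stand-in for a lenient one; `html.parser` is only a tokenizer, not a conforming HTML5 tree builder, and the sample markup and helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample documents, not taken from any real site.
WELL_FORMED = "<html><body><p>Hello</p></body></html>"
TAG_SOUP = "<html><body><p>Hello<br><b>world</p></body></html>"


def parse_strict(markup):
    """Parse as XML; return the root tag, or None if not well-formed."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        return None


class TagCollector(HTMLParser):
    """Lenient tokenizer: records start tags and never rejects input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_lenient(markup):
    """Return the start tags seen, in order, without validating nesting."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parse_strict(WELL_FORMED))
    print(parse_strict(TAG_SOUP))
    print(parse_lenient(TAG_SOUP))
```

The strict path is a few lines because malformed input is simply an error; the lenient path accepts the tag soup but says nothing about what tree a browser would build from it, which is the part the HTML5 parsing algorithm has to spell out.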


And if we were all using XHTML1.1 now instead, yes, your parser would be easier, except all the richer content would be in flash, and all the web apps would be desktop applications, and you wouldn't be able to parse that at all, even with a full browser.

You are acting like everyone would just stop and wait for you to make your dream implementation that's ideal - that's not how the world works.

WHATWG was an admission that we can't stop it, so we might as well embrace it. Embracing it has resulted in browsers being far more consistent, and new features being a shared part of the web platform, and not siloed off in plugins and other platforms. That's why W3C is irrelevant now.


You’re assuming XHTML1.1 would never have evolved further, never have gotten more content.

And Flash, despite its flaws, would have been a much better starting point for rich content than the ecosystem we have today.

Many of the features Flash provided are only available in browsers today through babel.js transpilation. As a result, we’re stuck with a language without stdlib and broken syntax.

We’re stuck with a document model that’s impossible to work with or parse, and with impossible layout management.

If you want web applications, it’d make much more sense to port the Android layout XML format to the web than to attempt to use HTML5 for it, because HTML5 is insanity for building applications.

> and all the web apps would be desktop applications

I don’t see that as anything bad.

The web is for documents, and lightly interactive content. All the rich applications on the web are opaque to any crawler I could write anyway, as I just get "You need JS to view this React app". Desktop applications would be just as parseable, except they would also be less of a resource hog.


The idea that standards groups exist to push for top-down reconcepting of how the world should work is a common one, and is also a good reason why standards groups fail. Your idea of what the best outcome is won't be the same as every other stakeholder's, and no one stakeholder will have exactly the same idea of the best outcome as the market will.

Ultimately, the market will win, no matter what your standards group says.

The point of standards is interoperability, not rationalization. When standards groups try to rationalize technologies real people work with, they cease to provide value, and instead become obstacles that real engineers end up laboriously working around.


When I was a child in elementary school, a standards committee decided to change my native language.

They replaced the spelling of most words, and many grammatical rules.

We were forced to obey these changes; any use of the old rules was counted as a mistake in school.

Back then, many older books were still using the old rules.

By the time I left high school, almost no books with the old rules were left. All had been reprinted. All newspapers had switched. Autocorrect programs had been updated with the new rules as well.

In a matter of 8 years, an entire language had changed its orthography and parts of its grammar, top-down, and it worked out fine.

I’m sorry, if an entire human language with 120 million speakers can be updated top-down like that, a web spec can as well.


Apples and oranges - when you introduce a government mandate, you remove the market. @tptacek's argument is clearly about market forces in publicly defined standards, not government enforced ones.

I'm pretty sure the last thing anyone really wants is what we'd wind up with if web standards were left to government dictates...


There’s no need for a government to enforce standards on the web – there’s already an oligopoly that can do it on their own.

In fact, there’s a single company that can just outright dictate web standards, because they hold almost 70% of the browser market: Google.


What language is this? This sounds fascinating.


My example was the implementation of the German orthography and language reforms between 1996 and 2006 (I went to elementary school in 2002, when implementing it was still in progress, and most stuff was still using the old spelling, I left high school in 2014).

But French has their Academy, which has even more power over language, and afaik, Spain supposedly has similar governing bodies.


Meh- the French Academy has power over prescriptive grammar sure, but it has almost no power over what descriptive linguistics finds. Lots of Arabic and Verlan has made its way into everyday parlance.

I'm not really sure that the French Academy is really that much more effective than Strunk & White is for English speakers. It primarily seems to be ceremonial / an expression of French pride.


If you want a more tech-related example, Jobs said "no flash on iPhones", and in a few years... poof! no flash.


The counter to that is pretty obvious, because if you remember Jobs also said "no native apps on iPhones", and then in a few months... poof! an app store.

Flash was pretty much dead anyway, and the web platform had advanced enough to mostly replace it at that point. That wasn't true for native apps.

If you want to make a standard, it has to let people do the things they want to do. Otherwise, people will just use a different (or no) standard.


I'm not really getting into the standards thing here--just throwing some ammo to the underdog.

My only point is that there are only a handful of companies with the cash, the talent, and the inclination to tackle these things, and most of them are near if not total monopolies, so as long as what they put out there isn't a blatant kick-in-the-nuts, most of us will just accept it.

iPhone was a compelling product, didn't have flash, everyone migrated to Javascript ASAP. Google is practically a monopoly, and when webmaster tools tells people to jump, watch everyone piss away a weekend to add microformats and shave 5% off of a few 40k images.

Serfs. We are all serfs.


That sounds horrible.


It was amazing. The new orthography is much simpler, and has far fewer insane rules or exceptions. And most people that have seen the transition, but were born after it, or went to school during it, agree.

I know it can work on this scale, I’ve seen it IRL. Many languages do stuff like this, German has the council of German language, and French has their Academy.

You can do the same on the web. You just need to have all vendors working together to actually do it.


The idea that a bunch of standards group officials can decide for the world that web pages are simply lightweight content publishing mechanisms and that real applications should be built exclusively in Flash and that that worldview can be ratified and mandated by browser vendors does not seem amazing to me.

At any rate: the Internet is a market system, not a top-down autocracy.


The alternative (and current reality) is that the same things are decided by about four companies in an entirely opaque manner.

At least the W3C had processes and a wide array of members.


Isn't that just theater? None of them can tell Apple and Google what to put in their browsers; in fact, if they can't convince just one of the big 4 browser vendors to do something, their standards have no meaning at all.


It's even more work than that-- check out caniuse for SVG fonts:

https://caniuse.com/#feat=svg-fonts

They had support in both Safari and Chrome, but never in FF or IE (nor Edge). Chrome eventually dropped the support.

So I'd say if you can't get all four to implement the feature then you might as well call that part of your spec a "living standard." Those features are going to get way fewer eyeballs, fewer bugfixes, fewer reviews, fewer pieces of documentation, etc.


Uh, WHATWG is an open process - they have a similar level of control over things that W3C had.

If you want to try and claim W3C ever had the power to enforce people following their specs, IE6 would like to have a word.


The Internet is a network. The web is an oligopoly. Google, Google-by-proxy, and Apple fill the dog bowl, and the rest of us eat from it because it is there.


If you‘re talking about German, it was not amazing, but a cultural catastrophe, and an extra-legal totalitarian nightmare.


I assume you’re older than 22? There’s pretty much a strict split at around that age. People older seem to consistently hate it, people younger seem to consistently like it, because the new rules are much simpler.

Previously, Gruß and Kuß had no info about how long to pronounce the u – Gruß and Kuss do. And until 2017, capitalizing them into GRUSS and KUSS lost this information, now GRUẞ and KUSS keep it.

Previously, for many words, the rules for when to split the word, when to write it together, and when to use a hyphen were insanity. Now it’s all in a few easy rules.

And you have to remember, this wasn’t the first time German went through such changes – ever since the advent of the printing press, when a written German language was basically "invented" from the many dialects that existed, until today, there have been proponents of a prescriptive language evolution, and they’ve had lots of influence over time.

When you use Tarnen, Verfasser, or Absender, Abstand, Bücherei, Augenblick, Leidenschaft, Entwurf or Briefwechsel, Rechtschreibung or Tagebuch, Grundlage, Altertum, Erdgeschoss, tatsächlich or Hochschule, all these words were defined top-down. (All these words are just from Philipp von Zesen, Christian Wolff, and Joachim Heinrich Campe)

A massive amount of what we consider "German" today was defined and changed top-down, and without these changes, German wouldn’t be recognizable.


You‘re misinformed.

Yes, the German language has had several big changes, but until the reform we‘re talking about it was linguistically „proper“ in that the existing language was described and codified. It was bottom up.

In this reform some non-elected people (who just a few years earlier had said themselves that their job wasn't to invent German, but to describe existing use and trends) invented a whole new orthography from scratch. The new rules have never been in use anywhere throughout the German-speaking lands.

They were and are pure fiction.

In linguistics that's how you tell a layperson: they think linguistics is prescriptive. Now it seems to be... :-(

And of course people under 22 don‘t care. They have never learned proper German.


You mean, just like in many other languages? According to Wikipedia, French, Icelandic, Spanish, Swedish, and a few more have had varying degrees of prescriptive language standardization.

> Yes, the German language has had several big changes, but until the reform we‘re talking about it was linguistically „proper“ in that the existing language was described and codified. It was bottom up.

I just explained why that wasn’t the case. Many linguists in the past have intentionally invented words (see the ones I mentioned) to make the language simpler, and stricter.

And the same continued until today – the drug store chain Rossmann has been a constant supporter of linguistic prescriptivism, has sponsored groups supporting it, and has been using these concepts in all their published material as well. Many other companies engaged in this as well.

The language has never been defined by the people speaking it, but always by the journalists writing it, the linguists describing it, and the companies influencing it.

And German as a whole was created, as pure fiction, by people trying to publish books across the whole of Germany at a time when everyone spoke local dialects.

At no time has German ever been a bottom-up language – and if we already let our language be influenced and shaped by companies, by media – why not at least use similar influence to make it simpler?

Having a language be simple to use is more important than some fake emotional value of being "natural".


You simply don't understand what I have written. I think we can leave it here.

I don't care about your opinion that it's "fake" and "emotional".

Language is a core part of my being, and a fascist power-grab killing my mother tongue is simply a crime against humanity. It's no different from how the Turks have been treating the Kurdish language.

I have only weak hope, but still hope, that we can someday reverse this. Violently or non-violently.


Do you believe languages are meant to live forever?


But it worked.


I am not sure if this is still the case today, but I remember that not too long after the new orthography / grammar rules were passed, two major news publishers announced that they would return to the old rules.

Also, my sister is a linguist, and I can trigger her going on a long rant just by mentioning the Rechtschreibreform. ;-)

(Personally, I think some of the new orthography rules are much simpler and consistent, so I use them. The rest I basically ignore, unless a spellchecker nags me about it.)


The HTML5 person's crawler will parse some significant fraction of real websites, and the XHTML one won't, because people write HTML5 and not XHTML, even if you as a tool vendor would greatly prefer otherwise.


And yet, forcing people to implement opengraph tags, forcing people to drop flash, forcing people to use HTTPS, forcing people to drop Symantec certs, forcing people to drop SHA1 certs – so often the actors behind the WHATWG have managed to get website authors to change what they use.

Hell, Google has AMP, which is far more intrusive than XHTML ever was, and yet, they’ve managed to get every major website to implement it. https://www.ampproject.org/docs/troubleshooting/validation_e...

And yet, somehow, implementing some stricter spec is supposedly impossible?


There's a big difference between "every major website" and "the Web".

"Every major website" means 100 companies with skilled developers who can and will react to changes in browsers quickly.

"The web" consists of millions of websites maintained by individuals and small organisations who have no resources to update the way their web pages are coded every year. It contains HTML generated 10, 20, soon 30 years ago. It contains that one app in your intranet with the table layout that you can't replace and that IoT thing you connected to your home Wi-Fi 7 years ago that has no way of upgrading its web interface.

A browser that loses access to "the web" is worse than useless.


There's also a federal procurement picture: big governments making big purchases aren't fans of incompatibilities; they want standardised solutions.

For a company like MS losing access to "the web" could keep a lot of people from becoming VPs...


IMHO XHTML is pretty painful. Even if you say: okay, there is a server-side auto-generated markup tree and we can formally verify what happens. There is now a solution to that and it's called (server-side) React which is basically a (useful) alternative to XSLT. Except that it outputs HTML5.

Even if you argue that the specs are huge. Just compare book sizes about XML, XSLT vs HTML5 and CSS, JS, React. When I actually tried to do some useful work with XSLT (which needs to be mentioned here IMHO), I realized that the - less painful - 2.0 version is hardly implemented by anyone.

Regarding the parsing: X(HT)ML lexing is ridiculously easy, for HTML5 it's slightly more difficult but not tough at all. You just need to keep a list of closing vs self-closing tags. Not talking about building in fault-tolerance, that would be tough, yes, even tougher for XML!
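To make the "list of closing vs self-closing tags" point concrete, here is a rough Python sketch of such a lexer. It is only an illustration under heavy assumptions: the `lex` function and the regex are made up for this example, the void-element list is my reading of the HTML standard, and it ignores comments, `<script>` content, and all of the error recovery that the real tokenizer specifies.

```python
import re

# Void elements (no closing tag). Assumed list, taken from my reading
# of the HTML standard; double-check it before relying on it.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# Simplified tag pattern: ignores comments, CDATA, <script> content,
# and attribute values that contain '>'.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def lex(markup):
    """Yield (kind, value) tokens: 'open', 'close', 'void', or 'text'."""
    pos = 0
    for match in TAG_RE.finditer(markup):
        if match.start() > pos:
            yield ("text", markup[pos:match.start()])
        closing, name, rest = match.groups()
        name = name.lower()
        if closing:
            yield ("close", name)
        elif name in VOID_ELEMENTS or rest.rstrip().endswith("/"):
            # Either a known void element or XML-style <tag/> syntax.
            yield ("void", name)
        else:
            yield ("open", name)
        pos = match.end()
    if pos < len(markup):
        yield ("text", markup[pos:])


tokens = list(lex("<p>Hello<br>world <img src='a.png'/></p>"))
```

Here `<br>` is classified as void purely because of the lookup table, which is the only extra piece compared with an XML-style lexer. Fault tolerance is deliberately left out.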

> XHTML was a worthy goal – with it, we wouldn’t have a need to run headless

> Chrome for tests. We could parse the web, and actually use the data.

> OpenGraph tags would never have been necessary. We wouldn’t need to throw

> DNNs at rendered output of a browser just to parse data.

Yes and no. If you use CSS for styling, the answer is no. If you require JS to show the initial show/page, the answer is no as well. But yeah, if you use the whole XML machinery with XSLT and possibly even XPath, then you would be kind of right. I mean, as long as we properly handle the schemas and DTDs, which almost no parser does AFAIK. So it's true, one can do pretty bad-ass stuff with all the X*. But tooling and library support is not good and never has been. XSLT 1.x is insanely difficult to use and XSLT 2.x is hard to fully implement, I guess.


> OpenGraph tags would never have been necessary.

Small correction: XHTML2 had the Metainformation Attributes Module [1]. That then became RDFa in (X)HTML, practically the same syntax and processing model.

Facebook's Open Graph stuff is claimed to be RDFa. When I tested it, their parser did not really do RDFa processing: other CURIE prefixes for the same URI weren't recognized, if I remember correctly.

But in effect Open Graph meta information would have looked the same in XHTML2 as in today's WHATWG HTML.
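On the CURIE point: in RDFa the prefix is just a local alias, and what counts is the URI it expands to. A minimal Python sketch of that expansion step follows. The `expand_curie` helper and the `opengraph` prefix are hypothetical names for illustration; `http://ogp.me/ns#` is the Open Graph namespace.

```python
def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' to a full URI using the declared prefix map."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"undeclared prefix in {curie!r}")
    return prefixes[prefix] + reference


# Two documents that bind different prefixes to the same namespace URI.
doc_a = {"og": "http://ogp.me/ns#"}
doc_b = {"opengraph": "http://ogp.me/ns#"}

# A conforming processor should treat these two properties as identical.
same = expand_curie("og:title", doc_a) == expand_curie("opengraph:title", doc_b)
```

A consumer that only string-matches on the literal `og:` prefix would miss the second document, which is the behaviour described above.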

For your other argument I agree. WHATWG (and the dumb modern style of development) reduced the democratising aspect of the web. But of course the people of WHATWG work for billion dollar companies, which want to have a moat to centralize behind.

[1] https://www.w3.org/TR/xhtml2/mod-metaAttributes.html


The HTML5 crawler would be able to crawl the web, the XHTML crawler would be able to crawl 0.0001% of the web.

XHTML would only improve the web if all existing HTML went away or was changed to XHTML. Since this is never going to happen, XHTML does not simplify anything.


HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.

I don't understand who the 'everybody else' is in this case and what and when their horrible losing will be.


People trying to build new tools that parse the web.

Try building a crawler without reusing an existing browser engine.

Try running unit tests against your own web projects with Selenium without running a headless browser.

PhantomJS gave up because they couldn’t keep up with the complexity, and Chrome headless "just works".

Opera gave up on their own browser engine because of the complexity of parsing HTML5 accurately, when the spec is just "whatever Chrome does".

We’ve thrown away an entire ecosystem, just for more flashy graphics.


One of the most valuable aspects of HTML5 is that it defines a parsing model for "broken" HTML.

This means that, for the first time, it's possible to build a brand new HTML parser that has a high chance of working against all existing HTML without needing to first reverse-engineer existing browsers.

Remember, when HTML5 was first designed Internet Explorer was by far the most widely used browser. And IE was closed source. If you wanted to build a parser you needed to first reverse engineer IE and figure out how it handles invalid HTML.

The HTML5 spec fixed that. The thing you are complaining about here (HTML5 making it harder to build a new browser from scratch) is one of the things HTML5 actually solved!


I very much doubt Opera gave up because of "parsing HTML5". Parsing HTML5 is well-defined, certainly better than what was there before (unless you are willing to say "if you don't write perfect XML you don't get to be on the web", but good luck with that – no browser ever was and never will be in the position to do that)

The vast majority of the difficulty of the web platform is in the layers above, which don't care if the DOM they are looking at came from HTML5 or XHTML: layouting/rendering, interactions of JS and DOM, ...


Indeed, having worked for Opera during the time where we implemented HTML5-compliant parsing, the net result was that we fixed a bunch of site compatibility issues. Implementing it made competing with other browsers easier. (And as you say, parsing HTML is complex, but the rest of the web platform is vastly more so.)


That's a pretty tiny 'everybody else' compared to users, web and browser developers. They seem to be doing mostly ok and I still don't follow your argument that their concerns should somehow reign supreme over those of, you know, the actual everybody else.


It’s a tiny "everybody else" because it never was given a chance to develop.

Maybe you’d also say that the number of people that want to do their online banking with a desktop program that isn’t provided by their bank is a tiny "everybody else".

Yet in places where OpenHBCI exists, many people use it – and there is an ecosystem around it, e.g. KMyMoney can integrate with it, and there’s a small widget for KDE to show your current account balance in your tray.

If we had a machine parseable web, a similar ecosystem would have developed. When embedding links today, we use OpenGraph tags. But why? If the web was directly machine readable, we could’ve directly embedded that content.

Google search shows you as preview excerpts of tables on a page, with an easily machine readable web, similar stuff could’ve happened here.

Maybe instead of only embedding YouTube videos, I would have been able to easily embed any part of any web page into any other. Maybe I would have been able to easily embed any part of any web page in a desktop application. Maybe I would have been able to build addons that easily search over content of web pages, in a structured way.


This would be a more compelling counterfactual if the approach that won wasn't in many senses the most successful technology in the history of the computer industry.


It seems like the web is 10 times more parseable. I feel like almost all content is now delivered via JSON services, and the HTML around it is constructed on the fly and is totally pointless to look at. At least that's the direction I've seen. I don't know how "everyone else" are people that are writing parsers. Seems like that would be a tiny percentage of the population, not "everyone else".


> I implement XHTML1.1, you implement HTML5. We both get 1 month time.

Do you get to use libraries or not?

If you get to use libraries, there are plenty of libraries for both of these tasks, so it will be fairly equivalent. In fact, if you want to really support XHTML well, it will probably be more complex, because you have to take into account namespaces; a tag or attribute is not just a simple string, you have to consider the namespace it's in, so you would have to deal with that.
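To illustrate the namespace point with Python's standard library: after parsing, an element's name is a (namespace, local name) pair, not a bare string. The document below is a contrived example (not valid XHTML, just well-formed XML) chosen so that two elements share the local name `title`.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Contrived document: two <title> elements in different namespaces.
DOC = (
    f'<html xmlns="{XHTML_NS}" xmlns:s="{SVG_NS}">'
    "<body><title>page</title><s:title>drawing</s:title></body>"
    "</html>"
)

root = ET.fromstring(DOC)

# ElementTree spells qualified names as "{namespace}localname".
tags = [el.tag for el in root.iter()]

svg_title = root.find(f".//{{{SVG_NS}}}title")
bare_title = root.find(".//title")  # no namespace given, so no match
```

Searching for the bare string `title` finds nothing here, because every element lives in some namespace. A correct XHTML consumer has to carry that distinction everywhere.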

If you have to write it from scratch, I'd actually also bet on it being quicker for HTML5. The exact algorithm is specified in the standard; you just have to translate that from pseudocode into whatever language you're working in.

In XHTML, you have to look through several standards (XHTML 1.0, XHTML Modularization, XML 1.0 which includes the DTD, XML Namespaces, XML Schema), and then translate from the specification in a declarative style into an algorithm that can actually be used to parse the document. XML parsing is actually quite complicated.

> XHTML was a worthy goal – with it, we wouldn’t have a need to run headless Chrome for tests. We could parse the web, and actually use the data.

What? XHTML wouldn't have replaced the use of JavaScript to add features and load content. It wouldn't have replaced people's inappropriate use of tags. It wouldn't have replaced the fact that web pages are written for human consumption, and so don't generally try to include appropriate annotations on data for processing by other tools.

> HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.

Imagining that some magical standard is going to make everything better for users is a pipe dream. Providing good data formats and APIs for exporting data is hard, and is a different problem than providing user interfaces and interactivity. There's not some way in which XHTML could have been extended to serve both purposes; they are just too different.

This is why, when people care about providing programmatic access to data, they generally provide two endpoints; one serving HTML, CSS, and JavaScript for humans to interact with, and one for providing JSON for machines to parse. There have been endless attempts to try and make one standard that would work for both purposes, and they've failed because that's just not a good approach for solving the problem; instead, it's better to just have an API that both the UI code (whether on the client or server side) can use, and other developers can use if you want to expose it.
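A tiny Python sketch of that "one API, two representations" shape, with no web framework involved. Everything here (`get_article`, `render_json`, `render_html`, the sample data) is hypothetical and only meant to show the structure, not any particular site's design.

```python
import html
import json


def get_article(article_id):
    """Hypothetical data-access function shared by both endpoints."""
    return {"id": article_id, "title": "Hello & welcome", "tags": ["web", "html"]}


def render_json(article):
    """Machine-facing representation."""
    return json.dumps(article, sort_keys=True)


def render_html(article):
    """Human-facing representation built from the same data."""
    tags = "".join(f"<li>{html.escape(t)}</li>" for t in article["tags"])
    return f"<h1>{html.escape(article['title'])}</h1><ul>{tags}</ul>"


article = get_article(1)
json_body = render_json(article)
html_body = render_html(article)
```

The UI and third-party tools both go through `get_article`, so neither has to scrape the other's output.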


In the long run, we're all dead. We need short term solutions.


No, it wasn't simply a question of backwards compatibility. The W3C wanted (wants?) to pursue its own quixotic vision of a semantic, machine-readable web. One in which the needs of human users, browser user agents and use cases like web apps were of incidental importance at best.

The HTML5 spec effort that gave birth to WHATWG wasn't simply about documenting backwards compatible parsing, it was about vendors like Mozilla and Opera wanting to evolve HTML in a way that added actually useful new features, something that W3C had zero interest in at the time.

Nowadays, the W3C's behaviour seems to be driven entirely by an institutional desire to justify its own existence, and protect its revenue stream and its self-assumed position as the one-true source of web standards, by engaging in bad-faith practices like taking standards produced through the hard work of others, removing all citations, making breaking changes, and publishing it as a competing "standard".


Yes, and today we’re left with a web that you can only parse if you’ve got a few thousand developers and billions of dollars to throw at the issue.

As I wrote below, I offer $100 to anyone that can write a tool that can fully parse and render HTML5, the entire spec, and can run a real-life React app, within 4 weeks, without using any existing library or code for the parsing or rendering.

Doing the same for XHTML and a strict scripting language is easily possible in that time.


That was the situation prior to the writing of the HTML5 spec, not a situation it created. The majority of the web was not, and never would have been parsable as strict XHTML, whatever the desire of the W3C or anybody else. And even for new content, the idea that every hand-authored file would be well-formed, or that every half-baked and buggy CMS would always produce correct markup was a pipe-dream.

If strict parsing had been enforced, we'd have a web where a large proportion of the sites you visited each day would be broken. Or rather, we wouldn't, because complaints from users would have long ago forced browser vendors to implement graceful error handling, and so you'd end up with something akin to HTML5 anyway. Indeed, it was encountering precisely this problem that made vendors like Mozilla and Opera (who, incidentally, never had billions of dollars) lose faith in XHTML in the first place, not some masochistic desire to keep their browsers' code as complicated as possible.


It would be awesome if the layers of the browser would allow for a more parseable, document-based web. That would be an effort towards standardization.

But are you really advocating for removing dynamic media from the browser? That seems like an incredible step backwards in most regards. In the absence of a desktop toolkit to rule them all, browser standardization is what we are left with right?


> In the absence of a desktop toolkit to rule them all

Qt. Runs everywhere, works everywhere, just fine.


If only some standards group somewhere could mandate that everyone use Qt.


Ah, this gets into GPL vs BSD though, doesn't it? I'm not sure what to think about dual licensing. I have a fondness for Qt, but I also like Tcl/Tk so... it's hard for me not to think of that xkcd about standards. https://xkcd.com/927/


React probably uses HTML5 APIs but could probably be re-written to do the same thing without them... I'm not sure anyone here is interested in $100 as they probably wasted at least that much "company" money reading this thread this happy Friday.


The $100 is for someone writing a fully-working parser that handles real-world HTML5 pages, without reusing any existing implementations.

Building an entire new parser is (obviously) much easier for strict languages (e.g. JSON) than for lenient languages (e.g. HTML5).

Which is the problem I have with HTML5, JS, and many similar technologies – they’re so lenient that almost everything is broken instead. We might as well write websites in English prose, it wouldn’t be much harder to parse.


I think you're downplaying the shit show that was XHTML.

It was the epitome of standards people chasing a perfect platonic ideal of a standard at the cost of actual usability. E.g. a single parse error in an XHTML document and it doesn't render at all. That all by itself was a deal breaker for many people.
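For anyone who hasn't seen the failure mode, here is a small Python sketch using only the standard library. It assumes `xml.etree.ElementTree` as a stand-in for a strict XHTML parser and `html.parser` as a stand-in for a lenient one. Note that `html.parser` is tolerant but does not implement the HTML5 tree-construction algorithm, so this shows the strict/lenient contrast, not spec-conformant recovery.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical hand-written markup: unclosed <p> and <b> elements.
BROKEN = "<html><body><p>First<p>Second <b>bold</body></html>"


class TextCollector(HTMLParser):
    """Collects text content and ignores tag structure entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strict_parse_ok(markup):
    """True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_text(markup):
    """Extract whatever text a tolerant parser can find."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


strict_result = strict_parse_ok(BROKEN)
recovered = lenient_text(BROKEN)
```

The strict parser gives you nothing at all for `BROKEN`, while the lenient one still hands back the text. That is the trade-off the whole thread is arguing about.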


That seems reasonable to me. Would you want your programming language to compile code that has a syntax error?


Yes, it seems reasonable, and the analogy with programming languages seems to make sense. But there's a big difference. With programming languages, you write a program, and when it's correct, you check it in and it's immutable until you check in a new version. Web pages are composited on-the-fly by programs that combine static files, database content, content from third parties, etc. The only way you guarantee a valid output document is if your compositor is bug free and defensively validates any third-party content you might be pulling in. Any bug in the compositor and your website is totally unavailable for users that hit the bug.
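A rough sketch of what "defensively validates any third-party content" could look like in Python. `safe_fragment` is a hypothetical helper. It assumes the fragment is meant to be well-formed XML, and it falls back to escaping the whole thing so that one bad comment cannot make the page ill-formed. HTML-only entities such as `&nbsp;` would also be rejected by this check.

```python
import html
import xml.etree.ElementTree as ET


def safe_fragment(fragment):
    """Return the fragment if it is well-formed XML, else an escaped copy."""
    try:
        # Wrap in a dummy root so fragments with several top-level nodes parse.
        # A fragment that tries to close the wrapper early produces a second
        # root element, which the XML parser also rejects.
        ET.fromstring(f"<div>{fragment}</div>")
        return fragment
    except ET.ParseError:
        return html.escape(fragment)


good = safe_fragment("<em>hi</em> there")
bad = safe_fragment("<em>hi")
```

Every path that injects content has to go through something like this, and the point above is that missing it in a single place takes the whole page down.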

This is a great illustration of the concept: https://web.archive.org/web/20060613193727/http://diveintoma...


Thanks for the link. I can certainly see how making the transition now is pretty much untenable, but I'm not 100% sold that it wouldn't be a good idea if everyone had adopted the policy from the start. I can certainly see how it would still be important for browsers to have a mode where they do their best to render the page, but it's less clear that it should be the default. Even if it was the default, then I would expect variations in different browsers ability to recover from errors would lead developers to being much more careful about allowing errors to creep in.


What makes that story (in my link) so compelling to me is that it happened to people who were XHTML advocates. They were the people arguing that browsers should be strict. They were the people writing blogging software The Way It Should Be Done, to ensure maximum XHTML compliance. It was a blog entry specifically arguing for strict parsing that became invalid XHTML due to a bug.

If true believers can make this kind of mistake, how often will it happen to people who are just trying to get work done? People will have bugs in their code sometimes, but what should the failure mode be?


Yes, but if browsers were strict by default (or at least by default in dev mode), the bug in his compositing software likely would have been found and fixed much earlier, and it wouldn't have persisted into production.

So again, I understand that it's nearly impossible to make the transition now, but that doesn't mean it wouldn't be preferable. And there's no reason the transition couldn't still be made, but much more slowly, and perhaps without ever making the big browsers reject by default.

I don't know if there are other disadvantages to XHTML, but if the strictness issue is the only one, then it seems like there would still be value in slowly transitioning over.


Web pages are information held inside containers, not code.

The analogy is a classic example of the "Everything must act like a compiler" fallacy.

There are many situations in which compilation is hopelessly inappropriate as a user model.

Not only is the web one of them, but the hypothetical semantic web is also one of them.

You can't force semantics into compilable tokens. The suggestion that you can - and should - is nonsensical.


I’ll remind you that the AMP spec has an extremely strict validation requirement: https://www.ampproject.org/docs/troubleshooting/validation_e...


Would you want any and all systems to have identical failure modes?


That’s great if you’re either an existing browser vendor, using a browser, or developing a broken website.

But if you actually try to write a new browser from scratch, or a tool to scrape websites, you’ll learn to love XHTML, and hate HTML5.

What applies to human languages applies here as well. Writing a vocaloid for Japanese is a high school programming class project. Writing a TTS for English takes thousands of Google developers years. Writing an XHTML1.1 parser and renderer takes a month. Writing an HTML5 browser takes thousands of developers years.

The WHATWG specs prevent the web from ever evolving further – we’re stuck with opaque websites and no way to build new technology on top of it. Building a crawler is a task of years, and semantic data tags are impossible to parse, because no one uses them right.

The only reason we can automate any parsing of websites is because either browser vendors spent billions on crawlers for their search engines, or if we run a full browser, or if we use the opengraph tags that Facebook forced on websites.

With XHTML1.1, Chrome and Firefox headless would never have been necessary. Imagine how much time and computational cost you could save.


"That’s great if you’re either an existing browser vendor, using a browser, or developing a broken website." - and that's the whole point, in the marketplace the convenience of these people matters much, much, much more than the convenience of those people who want to "try to write a new browser from scratch, or a tool to scrape websites."

If you want to scrape websites or show them in a browser, then you have to follow the needs of makers of these sites, because you need them and they don't need you. If you want to go to the right and they want to go to the left, you either follow them or go alone and become unable to scrape or browse their content. If there's a feature that they want to use that makes your parsing more complicated, tough luck, that feature is going in as long as somebody (e.g. major browser vendors) will agree to make it work.


The huge underlying assumption here is that people would use XHTML as a standard instead of using HTML4.01 with browser-specific extensions, which is what actually happened. XHTML didn't add any value to page authors. It added hugely indirect value to readers. The only people XHTML helped were tooling authors and browser vendors. It's very hard to market that as a value-add.


But the web already contains billions of pages which do not conform to XHTML. Any browser or scraper would still have to be able to parse those to be of any use. Adding XHTML would just be an additional parser frontend, it would not simplify anything.


That’s not true – the WHATWG has already deprecated some HTML specs, and older HTML pages already break today.

XHTML would have worked the same way – after a few years, you can deprecate the old parsers.


As far as I know, WHATWG have deprecated some elements not widely used (like "isindex", "font"), but the documents using these elements will still be readable even if not exactly as the author intended. Moving to XHTML, on the other hand, would make billions of pages totally inaccessible.

There are people who are now dead who have pages on the internet. These pages will never be updated.


font still works. isindex does not. It's extremely rare for a cross-browser HTML element to get removed but isindex is one of those.


blink also doesn’t work anymore, and neither does marquee. Several of the frame attributes are broken as well. noscript doesn’t always work reliably depending on browser.


Marquee works for me in Gecko and Blink. Didn't test the other two engines.


You keep repeating over and over that the big advantage to XHTML is "not having to run headless browsers", which I don't understand. I use Selenium all the time for my job, it's not great but it's... not horrible? It's fine, it's not some massive inconvenience, and it's definitely not worth getting rid of HTML5 to abandon it, when, as others have pointed out and you keep ignoring, the practical effect would be that the web would be useless for actual humans without plug-ins.

I guess I don't get your huge bone to pick with Selenium.


> I use Selenium all the time for my job, it's not great but it's... not horrible? It's fine, it's not some massive inconvenience,

Try running hundreds of tests at the same time, to actually get fast results when running a full test suite.

Right now running a small test suite takes over 2 hours here, and 99% of that time is spent in the browser processes.

And that’s not nearly close to 100% test coverage.

> not worth getting rid of HTML5 to abandon it

You don’t have to – XHTML isn’t the only strict spec out there, AMP is another, React also enforces strict syntax in JSX templates. AMP fails to render anything if there’s even a single mistake, React fails at build time.


Parsing isn't the difficult part and above the parser both XHTML and HTML involve the same complexity.

You can get an HTML parser off the shelf. A browser vendor (Mozilla) even funded one for non-browser purposes before adopting it for Firefox.


> Writing a vocaloid for Japanese is a high school programming class project. Writing a TTS for English takes thousands of Google developers years.

It didn't take "thousands of Google developers years" to teach a computer English spelling rules. Indeed, even a programming class assignment could do that: you can cover most cases by just looking the pronunciations up in a dictionary. (The number of quirks and inconsistencies makes English spelling quite hard for humans to memorize, but computers are rather good at lookup tables.)
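The lookup-table approach really is about this small. In the Python sketch below, the three entries are illustrative ARPAbet-style transcriptions typed in by hand, and `to_phonemes` is a made-up helper. A real system would load a full pronouncing dictionary and need a fallback for unknown words.

```python
# Toy pronunciation table in ARPAbet-style notation. The entries are
# illustrative; a real system would load a full pronouncing dictionary.
PRONUNCIATIONS = {
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
    "tough": ["T", "AH", "F"],
}


def to_phonemes(sentence):
    """Look up each word; unknown words are returned as None."""
    return [PRONUNCIATIONS.get(word) for word in sentence.lower().split()]


phonemes = to_phonemes("Tough though")
```

Three spellings ending in "ough", three unrelated pronunciations, and the computer does not care because it never tries to apply a rule.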

Even if you consider the actual Vocaloid software, which was developed by a team at a large corporation, there are two factors differentiating it from English TTS that make the latter much harder:

1. Japanese has much simpler phonetics than English, with a smaller set of phonemes and (somewhat oversimplifying) only using open syllables. So it's easier to consume and produce, for both computers and humans, but at the cost of being a less efficient encoding: Japanese tends to require a lot more syllables than English to express the same concept, and there are a lot of homophones.

2. Vocaloid sounds robotic. It's gotten a bit less so over time, but it still doesn't come close to passing as human. If you're okay with robotic, English TTS software has existed for a long time, starting many decades before Google was founded. The hard part, the part that requires neural networks and massive computational power and Google and still has yet to be perfected, is making it sound human.

By the way, although vocaloid software would be given phonetic input, normal Japanese writing uses kanji (i.e. Chinese characters), most of which have multiple unrelated possible pronunciations. Determining which pronunciation applies to each character in a given piece of text is nontrivial, and sometimes even depends on context or meaning.


XHTML lost in the marketplace. This was due in part to lack of developer adoption and the fact Internet Explorer ignored it. You could argue that the web community should have forced this through but that simply wasn't working at the time.


I'm pretty sure javascript heavy pages would exist even in a strict parsing world.


This summary glosses over several events that alienated browser vendors and web developers from the W3C and makes the W3C sound like the good guys.

One that springs to mind is the complete debacle around XHTML 2. The link above ( http://diveinto.html5doctor.com/past.html ) is worth reading to understand things from the WHATWG's perspective.


> the W3C, still being relied on by the other actors on the web that aren’t the 4 largest browsers

Why do other actors need to rely on the W3C? Why can't they use the WHATWG specs as well? You say they're more complicated, but if that's what the browsers do then that's what they do.


Specification vs Documentation...


Because the XHTML spec would have made a lot of things easier.

I can write an XHTML1.1 parser and renderer in about one month.

Writing an HTML5 parser takes thousands of developers 5 years.


I can make a self-driving train far more easily than I can make a self-driving car.

You are comparing two things that aren't equivalent. The living standard provides a lot more value than the old specs, and that's why they won.


The living standards winning is something we should see as tragedy, not as something good.

We’ve made parsing the web something that only massive corporations can do. We’ve made building a browser so complicated that even Opera gave up after the complexity of HTML5, and we’re now left with only 4 major browser engines, of which 2 share a significant amount of code.

Do you see an innovative ecosystem there?

We’re left with a web that can’t be parsed, we’re left with a web where, to run Selenium tests, we have to run Chrome headless, because nothing else can even attempt to render websites anymore.


> Do you see an innovative ecosystem there?

Yes! Everyone says that the specification approach is what was severely limiting innovation, and when they took the living specification approach instead that's when web innovation took off again.


Oh? How many tools do you know that can parse a current HTML5 react app without importing any existing browser?

Every tool we have to use on the web relies on 4 existing tools that only major companies can build.

I’ll pay you $100 if you manage to write, in 4 weeks, a browser, from scratch, that renders a real life React app accurately, including all content, without importing a single bit of code or libraries from existing browsers or web tools.


> Oh? How many tools do you know that can parse a current HTML5 react app without importing any existing browser?

I don't recall XHTML deprecating JavaScript. Pray tell, how would you render the XHTML version of a React app without a browser, or a JS engine as a bare minimum? (X)HTML is orthogonal to JS/React.

I could develop a self-driving car for $50 000 (instead of millions) if human drivers and pedestrians started behaving in well-defined patterns, following strict rules and stopped doing stupid, unexpected things. I'd really love that, but it's not going to happen even though there already are "standards" in the law books.


You asked about innovation, not how easy it is for new people to enter the market. Innovation in browsers has definitely gone up and as a user I feel like I'm winning from this with better websites and more powerful web apps.


How many new browsers do you see? Is that innovation in browsers? Browser competition and innovation is at its lowest ever. Chrome holds 2/3rds usage globally.

I don’t want more powerful ways to display ads, I want to get more content, better connected, without any of the fluff around it.


> Browser competition and innovation is at its lowest ever.

I don't care directly about competition, because that's just a means to an end, and the end I care about is innovation, and I don't agree that innovation is low - I think it's high. New HTML features are coming out in a fast continuous stream, unlike how it used to be.


What does it even mean to parse a React app? Crawlers can't resolve the halting problem, either.


> Crawlers can't resolve the halting problem, either.

And yet, that’s where we’re at today. Blogspot posts require JS. So many other pieces of content are similarly built.

And with PhantomJS gone, we’re now running entire headless browsers just to test if websites are rendered correctly. It’s insanity.


Yes, and before that there were maybe 1.5 viable ways to render flash content.

Then you need to take that `.exe` and try to pull out its content. Game, app, whatever.

We have a web that can be parsed, with difficulty, instead of a web without half of the content we want to parse.

The web was being replaced - this is what saved it, like it or not.


> Writing an HTML5 parser takes thousands of developers 5 years.

This is demonstrably untrue. Servo's HTML5 parser, html5ever, was largely written by a single developer within a year. (Yes, it's not a month, but it's also not 5 years.)


And wouldn’t it have been easier to write it without any of the leniency it has to expose? Wouldn’t it have been easier if all tags were either ending in /> or followed by a closing tag? Wouldn’t it have been easier if the syntax was formally defined as ABNF, and could be translated into code in a matter of days?

I believe it would have been. And I believe that making the web easier machine readable, and making it easier for people to develop tools working with the web, would be a valuable goal.
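
To make the claim concrete, here is a sketch of a recursive-descent parser for a toy strict grammar, written ABNF-style as: element = "<" name "/>" / "<" name ">" *(element / text) "</" name ">". This grammar is an assumption made up for illustration, not XHTML or XML: it has no attributes, entities, comments or namespaces. parse_element, parse_document and StrictParseError are hypothetical names.

```python
import re

NAME = re.compile(r"[A-Za-z]+")   # name = 1*ALPHA
TEXT = re.compile(r"[^<]+")       # text = 1*(any character except "<")


class StrictParseError(ValueError):
    """Raised on any deviation from the toy grammar."""


def parse_element(src, pos=0):
    """Parse one element starting at src[pos]; return (tree, next_pos)."""
    if not src.startswith("<", pos):
        raise StrictParseError(f"expected '<' at {pos}")
    match = NAME.match(src, pos + 1)
    if match is None:
        raise StrictParseError(f"expected tag name at {pos + 1}")
    name, pos = match.group(), match.end()
    if src.startswith("/>", pos):          # self-closing form
        return (name, []), pos + 2
    if not src.startswith(">", pos):
        raise StrictParseError(f"expected '>' or '/>' at {pos}")
    pos += 1
    children = []
    while not src.startswith("</", pos):
        if pos >= len(src):
            raise StrictParseError(f"unclosed <{name}>")
        if src.startswith("<", pos):       # nested element
            child, pos = parse_element(src, pos)
        else:                              # run of text
            match = TEXT.match(src, pos)
            child, pos = match.group(), match.end()
        children.append(child)
    closing = f"</{name}>"
    if not src.startswith(closing, pos):
        raise StrictParseError(f"expected {closing} at {pos}")
    return (name, children), pos + len(closing)


def parse_document(src):
    """Parse a document consisting of exactly one root element."""
    tree, pos = parse_element(src)
    if pos != len(src):
        raise StrictParseError(f"trailing content at {pos}")
    return tree


print(parse_document("<p>hi<br/></p>"))
```

Each branch of the grammar maps to one branch in the code, and anything outside the grammar, such as a bare `<br>`, is rejected rather than repaired. Whether that holds up once attributes, entities and real-world content are added is a separate question.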


> Wouldn’t it have been easier if all tags were either ending in /> or followed by a closing tag?

Not really.

> Wouldn’t it have been easier if the syntax was formally defined as ABNF, and could be translated into code in a matter of days?

No, working from a spec written as an algorithm is easier than working from ABNF plus some inconvenient prose constraints.

(With the exception of template element support, I wrote the HTML parser used in Firefox and Validator.nu.)


> No, working from a spec written as an algorithm is easier than working from ABNF plus some inconvenient prose constraints.

That sounds very unlikely.

I’ve implemented my own parsers for countless specs – plaintext or binary, and the parsers that are written as imperative algorithms are insanely complicated to implement as functional implementations.

I end up with horrifying code, while the specs written as ABNF are much easier to translate into pattern matching code.

The specs written as algorithm only work fine for a single type of implementation IME, while the ABNF specs work equally well for all types.


Way back in the day, HTML was implemented as an application of SGML. SGML was a quite complex markup format, that had lots of features that were kind of complex to implement, and so web browsers didn't actually implement all of SGML, just that which was necessary for HTML and the HTML found in the wild. However, the HTML found in the wild was frequently invalid, so browsers had to implement some clever rules to do something reasonable with invalid markup.

Eventually, people thought it would be nice to have a simpler, easier to parse markup format, with a proper specification, and developed XML. Of course, once you have a shiny new thing, people will come and start bolting things on to it, and so they added features like namespaces, and they included some of the worse features from SGML like the DTD (it turns out, you don't generally want a document to reference its own schema and entities; an application should know what schemas it can accept, and having a URL for a schema in a centralized location means lots of poor implementers would actually download that, and entity expansion is just a nightmare).

However, XML wasn't compatible with HTML, and it certainly wasn't compatible with HTML in the wild. XML parsers are required to reject any invalid markup. The W3C developed XHTML as a way to have a subset that could work in either HTML mode or in XML mode; the idea was that people could start moving to XHTML in HTML mode, and then once everything was cleaned up they could switch to XML mode with strict parsing.

The problem was, strict parsing was never a benefit for either publishers or users. One little bug somewhere in a template substitution which allowed an unquoted < sign could cause the whole page not to parse, and users just be left with an error message. Without completely changing how the majority of web apps worked, there was no way to ensure that all of your content would be strictly compliant XML without the chance of breaking.
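
A minimal sketch of that failure mode, using Python's standard library as a stand-in for a browser: xml.etree.ElementTree plays the strict XML-mode parser and html.parser plays a lenient one. render_template is a deliberately buggy, hypothetical template function, and the other names are also invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def render_template(comment):
    # Deliberately buggy: the user-supplied text is inserted without escaping.
    return "<div><p>" + comment + "</p></div>"


def parses_as_xml(markup):
    """True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient parser that just gathers the text content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Return whatever text a lenient parser can recover from the markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


good = render_template("1 is less than 2")
bad = render_template("1 < 2")   # the stray "<" makes this ill-formed XML

print(parses_as_xml(good), parses_as_xml(bad))
print(lenient_text(bad))
```

With strict parsing the second page is rejected outright, while the lenient parser still recovers the text around the stray character.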

In the browser world, XHTML was implemented, with strict parsing in XML mode, but almost no one used it.

At this time, the browser landscape was pretty bleak. Netscape had died. Mozilla was in the process of building their new browser on a completely new engine, but early versions were fairly bloated and slow; it required a rogue group of developers within Mozilla to fork just the browser portion without a lot of the other functionality to produce what is now Firefox. Opera existed but had a tiny sliver of the market share; it didn't help that the free version came with ads, or you could pay for it without ads, while all of the other browsers were just free. IE was dominant, and eventually captured a huge percentage of market share, and then Microsoft just rested on their laurels and pretty much stopped development.

Soon enough, Apple came around and forked KHTML to build WebKit and Safari. In doing so, they did a huge amount of reverse-engineering effort to make their browser compatible with all of the parsing and layout quirks in IE and Gecko; the standards are not at all sufficient for compatibility. This gave them a browser that could actually be used on the majority of web content. With the introduction of Firefox instead of the bloated Mozilla browser, there was actually competition in the browser landscape again.

Meanwhile, the W3C decided that the failure of XHTML to be adopted, because it met no one's needs, was not sufficient; they went on to start working on XHTML 2.0, a backwards-incompatible change with maybe a few nice features for document publishing but which didn't address any of the needs that web developers had, like rich interactive web apps.

So the WebKit, Mozilla, and Opera developers decided to get together and actually start providing a spec for what would actually work in the real world on real web content. They called this group the WHATWG. The W3C was not interested in working on this; the W3C group insisted on continuing work on the backwards-incompatible XHTML 2.0. Inspired by the clever algorithm the WebKit developers had come up with for parsing invalid content, they actually wrote up something a lot like that into a spec (with some improvements), and this spec eventually was implemented by all of the major browsers, providing much more robustness and consistency between browsers in how content was handled.

Other features that people wanted, like the ability to draw on a canvas, were prototyped in browsers and specced out in the WHATWG group. Some features were adapted from browsers that had already implemented them; for instance, Microsoft had implemented XMLHttpRequest, which turned out to be hugely useful for interactive web apps, so the WHATWG wrote up a spec for this and other browsers implemented it.

Google eventually wound up forking WebKit for Chrome, then merging back in, and eventually forking again into Blink, and joined this group as well.

Finally, the W3C realized that the work it was doing was irrelevant. No one was interested in implementing XHTML 2.0. What everyone wanted were the new features in the WHATWG HTML specification, features like the canvas, and so on. So they agreed to take the current WHATWG spec and edit it into HTML5.

However, it didn't take long for this to break down again. There were people in the W3C who objected to some of the changes made in the spec for the purposes of matching up with the real world. For instance, there had been some accessibility features like the "longdesc" attribute which were specified as containing a URL pointing to a page with a longer, more detailed description of the item in question for accessibility purposes (something that could, say, contain markup, when "alt" wouldn't be sufficient for a description). However, no browsers had ever actually implemented any reasonable way to get to this, and a survey of web content found that even if you did try to implement it according to the spec, very little content actually used it, and a large amount of the content that used it used it incorrectly, pointing to broken URLs or just including plain text like the "alt" attribute. So the WHATWG spec dropped this, and recommended other ways to link to descriptions which would show up even without special accessibility tools. One of the problems with specialized accessibility attributes is that people who aren't using screen readers can't easily test them out, so it's easy for them to bit-rot, but providing a normal link and annotating that so that screen readers can link it to the image makes it possible to see in a normal browser.

Anyhow, some people in the W3C objected to this, and so rather than just providing an edited version of the WHATWG spec, they started tinkering with the spec themselves, adding things back in, removing some things that had been in the WHATWG spec. The W3C structure is very bureaucratic, and it has all kinds of members who are only peripherally involved with any actual tooling for web development, so it makes it very easy for various people with big egos but no real skin in the game to get involved in the process, while the actual browser developers who would be implementing the features can be shut out of the process.

So, eventually the cooperation between the WHATWG and W3C died down again, with the W3C publishing what it wanted, and the actual browser vendors continuing to work on their living standard document that is a much closer representation of the real world.

And this seems to be yet another cycle of attempted reconciliation and breakdown. The W3C agreed to start with the WHATWG DOM specification, then decided to make some incompatible changes without very much justification, and is now trying to publish a new version.

I think that in a lot of ways, there's an ego thing going on here. The W3C was originally started precisely to specify HTML, CSS, and things like that. While their CSS working groups have managed to stay pretty reasonable and are willing to work with those doing the actual implementation, their HTML and DOM groups keep on being hijacked by people with particular agendas, people who won't work in good faith to try to resolve differences reasonably, and people who think that because they're the W3C, they "own" the spec and so think that the WHATWG is just a rogue group, as opposed to a group of the people who actually have the most skin in the game because they have to actually implement the browsers that billions of people use without breaking a hugely complex and diverse amount of content out there.


> Way back in the day, HTML was implemented as an application of SGML.

[citation needed]

I'm unaware of basically any implementation of HTML treating it as an application of SGML; the only notable case I'm aware of is the old HTML validator.

Tim's original implementation of HTML didn't treat it as SGML.


> I'm unaware of basically any implementation of HTML treating it as an application of SGML; the only notable case I'm aware of is the old HTML validator.

Plugging my XML Prague 2017 paper on a SGML DTD for W3C HTML 5.1 here [1].

[1]: http://sgmljs.net/blog/blog1701.html


I was looking for that earlier, it's a fascinating paper, thanks! :)


Sorry, maybe I should have said "inspired by" or something.

You're right, I'm not aware of any actual implementation, outside of validators, that treated it as such.

I did actually say a little later that no web browsers actually implemented it as SGML, but the sentence you quote could cause confusion; but it's too late for me to edit now.


Why don't the browser companies just entirely ignore the W3C from now on?


In part, I think it's because there is still good work that goes on under the W3C umbrella in other areas; the CSS working group has managed to stay reasonable, learn from the mistakes of some over-engineered past standards, and continues to work with implementers.

Also, there are reasons to want some of what the W3C provides that the WHATWG doesn't. The W3C has many more member companies, and it can get them to sign off on patent rights so there's less likelihood of one of Adobe's patents on page layout in InDesign suddenly being infringed by web browsers due to something in the spec.

And finally, I think that the W3C wants to stay relevant and so it keeps on trying to work from a basis on the WHATWG spec, and promising it will be good this time, and then it goes off and pulls this stuff again.


> The W3C has many more member companies, and it can get them to sign off on patent rights so there's less likelihood of one of Adobe's patents on page layout in InDesign suddenly being infringed by web browsers due to something in the spec.

The patent policy only has commitments from members of the WG who developed the spec, so you only have coverage from Adobe patents if Adobe is a member of the WG. (As it happens, Adobe still has one representative within the CSS WG, who happens to be one of the chairs.)


> Inspired by the clever algorithm the WebKit developers had come up with for parsing invalid content,

What was the clever algorithm?



It's both true and false. W3C does copy/paste some stuff, but it also seems to add some stuff (especially related to accessibility and internationalisation) that the WHATWG hasn't historically cared for.



