
> I'd say the W3C wasted a huge amount of time pursuing quests of purity (XHTML) over actually making the web better for users. I see the value in what they were trying to do, but it wasn't letting people do the things they wanted to do on the web.

Okay, let’s make a deal:

We both write a crawler that can fully reliably parse websites.

I implement XHTML1.1, you implement HTML5. We both get one month.

What do you think is going to happen?

XHTML was a worthy goal – with it, we wouldn’t have a need to run headless Chrome for tests. We could parse the web, and actually use the data. OpenGraph tags would never have been necessary. We wouldn’t need to throw DNNs at rendered output of a browser just to parse data.
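To make the contrast concrete, here is a minimal sketch in Python (stdlib only, sample markup invented): well-formed XHTML can be handed to any strict XML parser and either yields a full tree or a hard error, while real-world HTML needs a tolerant, error-recovering parser that leaves tree reconstruction to you.

    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
             '<body><p>Hello <b>world</b></p></body></html>')
    tag_soup = '<html><body><p>Hello <b>world</p></body>'  # unclosed <b>, missing </html>

    # Strict parse: a complete tree, or an error you can act on.
    tree = ET.fromstring(xhtml)
    print(''.join(tree.find('.//{http://www.w3.org/1999/xhtml}p').itertext()))

    try:
        ET.fromstring(tag_soup)
    except ET.ParseError as e:
        print('tag soup rejected by the XML parser:', e)

    # Tolerant parse: the stdlib HTMLParser streams events for broken markup,
    # but recovering a usable tree from them is your problem.
    class TagCollector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print('start:', tag)

    TagCollector().feed(tag_soup)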

HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.




And if we were all using XHTML1.1 now instead, yes, your parser would be easier, except all the richer content would be in flash, and all the web apps would be desktop applications, and you wouldn't be able to parse that at all, even with a full browser.

You are acting like everyone would just stop and wait for you to make your dream implementation that's ideal - that's not how the world works.

WHATWG was an admission that we can't stop it, so we might as well embrace it. Embracing it has resulted in browsers being far more consistent, and new features being a shared part of the web platform, and not siloed off in plugins and other platforms. That's why W3C is irrelevant now.


You’re assuming XHTML1.1 would never have evolved further, never have gotten more content.

And Flash, despite its flaws, would have been a much better starting point for rich content than the ecosystem we have today.

Many of the features Flash provided are only available in browsers today through babel.js transpilation. As a result, we’re stuck with a language without a stdlib and with broken syntax.

We’re stuck with a document model that’s impossible to work with or parse, and with impossible layout management.

If you want web applications, it’d make much more sense to port the Android layout XML format to the web than to attempt to use HTML5 for it, because HTML5 is insanity for building applications.

> and all the web apps would be desktop applications

I don’t see that as anything bad.

The web is for documents, and lightly interactive content. All the rich applications on the web are opaque to any crawler I could write anyway, as I just get "You need JS to view this React app". Desktop applications would be just as parseable, except they would also be less of a resource hog.


The idea that standards groups exist to push for top-down reconcepting of how the world should work is a common one, and it's also a good reason why standards groups fail. Your idea of what the best outcome is won't be the same as every other stakeholder's, and no one stakeholder will have exactly the same idea of the best outcome as the market will.

Ultimately, the market will win, no matter what your standards group says.

The point of standards is interoperability, not rationalization. When standards groups try to rationalize technologies real people work with, they cease to provide value, and instead become obstacles that real engineers end up laboriously working around.


When I was a child in elementary school, a standards committee decided to change my native language.

They replaced the spelling of most words, and many grammatical rules.

We were forced to obey these changes; any use of the old rules was counted as a mistake in school.

Back then, many older books were still using the old rules.

By the time I left high school, almost no books with the old rules were left. All had been reprinted. All newspapers had switched. Autocorrect programs had been updated with the new rules as well.

In a matter of 8 years, an entire language had changed its orthography and parts of its grammar, top-down, and it worked out fine.

I’m sorry, if an entire human language with 120 million speakers can be updated top-down like that, a web spec can as well.


Apples and oranges - when you introduce a government mandate, you remove the market. @tptacek's argument is clearly about market forces in publicly defined standards, not government-enforced ones.

I'm pretty sure the last thing anyone really wants is what we'd wind up with if web standards were left to government dictates...


There’s no need for a government to enforce standards on the web – there’s already an oligopoly that can do it on their own.

In fact, there’s a single company that can just outright dictate web standards, because they hold almost 70% of the browser market: Google.


What language is this? This sounds fascinating.


My example was the implementation of the German orthography and language reforms between 1996 and 2006 (I went to elementary school in 2002, when the implementation was still in progress and most material still used the old spelling; I left high school in 2014).

But French has its Academy, which has even more power over the language, and AFAIK Spain supposedly has similar governing bodies.


Meh - the French Academy has power over prescriptive grammar, sure, but it has almost no power over what descriptive linguistics finds. Lots of Arabic and Verlan have made their way into everyday parlance.

I'm not really sure that the French Academy is really that much more effective than Strunk & White is for English speakers. It primarily seems to be ceremonial / an expression of French pride.


If you want a more tech-related example, Jobs said "no flash on iPhones", and in a few years... poof! no flash.


The counter to that is pretty obvious, because if you remember Jobs also said "no native apps on iPhones", and then in a few months... poof! an app store.

Flash was pretty much dead anyway, and the web platform had advanced enough to mostly replace it at that point. That wasn't true for native apps.

If you want to make a standard, it has to let people do the things they want to do. Otherwise, people will just use a different (or no) standard.


I'm not really getting into the standards thing here--just throwing some ammo to the underdog.

My only point is that there are only a handful of companies with the cash, the talent, and the inclination to tackle these things, and most of them are near if not total monopolies, so as long as what they put out there isn't a blatant kick-in-the-nuts, most of us will just accept it.

iPhone was a compelling product, didn't have flash, everyone migrated to Javascript ASAP. Google is practically a monopoly, and when webmaster tools tells people to jump, watch everyone piss away a weekend to add microformats and shave 5% off of a few 40k images.

Serfs. We are all serfs.


That sounds horrible.


It was amazing. The new orthography is much simpler, and has far fewer insane rules or exceptions. And most people who have seen the transition but were born after it, or who went to school during it, agree.

I know it can work on this scale, I’ve seen it IRL. Many languages do stuff like this: German has its language council, and French has its Academy.

You can do the same on the web. You just need to have all vendors working together to actually do it.


The idea that a bunch of standards group officials can decide for the world that web pages are simply lightweight content publishing mechanisms, that real applications should be built exclusively in Flash, and that this worldview can be ratified and mandated by browser vendors does not seem amazing to me.

At any rate: the Internet is a market system, not a top-down autocracy.


The alternative (and current reality) is that the same things are decided by about four companies in an entirely opaque manner.

At least the W3C had processes and a wide array of members.


Isn't that just theater? None of them can tell Apple and Google what to put in their browsers; in fact, if they can't convince just one of the big 4 browser vendors to do something, their standards have no meaning at all.


It's even more work than that-- check out caniuse for SVG fonts:

https://caniuse.com/#feat=svg-fonts

They had support in both Safari and Chrome, but never in FF or IE (nor Edge). Chrome eventually dropped the support.

So I'd say if you can't get all four to implement the feature then you might as well call that part of your spec a "living standard." Those features are going to get way fewer eyeballs, fewer bugfixes, fewer reviews, fewer pieces of documentation, etc.


Uh, WHATWG is an open process - they have a similar level of control over things that W3C had.

If you want to try and claim W3C ever had the power to enforce people following their specs, IE6 would like to have a word.


The Internet is a network. The web is an oligopoly. Google, Google-by-proxy, and Apple fill the dog bowl, and the rest of us eat from it because it is there.


If you're talking about German, it was not amazing, but a cultural catastrophe, and an extra-legal totalitarian nightmare.


I assume you’re older than 22? There’s pretty much a strict split at around that age. People older seem to consistently hate it, people younger seem to consistently like it, because the new rules are much simpler.

Previously, Gruß and Kuß had no info about how long to pronounce the u – Gruß and Kuss do. And until 2017, capitalizing them into GRUSS and KUSS lost this information, now GRUẞ and KUSS keep it.

Previously, for many words, the rules for when to split a word, when to write words together, and when to hyphenate were insanity. Now it’s all in a few easy rules.

And you have to remember, this wasn’t the first time German went through such changes – ever since the advent of the printing press, when a written German language was basically "invented" from the many dialects that existed, until today, there have been proponents of a prescriptive language evolution, and they’ve had lots of influence over time.

When you use Tarnen, Verfasser, or Absender, Abstand, Bücherei, Augenblick, Leidenschaft, Entwurf or Briefwechsel, Rechtschreibung or Tagebuch, Grundlage, Altertum, Erdgeschoss, tatsächlich or Hochschule, all these words were defined top-down. (All these words are just from Philipp von Zesen, Christian Wolff, and Joachim Heinrich Campe)

A massive amount of what we consider "German" today was defined and changed top-down, and without these changes, German wouldn’t be recognizable.


You're misinformed.

Yes, the German language has had several big changes, but until the reform we're talking about, it was linguistically "proper" in that the existing language was described and codified. It was bottom-up.

In this reform, some non-elected people (who just a few years earlier had said themselves that their job wasn't to invent German, but to describe existing use and trends) invented a whole new orthography from scratch. The new rules have never been in use anywhere throughout the German-speaking lands.

They were and are pure fiction.

In linguistics, that's how you tell a layperson: they think linguistics is prescriptive. Now it seems to be... :-(

And of course people under 22 don't care. They have never learned proper German.


You mean, just like in many other languages? According to Wikipedia, French, Icelandic, Spanish, Swedish, and a few more have had varying degrees of prescriptive language standardization.

> Yes, the German language has had several big changes, but until the reform we're talking about, it was linguistically "proper" in that the existing language was described and codified. It was bottom-up.

I just explained why that wasn’t the case. Many linguists in the past have intentionally invented words (see the ones I mentioned) to make the language simpler, and stricter.

And the same continued until today – the drug store chain Rossmann has been a constant supporter of linguistic prescriptivism, has sponsored groups supporting it, and has been using these concepts in all their published material as well. Many other companies engaged in this as well.

The language has never been defined by the people speaking it, but always by the journalists writing it, the linguists describing it, and the companies influencing it.

And German as a whole was created, as pure fiction, by people trying to publish books across the whole of Germany at a time when everyone spoke local dialects.

At no time has German ever been a bottom-up language – and if we already let our language be influenced and shaped by companies, by media – why not at least use similar influence to make it simpler?

Having a language be simple to use is more important than some fake emotional value of being "natural".


You simply don't understand what I have written. I think we can leave it here.

I don't care about your opinion that it's "fake" and "emotional".

Language is a core part of my being, and a fascist power-grab killing my mother tongue is simply a crime against humanity. It's no different from how the Turks have been treating the Kurdish language.

I have only weak hope, but still hope, that we can someday reverse this. Violently or non-violently.


Do you believe languages are meant to live forever?


But it worked.


I am not sure if this is still the case today, but I remember that not too long after the new orthography / grammar rules were passed, two major news publishers announced that they would return to the old rules.

Also, my sister is a linguist, and I can trigger her going on a long rant just by mentioning the Rechtschreibreform. ;-)

(Personally, I think some of the new orthography rules are much simpler and consistent, so I use them. The rest I basically ignore, unless a spellchecker nags me about it.)


The HTML5 person's crawler will parse some significant fraction of real websites, and the XHTML one won't, because people write HTML5 and not XHTML, even if you as a tool vendor would greatly prefer otherwise.


And yet, forcing people to implement opengraph tags, forcing people to drop flash, forcing people to use HTTPS, forcing people to drop Symantec certs, forcing people to drop SHA1 certs – so often the actors behind the WHATWG have managed to get website authors to change what they use.

Hell, Google has AMP, which is far more intrusive than XHTML ever was, and yet, they’ve managed to get every major website to implement it. https://www.ampproject.org/docs/troubleshooting/validation_e...

And yet, somehow, implementing some stricter spec is supposedly impossible?


There's a big difference between "every major website" and "the web".

"every major website" means 100 companies with skilled developer who can and will react to changes in browsers quickly.

"The web" consists of millions of websites maintained by individuals and small organisations who have no resources to update the way their web pages are coded every year. It contains HTML generated 10, 20, soon 30 years ago. It contains that one app in your intranet with the table layout that you can't replace and that IOT thing you connected to your home Wi-Fi 7 years ago that has no way of upgrading its web interface.

A browser that loses access to "the web" is worse than useless.


There's also a federal procurement picture: big governments making big purchases aren't fans of incompatibilities and favour standardised solutions.

For a company like MS losing access to "the web" could keep a lot of people from becoming VPs...


IMHO XHTML is pretty painful. Even if you say: okay, there is a server-side auto-generated markup tree and we can formally verify what happens. There is now a solution to that and it's called (server-side) React which is basically a (useful) alternative to XSLT. Except that it outputs HTML5.

Even if you argue that the specs are huge: just compare the sizes of books about XML and XSLT vs HTML5, CSS, JS, and React. When I actually tried to do some useful work with XSLT (which needs to be mentioned here, IMHO), I realized that the less painful 2.0 version is hardly implemented by anyone.

Regarding the parsing: X(HT)ML lexing is ridiculously easy; for HTML5 it's slightly more difficult, but not tough at all. You just need to keep a list of closing vs self-closing tags. I'm not talking about building in fault tolerance - that would be tough, yes, even tougher for XML!
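A rough illustration of that point, in Python: a naive tag lexer only needs to know which elements are void ("self-closing") to pair start and end tags. This is a toy under simplifying assumptions, not the HTML5 tokenizer the spec actually defines.

    import re

    VOID_ELEMENTS = {  # the HTML5 void elements
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'param', 'source', 'track', 'wbr',
    }

    TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>|([^<]+)')

    def lex(html):
        stack, events = [], []
        for close, name, text in TAG_RE.findall(html):
            if text:
                events.append(('text', text))
                continue
            name = name.lower()
            if close:
                if name in stack:                      # ignore stray end tags
                    while stack[-1] != name:
                        events.append(('implicit-end', stack.pop()))
                    events.append(('end', stack.pop()))
            else:
                events.append(('start', name))
                if name not in VOID_ELEMENTS:
                    stack.append(name)
        return events

    print(lex('<p>one<br>two<img src=x><b>bold</p>'))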

> XHTML was a worthy goal – with it, we wouldn’t have a need to run headless Chrome for tests. We could parse the web, and actually use the data. OpenGraph tags would never have been necessary. We wouldn’t need to throw DNNs at rendered output of a browser just to parse data.

Yes and no. If you use CSS for styling, the answer is no. If you require JS to render the initial view/page, the answer is no as well. But yeah, if you use the whole XML machinery with XSLT and possibly even XPath, then you would be kind of right. I mean, as long as we properly handle the schemas and DTDs - which almost no parser does, AFAIK. So it's true, one can do pretty bad-ass stuff with all the X*. But tooling and library support is not good and has never been. XSLT 1.x is insanely difficult to use, and XSLT 2.x is hard to fully implement, I guess.
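For what it's worth, the structured extraction being described can look like this - a small sketch using the limited XPath subset in Python's stdlib ElementTree (lxml would add full XPath 1.0 and XSLT 1.0); the XHTML document below is invented for illustration.

    import xml.etree.ElementTree as ET

    doc = '''<html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <table class="prices">
          <tr><td>Widget</td><td>9.99</td></tr>
          <tr><td>Gadget</td><td>19.99</td></tr>
        </table>
      </body>
    </html>'''

    ns = {'x': 'http://www.w3.org/1999/xhtml'}
    root = ET.fromstring(doc)

    # Plain path queries get you the data directly from well-formed markup.
    for row in root.findall(".//x:table[@class='prices']/x:tr", ns):
        name, price = (cell.text for cell in row.findall('x:td', ns))
        print(name, float(price))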


> OpenGraph tags would never have been necessary.

Small correction: XHTML2 had the Metainformation Attributes Module [1]. That then became RDFa in (X)HTML, practically the same syntax and processing model.

Facebook's Open Graph stuff is claimed to be RDFa. When I tested it, their parser did not really do RDFa processing - other CURIE prefixes for the same URI weren't recognized, if I remember correctly.

But in effect, Open Graph meta information would have looked the same in XHTML2 as in today's WHATTF HTML.
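As a concrete aside, the Open Graph data most sites ship is just <meta property="og:..." content="..."> pairs, so a few lines of stdlib Python recover it without any real RDFa processing. The snippet below is an invented example, not Facebook's actual parser.

    from html.parser import HTMLParser

    class OpenGraphCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.og = {}

        def handle_starttag(self, tag, attrs):
            # Collect <meta property="og:*" content="..."> pairs.
            if tag != 'meta':
                return
            attrs = dict(attrs)
            prop = attrs.get('property', '')
            if prop.startswith('og:') and 'content' in attrs:
                self.og[prop] = attrs['content']

    head = '''<head>
      <meta property="og:title" content="Example article" />
      <meta property="og:type" content="article" />
      <meta property="og:image" content="https://example.com/preview.png" />
    </head>'''

    collector = OpenGraphCollector()
    collector.feed(head)
    print(collector.og)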

For your other argument, I agree. WHATWG (and the dumb modern style of development) reduced the democratising aspect of the web. But of course the people of the WHATWG work for billion-dollar companies, which want to have a moat to centralize behind.

[1] https://www.w3.org/TR/xhtml2/mod-metaAttributes.html


The HTML5 crawler would be able to crawl the web, the XHTML crawler would be able to crawl 0.0001% of the web.

XHTML would only improve the web if all existing HTML went away or was changed to XHTML. Since this is never going to happen, XHTML does not simplify anything.


> HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.

I don't understand who the 'everybody else' is in this case, or what their horrible loss will be and when.


People trying to build new tools that parse the web.

Try building a crawler without reusing an existing browser engine.

Try running unit tests against your own web projects with Selenium without running a headless browser.

PhantomJS gave up because they couldn’t keep up with the complexity, and headless Chrome "just works".

Opera gave up on their own browser engine because of the complexity of parsing HTML5 accurately, when the spec is just "whatever Chrome does".

We’ve thrown away an entire ecosystem, just for more flashy graphics.


One of the most valuable aspects of HTML5 is that it defines a parsing model for "broken" HTML.

This means that, for the first time, it's possible to build a brand-new HTML parser that has a high chance of working against all existing HTML without needing to first reverse-engineer existing browsers.

Remember, when HTML5 was first designed, Internet Explorer was by far the most widely used browser. And IE was closed source. If you wanted to build a parser, you needed to first reverse-engineer IE and figure out how it handles invalid HTML.

The HTML5 spec fixed that. The thing you are complaining about here (HTML5 making it harder to build a new browser from scratch) is one of the things HTML5 actually solved!
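One way to see this in practice (a sketch, assuming the third-party html5lib package, which implements the spec's parsing algorithm in Python; the malformed snippet is invented): every conforming parser recovers the same tree from the same broken input.

    import xml.etree.ElementTree as ET
    import html5lib  # pip install html5lib

    broken = '<p><b>bold <p>second paragraph</b> trailing'

    # The spec defines the recovery: the first <p> is closed implicitly and the
    # misnested <b> is reconstructed inside the second <p>, so the body serializes
    # to something like:
    #   <p><b>bold </b></p><p><b>second paragraph</b> trailing</p>
    document = html5lib.parse(broken, namespaceHTMLElements=False)
    print(ET.tostring(document, encoding='unicode'))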


I very much doubt Opera gave up because of "parsing HTML5". Parsing HTML5 is well defined, certainly better than what was there before (unless you are willing to say "if you don't write perfect XML you don't get to be on the web", but good luck with that – no browser ever was, or ever will be, in a position to do that).

The vast majority of the difficulty of the web platform is in the layers above, which don't care whether the DOM they are looking at came from HTML5 or XHTML: layout/rendering, the interactions of JS and the DOM, ...


Indeed, having worked for Opera during the time when we implemented HTML5-compliant parsing, the net result was that we fixed a bunch of site compatibility issues. Implementing it made competing with other browsers easier. (And as you say, parsing HTML is complex, but the rest of the web platform is vastly more so.)


That's a pretty tiny 'everybody else' compared to users, web and browser developers. They seem to be doing mostly ok and I still don't follow your argument that their concerns should somehow reign supreme over those of, you know, the actual everybody else.


It’s a tiny "everybody else" because it never was given a chance to develop.

Maybe you’d also say that the number of people that want to do their online banking with a desktop program that isn’t provided by their bank is a tiny "everybody else".

Yet in places where OpenHBCI exists, many people use it – and there is an ecosystem around it, e.g. KMyMoney can integrate with it, and there’s a small widget for KDE to show your current account balance in your tray.

If we had a machine-parseable web, a similar ecosystem would have developed. When embedding links today, we use OpenGraph tags. But why? If the web were directly machine-readable, we could’ve directly embedded that content.

Google search shows you excerpts of tables on a page as previews; with an easily machine-readable web, similar stuff could’ve happened here too.

Maybe instead of only embedding YouTube videos, I would have been able to easily embed any part of any web page into any other. Maybe I would have been able to easily embed any part of any web page in a desktop application. Maybe I would have been able to build addons that easily search over content of web pages, in a structured way.


This would be a more compelling counterfactual if the approach that won wasn't in many senses the most successful technology in the history of the computer industry.


It seems like the web is 10 times more parseable. I feel like almost all content is now delivered via JSON services, and the HTML around it is constructed on the fly and is totally pointless to look at. At least that's the direction I've seen. I don't know how "everyone else" ends up meaning people who are writing parsers. Seems like that would be a tiny percentage of the population, not "everyone else".


> I implement XHTML1.1, you implement HTML5. We both get 1 month time.

Do you get to use libraries or not?

If you get to use libraries, there are plenty of libraries for both of these tasks, so it will be fairly equivalent. In fact, if you want to really support XHTML well, it will probably be more complex, because you have to take into account namespaces; a tag or attribute is not just a simple string, you have to consider the namespace it's in, so you would have to deal with that.
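To illustrate the namespace point: in XHTML the "same" element can be written with different prefixes bound to the same URI, so a parser has to compare expanded (namespace, local-name) pairs rather than raw tag strings. A small stdlib sketch, with invented markup:

    import xml.etree.ElementTree as ET

    a = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
    b = ('<h:html xmlns:h="http://www.w3.org/1999/xhtml">'
         '<h:body><h:p>hi</h:p></h:body></h:html>')

    p1 = ET.fromstring(a).find('.//{http://www.w3.org/1999/xhtml}p')
    p2 = ET.fromstring(b).find('.//{http://www.w3.org/1999/xhtml}p')

    # Different prefixes in the source, identical expanded names after parsing.
    print(p1.tag == p2.tag)  # True: both are "{http://www.w3.org/1999/xhtml}p"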

If you have to write it from scratch, I'd actually also bet on it being quicker for HTML5. The exact algorithm is specified in the standard; you just have to translate that from pseudocode into whatever language you're working in.

In XHTML, you have to look through several standards (XHTML 1.0, XHTML Modularization, XML 1.0 which includes the DTD, XML Namespaces, XML Schema), and then translate from a specification written in a declarative style into an algorithm that can actually be used to parse the document. XML parsing is actually quite complicated.

> XHTML was a worthy goal – with it, we wouldn’t have a need to run headless Chrome for tests. We could parse the web, and actually use the data.

What? XHTML wouldn't have replaced the use of JavaScript to add features and load content. It wouldn't have changed people's inappropriate use of tags. It wouldn't have changed the fact that web pages are written for human consumption, and so don't generally try to include appropriate annotations on data for processing by other tools.

> HTML5 is amazing for existing browser vendors, developers, and in the short-term, users. But everyone else loses. Horribly.

Imagining that some magical standard is going to make everything better for users is a pipe dream. Providing good data formats and APIs for exporting data is hard, and it is a different problem than providing user interfaces and interactivity. There's not some way in which XHTML could have been extended to serve both purposes; they are just too different.

This is why, when people care about providing programmatic access to data, they generally provide two endpoints; one serving HTML, CSS, and JavaScript for humans to interact with, and one for providing JSON for machines to parse. There have been endless attempts to try and make one standard that would work for both purposes, and they've failed because that's just not a good approach for solving the problem; instead, it's better to just have an API that both the UI code (whether on the client or server side) can use, and other developers can use if you want to expose it.
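As a sketch of that two-endpoint pattern (the URL and field names here are hypothetical, not a real API): the HTML endpoint serves the human UI, and a JSON endpoint serves machines.

    import json
    import urllib.request

    # Hypothetical machine-facing endpoint sitting next to the human-facing pages.
    with urllib.request.urlopen('https://example.com/api/articles/42') as resp:
        article = json.load(resp)

    # No scraping, no headless browser: the structured data is the interface.
    print(article['title'], article['published_at'])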


In the long run, we're all dead. We need short term solutions.




