> This is unsurprising given the total willingness and business incentive of major players such as Google to drive the applications use case and their comparative disinterest in the semantic hypertext use case
Indeed, I am consistently surprised at the document-display features that are still missing from web browsers that are increasingly preoccupied with serving as application runtimes.
The web was originally envisioned as a platform for scientists to share documents, but thanks to Google dropping MathML, there's still no cross-browser, non-kludge way to display math.
Browsers still can't justify text properly. Even with hyphenation (which iirc is still a problem in Chrome), the greedy algorithm used for splitting text across lines still results in too much space between words when justified.
There seems to be a complete lack of interest in Paged Media support, so if you want your web-page to be printable with nice formatting, you basically need to provide it as a PDF.
I've gotten to the point where I'm often happier to read something in-browser as a PDF than as a web-page. Sure, it can't reflow properly, but at least it won't have a 2-inch sticky header and make XHR requests to 10 different domains. My vision-impaired father reports that his text-to-speech software now works better with PDFs than with real web-pages. This is insanity.
> Browsers still can't justify text properly. Even with hyphenation (which iirc is still a problem in Chrome), the greedy algorithm used for splitting text across lines still results in too much space between words when justified.
> There seems to be a complete lack of interest in Paged Media support, so if you want your web-page to be printable with nice formatting, you basically need to provide it as a PDF.
These are the two main reasons I think of whenever someone complains that scientific publication should be Web-based instead of PDFs. Browsers are still generally lacking when it comes to proper typesetting.
On the other hand, the restriction that scientific publications must be renderable onto reconstituted dead trees means that authors often spend multiple “pages” explaining what a single interactive chart would explain far better. This makes science less accessible than it should be and, in my view, is part of the reason the media so often misrepresent scientific findings.
> On the other hand, the restriction that scientific publications must be renderable onto reconstituted dead trees means that authors often spend multiple “pages” explaining what a single interactive chart would explain far better.
On the gripping hand, in a century those reconstituted dead trees will still be legible by anyone who can lay hands on them; the likelihood that an interactive presentation from 2019 will be legible to anyone other than a computer archæologist in 2119 is effectively nil.
Heck, there’s a decent chance that it won’t work in a year or two!
Interactive charts are great for supplemental material, but moving primary sources (e.g. journal papers) to anything as ephemeral as the current web stack is a terrible idea.
Also there is no way to include a file... Let's say you have a menu at the top of your website. You can't use frames because frames are bad mkay, and you can't include it. So you have to copy-paste it into every page manually, using some build system or even using some DHTML technology.
Chrome and only Chrome has an include extension, but for security reasons (something about cross-site foo), you can't use it when testing an HTML file on your disk, because you need a hostname to check the same-origin policy. You need to serve the file from a local webserver, which, fortunately, is fairly easy with Python.
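For example, Python's built-in module serves the current directory over HTTP on localhost:

    python3 -m http.server 8000

and the page is then reachable at http://localhost:8000/.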
You can do this with XSLT, which actually works in Chrome, Firefox, and Safari, both on mobile and desktop. (I've been using it in a small site of my own for a while.) You can even do stuff like page-dependent highlighting of menu items without any JavaScript.
Granted, I'm surprised XSLT is supported at all in modern browsers, and who knows how long it will continue to be supported, but it does work now for the basic "menubar at top" use case.
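For anyone curious, here's a minimal sketch of that setup (the file names site.xsl and menu.xml and the tiny <page> vocabulary are made up for the example, not anything standard). Each page is an XML file that points at a shared stylesheet, and the stylesheet splices in the menu kept in one shared file via document():

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="site.xsl"?>
    <page title="About">
      <content><p>Page text goes here.</p></content>
    </page>

    <!-- site.xsl -->
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <xsl:template match="/page">
        <html>
          <head><title><xsl:value-of select="@title"/></title></head>
          <body>
            <!-- shared menu kept in one file, pulled into every page -->
            <xsl:copy-of select="document('menu.xml')/menu/*"/>
            <xsl:copy-of select="content/node()"/>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>

Note that document() is subject to the same same-origin restrictions mentioned above, so this too wants a local webserver rather than file:// when testing.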
Yeah, I never understood why there wasn't an <include> element in HTML that would just paste in the content of the linked document. Why wasn't such an element added in the late '90s or early 2000s?
In the early days, when dynamic webpages were processed by something called CGI and even when we started using PHP, this was like 90% of the use case for doing a dynamic webpage. The last 10% might have been some sort of counter.
They weren't common. It depended mainly on the server configuration, language used, etc. Most websites were static with some dynamic pages. Includes became really popular with the rise of PHP and ASP.
Right, but server-side includes mean unnecessary technology and bandwidth use to work around the absence of a trivial(ish, there are probably a few snags) feature.
> I never understood why there wasn't an <include> element in HTML that would just paste in the content of the linked document. Why wasn't such an element added in the late '90s or early 2000s?
As far as I know, for SGML (and therefore old-fashioned XML, too), 'include' gets realized via parsing of external &entities;. That's how the Mozilla XUL applications (SeaMonkey, Firefox, Thunderbird, etc.) implemented localization, for example.
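Roughly like this, with an external general entity declared in the internal subset (file name made up). Keep in mind that a non-validating XML processor is allowed to skip external entities, and browsers generally won't fetch them for ordinary web content, so this worked for XUL chrome but isn't something to rely on on the web:

    <?xml version="1.0"?>
    <!DOCTYPE html [
      <!ENTITY menu SYSTEM "menu.xml">
    ]>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <!-- the parser splices in the contents of menu.xml here -->
        &menu;
      </body>
    </html>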
I don't know when XInclude was first ratified (the current spec, its second revision, is from 2006), but with it a generic inclusion mechanism exists for XML, and therefore also for XHTML.
With XHTML you could (in theory, as said, the spec is there) use XInclude.
    <?xml version='1.0'?>
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:xi="http://www.w3.org/2001/XInclude">
      <head>
      </head>
      <body>
        <p>120 MHz is adequate for an average home user.</p>
        <xi:include href="disclaimer.xml"/>
      </body>
    </html>
I am not quite sure yet how namespace mixins are handled in XHTML; I think one must define a new schema (which, honestly, is a crappy requirement and should be abandoned), effectively creating an XHTML+XInclude document type, but that may only be a real issue for strict validation.
When I first started out I was using free web hosts that didn't support server-side includes or PHP, so I ended up building my own JavaScript front end framework. Unfortunately, I didn't actually know any JavaScript, so I wrapped every line of my HTML header template in document.write() and included it in a script tag. There were some performance issues.
In my purely-HTML GitHub Pages site I ended up hacking a header into the common CSS file, because that's "included" everywhere. It's ugly and I don't think you can even put links in it, but it sorta works. If I want to tweak it, it's all in one file.
> but for security reasons (something about cross-site foo), you can't use it when testing an HTML file on your disk,
I think you can disable that check as a command line option. I cannot check it right now since I don't have Chrome installed; however, I think it was --allow-file-access-from-files.
Chrome didn’t drop MathML, they dropped their horrible implementation. Currently Igalia are building a new Chromium MathML implementation: https://mathml.igalia.com/
They dropped their implementation of MathML in 2013. In the six years since then, they didn't fix it or replace it with a better one.
I'm extremely excited by Igalia's work on MathML and glad that the Google Chrome team has given it their preliminary approval, but in the context of criticising Google for working on the "application web" and ignoring the "document web" (for scientists and others), it doesn't change anything[0]. Google is neither doing the work, nor funding it[1], and their contribution until now has mostly (solely?) been to be open about merging it.
[0] not that you explicitly said that it did...
[1] It's funded by the NISO and the Alfred P. Sloan Foundation.
> thanks to Google dropping MathML, there's still no cross-browser, non-kludge way to display math.
Probably a huge tangent, but UnicodeMath [1][2] is at least encodable and (basically) readable just about everywhere UTF-8 (et al.) is supported, which is all browsers today, even if by default it is only a linear presentation. I'm surprised there aren't more progressive renderers for it that up-level it to the more traditional two-dimensional rendering in browsers directly yet (after several years of being standardized as a Unicode Technical Note). A cursory glance shows that even MathJax still doesn't seem to support UnicodeMath (at least out of the box), like it does MathML and AsciiMath. (But I think raw UnicodeMath is easier to read than MathML or AsciiMath in a "progressive" fashion from linear to "professional"/two-dimensional rendering. Though that's a personal preference/aesthetic that can get quite subjective.)
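To give a sense of the linear presentation, the quadratic formula in UnicodeMath is roughly:

    x=(−b±√(b²−4ac))/2a

which is already readable as plain text, and which a smarter renderer could build up into the usual two-dimensional layout.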
> My vision-impaired father reports that his text-to-speech software now works better with PDFs than with real web-pages.
I have enough experience with PDFs and screen readers that I find this very surprising. Which PDF reader does he use, and does he get most or all of his PDFs from a particular source?
> Browsers still can't justify text properly. Even with hyphenation (which iirc is still a problem in Chrome), the greedy algorithm used for splitting text across lines still results in too much space between words when justified.
You might try:
word-spacing: -0.1ex;
Good justification of text requires that sometimes the spacing between words is a little narrower than normal. Perhaps browsers are unwilling to do this unless you explicitly give them permission. But I wish they could just adopt a good text-flow algorithm like the one Adobe InDesign has.
But good justification also requires hyphenation. And last I looked, browser support was spotty. It did not work on Chromebooks, for example.
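Putting those together, about the best you can ask for today is something like the following (support for hyphens: auto still varies, and it needs a language declared on the document):

    p {
      text-align: justify;
      hyphens: auto;          /* requires a lang attribute and dictionary support */
      word-spacing: -0.1ex;   /* permit slightly tighter-than-normal word spacing */
    }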
A "page" exists because paper exists. When you aren't using paper why would you break up your content into arbitrary fixed size increments, rather than using (say) "sections", which are the size of the content you put into them?
Right, you shouldn't do that. Instead, you should leave your content as-is when viewed on a screen but add a separate stylesheet that gets applied only when you print, to format your content for the paper: e.g., put footnotes at the bottom of the page. Unfortunately, features like this are largely missing from browsers.
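A sketch of what that separate print stylesheet would look like (selectors are placeholders): @media print and basic @page size/margins do work in browsers today, but the footnote part is from the CSS GCPM draft and is currently only honoured by dedicated formatters like Prince:

    @media print {
      nav, .sticky-header { display: none; }  /* drop screen-only chrome when printing */
    }
    @page {
      size: A4;
      margin: 2cm;
    }
    .footnote {
      float: footnote;  /* GCPM draft: move the element into the page's footnote area */
    }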
For one, because you want to display existing, page-divided documents (e.g. books), and keep the same frame of reference (the page).
Second, because you want to print -- in which case, your artifact ends up in a page.
Third, because a screenful, even if dynamic (based on resolution, etc) is still a kind of a "page" (in HTML parlance, a "viewport"), and you might want to design to take advantage of what fits in one.
It's actually all already standardised in CSS, since like a decade ago or so, yet no major web-browser has proper support for it.
I've been trying to use it for my resume since maybe 10 years ago, to make it possible to convert it to PDF with a standard browser, but to this day many of these features aren't fully supported.
That's kinda the main selling point of software like Prince, which implemented all of this ages ago, but it requires a licence.
Breaking up documents into arbitrary (but specified) fixed size increments is a useful way to refer to parts of a large document. Sure, it'd be better if everything was hyperlinked with semantic anchors, but until the trends identified in the article are reversed, that's not going to happen. Pagination is a good fallback option that everyone understands intuitively.
The partition into pages is not a crucial point of proper typesetting. If I understand the OP, the main complaint is that you cannot properly typeset text with equations in a standard way that looks acceptable.
That last paragraph absolutely nails it. Trying to re-implement the operating system as the browser is mutually exclusive with it being a good browser or standard for webpages.
> If anyone constructed a PDF, which was itself blank but, via embedded JavaScript, loaded parts of itself from a remote server, people would rightly balk and wonder what on earth the creator of this PDF was thinking — yet this is precisely the design of many “websites”. To put it simply, websites and webapps are not the same thing, nor should they be. Yet the conflation of a platform for hypertext and a platform for applications has confused thinking, and led developers with prodigious aptitude for JavaScript to mistakenly see mere websites of text as a like nail to their applications hammer.
It's worth remembering that, because there were no Javascript or HTML APIs for many things people felt their pages needed, a huge portion of the web was loading up flash, silverlight, or java applets.
Now, maybe they never should have allowed Flash and friends, but the genie was well out of the bottle. You could either have HTML and JS based functionality, or the plugins.
> It's worth remembering that, because there were no Javascript or HTML APIs for many things people felt their pages needed
The author states in his article that he is fine with many of the new APIs. It's not HTML5 vs. XHTML, it's 'applified web' vs. 'document web'. And I agree.
I’ve seen PDFs like these: it’s one way for publishers of (costly!) industry research reports to allow limited downloads and still track who’s reading them (the kind where a copy is listed at 5000 USD for non-subscribers). And they are horrible to use, but do serve a purpose.
What is interesting to notice is that it was the research industry that gave birth to the very web this article rightfully wants to differentiate into a document web and an application web.
> If anyone constructed a PDF, which was itself blank but, via embedded JavaScript, loaded parts of itself from a remote server, people would rightly balk and wonder what on earth the creator of this PDF was thinking — yet this is precisely the design of many “websites”.
Unfortunately, this is also the case for many PDFs among some of the more niche use cases, like NDA-bound many-thousand-page specifications. The PDF standard lets you do it, so some do. Not all readers actually support embedded JS, but most do in some limited capacity.
Yes and no. Whatever prints the PDF has to be able to run the JS payload.
In the case of a physical printer, it's usually set up so there's [redacted] instead of blankness (that the JS replaces), so you get redacted images and paragraphs.
In the case of software printing to a static PDF which you can then print, the JS payload is generally set up to embed a series of rather obvious markers so that if the static copy is ever publicly revealed, they can go after you (DRM is easy when you only have a dozen customers in the entire world), but most PDF software that supports JS also supports JS disabling printing features so that little disincentive is unnecessary.
If it is 10 times faster on average, safer, and generally less annoying, it might have a natural advantage.
We could get some of the advantages if we just started shunning JS and telling people ("Proudly Javascript-free and free of third party trackers! You might have noticed that this site loaded quicker than other sites you have seen lately. Mostly this is because we don't bug your machine with insert-carefully-crafted-explanation-here.")
A lot of issues with this post, but I'll just take two:
> rather than articulating particular requirements and principles but not how they need be met, the WHATWG specifications tend to be written in a highly algorithmic and prescriptive style; they read like a web browser's source, if web browsers were written in natural language.
It turns out that if you want to have pages work the same in every browser you need to have every browser doing the same thing when it interprets the pages.
> The pursuit of the semantic web has changed in the era of HTML5, which represented a rejection of XHTML — to me, a seemingly bizarre rejection of having to write well-formed XML as somehow being unreasonably burdensome.
In practice, people won't write valid XML. We had a lot of cargo-culting, people putting self-closing tags into HTML, but people weren't using XML editors. And without an editor that understands XML it definitely is unreasonably burdensome to create XML. What we saw instead was that even most "XHTML" documents were not valid XHTML and were served with an HTML content type. If you had served them instead with an XHTML content type the browser would have simply refused to render them.
Both of these were recognitions that the previous approach wasn't working, and that if the spec was to achieve its goals we needed to try something different. Under WHATWG the spec has moved from "yes, the spec says this but it doesn't matter" to "the spec describes what the browsers do, and the browsers treat cases where they violate the spec as bugs". Sites now really do work the same across browsers, and WHATWG deserves a lot of credit for that.
> In practice, people won't write valid XML. We had a lot of cargo-culting, people putting self-closing tags into HTML, but people weren't using XML editors.
That's probably the most popular argument against X(HT)ML and in favour of HTML5. In fact I think among the popular programming/markup languages, only very few are so forgiving. Namely it is HTML(5), JS, CSS and perhaps Shell script and Perl. But even in these cases following best-practices and using Linters has become extremely popular. On the other hand you have strongly typed languages or even languages like Python or Makefiles that even make sure you use consistent whitespaces.
I think nearly everybody uses quite powerful editors with a load of plugins these days, because they accelerate editing and also do autoformatting.
> Sites now really do work the same across browsers, and WHATWG deserves a lot of credit for that.
On the other hand there are just 2 popular/"usable" browser engines left. I think XHTML is far more modular; maybe it would even be possible to outsource some browser rendering tasks to XSLT transformations. HTML at one point became a messy standard through the browser competition, and then the WHATWG somehow cemented that situation, I guess. Now there's a massive monoculture of browser engines.
This is one of the things I liked so much about XML; you could have a single document and use XSLT to generate a web page or (via XSL-FO) a nicely-formatted print document.
Given the way developers still test Blink/WebKit (they just test one browser in the family, be it Chrome or Safari, largely depending on the web developer's home hardware), it's hard to consider it a hard enough fork to count as two separate engines. Sure, reality shows more divergence than web developers who test that way generally expect or believe (Safari as an Apple-only "LTS" Chrome), but that perception and its reflection in testing alone seem reason to consider the family together (especially with Edge moving into the "family" for, among other reasons, that very reason of "simplifying web developer testing requirements").
> it definitely is unreasonably burdensome to create XML
Disagree. You open the tag, you close it. Nested. I worked with a non-developer who had some KML (https://developers.google.com/kml/) dumped on her without any training whatsoever, and when I explained a few basic things, including the above rule, she got the principle of it in a few minutes. Because it's not hard (although dumping a KML task on a non-dev is a bastard thing to do).
> but people weren't using XML editors
I personally think that lowering the bar to let as many people in as possible isn't necessarily a good idea. Just IMO
I wonder how many people think it was unreasonable due to how you had to send back the page to be truly valid; application/xhtml+xml, application/xml, or text/xml. Due to IIS, browser support, and the platforms I personally used, that was the only requirement I never felt I had to meet.
I personally agree with you, and still close my tags, use quotes around attribute values, and provide empty/duplicate values when needed.
I also still use XML for data storage, and used ColdFusion back in the 6/7 days, so that probably had an influence on my choices.
I agree the approach sounds reasonable, and I think if we were having this conversation twenty years ago I would have been on the pro-XHTML side. But we tried it, and it didn't work.
XHTML could either be served as HTML or as XML. When served as XML it would not render in all browsers, but when served as HTML it would. Since serving as XML did not provide any particular value for the document author, there was no reason to use it. But XHTML served as HTML is just HTML with some extra slashes which are ignored.
> It turns out that if you want to have pages work the same in every browser you need to have every browser doing the same thing when it interprets the pages
Yes, but the change was that first browsers adapted to the standard, now the standard adapts to the browsers. That's the big change.
And this makes it better how? That's one of the reasons a Blink quasi-monopoly is not a good thing. We risk them becoming the standard to which the web must conform.
On the other hand, an independent standard organization (which, of course, the w3c isn't) would not have to follow a browser's vendor agenda. They'll ideally want to do what's right for the web.
To sum it up, I don't think having browser vendors set the standard is in the best interest of the web platform, especially when they are so unequally represented in actual usage (and we can debate about how we got ourselves in this situation).
Web development today is loads better than it was.
Then: if I wrote to the spec I would find the major browsers would all handle my page differently. Writing complex cross-browser pages was a constant pain involving IE-specific hacks (special CSS comments that only IE understood). The browsers were not interested in implementing the spec because it meant a ton of work for no benefit and existing pages would break, and the W3C had moved on to XHTML-only approaches.
Now: I can write to the spec and the major browsers (based on WebKit, Blink, EdgeHTML, and Gecko) will all do the same thing with my page. Spec violations are bugs, and are taken seriously by the browser vendors.
(Disclosure: I work for Google, though not on Chrome)
But what benefit is a standard if browsers do not implement it? It can be "right for the web" only in a theoretical sense if it is never implemented.
Thank you for posting this! Now I know that I'm not the only “grumpy old man” who smells the fishy part of the “Modern Web”.
With that said, I would love the author to go on and elaborate on other advantages of XHTML2, such as possible integrations with XForms (including more inputs and sending requests without page reloading and without JavaScript), XFrames, the single header element <h>, every element as a hyperlink, etc. Then there are MathML and XSLT. If XHTML2 became a reality, we would probably see XSLT 2.0 more actively adopted by the browser vendors, which is a good thing in my book.
XForms is one of the things I wish would be more widespread, it’s such a good idea in principle. I think it just gets tarred with the “XML is bad” brush and ignored.
Someone should write an article and call it something like “‘XML Considered Harmful’ Considered Harmful”. By no means is XML a very nice one to work with (manually). But at some point during the late 1990s it had a real chance of becoming the document and data mark-up language, with standards allowing you to do pretty much anything with it. And a bunch of WYSIWYG tools for those who are allergic to plain text and scriptable editors. And I think it would be a slightly better world.
I wish the XML community had given proper priority to usability along with quality tooling and examples. There was a solid decade where people would work on a spec which sounded cool but effectively never shipped from the perspective of working developers, or did so with enough bugs/inconsistent support/poor performance/bad UX that it was a net cost. As a simple example, you still can't use XPath 2 portably because libxml2 never implemented it – absolutely nothing that the XML standards community was working on had even 5% of the value that would have come from fixing that or countless similar problems, each of which added constant pressure to stop using XML. The same was true of good documentation and examples: the assumption was that other people would take time to learn these convoluted specs, but most of them started using JSON instead because they could ship so much faster.
> you still can’t use XPath 2 portably because libxml2 never implemented it
That's not the fault of the XML community at large, but just a lack of resources for the implementation of an unpaid open source project.
You can happily use XPath 3.1 with Saxon, BaseX, or eXist. All three use Java, so it's not portable, but Saxon has a C library that mirrors the Java version 1:1, and that C library is also available as open source, though it still lacks some XPath 3.x features, like higher-order functions.
For the command line, there is a partial XPath 3.1 implementation with 'xidel'.
But I agree, libxml and libxslt being at XPath 1.0 for so long did not serve XML well.
I classed that as something for the XML community to prioritize because implementations are so important to adoption for standards. If you design standards and want them to be used, at some point you need to figure out how to get resources to support development of key implementations, help major projects migrate to alternative implementations[1], or develop a replacement if nothing better is available.
The fragmentation you mentioned is part of what made this so frustrating: if everything you used was within certain toolchains, the experience was fairly good but then you'd need to use a different ecosystem and either drop back to good old XPath 1 or take on more technical debt. In many cases, the answer I saw people favor was leaving the XML world as quickly as possible, which is something the community has a strong interest in.
1. For example, Saxon added support for Python just a few days ago: https://www.saxonica.com/saxon-c/release-notes.xml Imagine if that had happened a decade ago and everyone who was stuck with libxml2 could have easily switched?
On a fair note, one should also take into account that Saxonica is a rather small, even if highly skilled, shop, and the program they create is a huge undertaking.
The first version of XML was published in February of 1998; I think the chance you're thinking of was somewhere around the early 2000s - I'd think 2004 was still possible, at any rate before the apparent failure of SOAP and XSD. By 2006 it was becoming clear it wasn't gonna happen.
JSON is a better data language, but a lot of that can be laid at the feet of XSD and the big squandering of momentum it represented.
I think the problem was that while XML was an OK mark-up language for human-readable documents, it was horrible as a structured data-interchange language. Unfortunately, that did not prevent people from trying to use XML for data interchange, and this caused widespread hate of it.
XML screwed the pooch by trying to add namespaces after the 1.0 standard was published.
Handling namespaces correctly requires that the parser API be changed in non-backwards-compatible ways. This would have broken every single piece of code that used an XML parser. So instead, people mangled documents by simply flattening all the namespaces together if you tried using the old API.
This was a godawful nightmare and made everybody who was around at the time absolutely hate XML namespaces -- even those of us who know why they are so important. Plus namespaces are not exactly an "ELI5" topic, so a lot of lazy programmers looked at this and said "that's complicated, I don't want to learn it, HEY LOOK there's this older deprecated API that doesn't have them -- I'll use that!" So the old APIs became immortal and in fact gained additional users long after they were deprecated.
They should never have let the 1.0 standard out the door without namespaces in it.
Alternatively, they should have made it more backwards compatible and, most importantly, heavily pressured implementers not to be complete slackers about usability. The cycle I've seen most frequently was something like this:
1. User gets a simple XML file and writes XPath, XSLT, or other code which says `/foo/bar`, which fails.
2. User notices that while it's written as `<foo><bar>` in the source, it's namespaced globally so they change code to use `/ns:foo/ns:bar`, which also fails.
3. User does more reading and realizes it needs to be `/{http://pointless/repetition}foo/{http://pointless/repetition}bar`, or something like repeating the document-level namespace definitions on every call so their `ns:foo` is actually translated rather than treated as some random new declaration.
4. User does something hacky with regular expressions to get the job done and at the next chance ports everything to JSON instead, seeing 1+ orders of magnitude better performance and code-size reductions even though it's technically less correct.
That experience would have been much less frustrating if you could rely on tools implementing the default namespace or being smart enough to allow you to use the same abbreviations present in the document so `<myns:foo>` could be referenced everywhere you cared about it as `myns:foo` with the computer doing the lookup rather than forcing the developer to do it manually.
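For what it's worth, the non-hacky fix for steps 2 and 3 in XSLT/XPath 1.0 is to bind a prefix of your own choosing to the document's namespace URI and use it consistently; only the URI has to match the document, not the prefix:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:ns="http://pointless/repetition">
      <!-- ns: is our prefix; it is resolved through the URI, not the document's spelling -->
      <xsl:template match="/ns:foo">
        <xsl:value-of select="ns:bar"/>
      </xsl:template>
    </xsl:stylesheet>

The original complaint stands, though: XPath 1.0 gives you no way to say "the default namespace" for unprefixed names, which is exactly the usability gap being described.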
Hey, I actually wrote an application that produced strict XHTML2, and served it with an application/xhtml+xml content type (IIRC).
It was a page that displayed IRC logs, with linkable anchors for each line, automatically breaking words that were too long for the browser, turning text into links by regex, interpreting terminal colors etc.
Each time I had a tiny cross-site scripting bug in there (some part wasn't XML-escaped properly), some data would eventually trigger it (you wouldn't believe the amount of encoding junk on IRC), and the browser would simply refuse to render anything at all. Inconvenient for my users, but it made sure such things didn't slip by unnoticed.
---
This was a side project, done for fun, and a few friends that used my site. If I had been trying to make money from it, the first thing I would've done is to switch to something less strict, so that tiny errors wouldn't stop rendering the whole page.
Around 2010 I did some web programming. XHTML was so much easier to parse and render from code because I could use some very powerful XML libraries. HTML is different enough from pure XML that it requires a different parser.
It was very nice to load an XML document and look for tags in a specific namespace instead of using a specialized HTML templating engine.
One of the things I hate most about HTML5 is that documents don't need to be well-formed XML, and are even encouraged not to be (`<hr>` instead of `<hr/>`, etc.), thus excluding XML processing tools from being able to work with HTML documents. One then needs to support tag soup / follow the HTML5 parsing rules to the letter, when the whole mess could have easily been avoided. This affects text editor plugins etc., where one might want to use a single plugin/codebase and use XPath to traverse both XML and HTML documents easily.
I don't quite understand the XML fetishism. HTML is originally based on SGML, and SGML is every bit as structured as XML by definition, since XML is specified as a proper subset of SGML. From the XML spec:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
The "generic" part refers to XML being canonical, fully-tagged markup not requiring vocabulary-specific markup declarations for tag omission/inference, empty elements and enumerated attributes like is necessary for HTML and other SGML vocabularies making use of these features.
That XML has failed on the web doesn't mean one has to give up structured documents. In fact, HTML can be converted easily into XHTML using SGML [1]. If anything, markup geeks should embrace SGML (an ISO standard no less) to discover the power of a true text authoring format. For example, SGML supports Wiki syntaxes (short references) such as markdown.
Look at this "<p<a href="/">first part of the text</> second part". This is a valid document fragment in HTML 4.01 because HTML is authored in SGML.
Writing a correct XML parser is much easier than writing a correct SGML parser, and what's more important, it's much easier to recognize errors.
I agree with OP that HTML5 should have been XML from the start. Nowadays, you hardly write any HTML by hand and even if you do, it's easy to write syntactically correct XML.
It's true that you can convert any HTML into XML with ease but it's still a stupid, unnecessary step.
> I agree with OP that HTML5 should have been XML from the start.
The key requirement for HTML5, and why it succeeded where XHTML had limited success, was that existing HTML docs had to work with it. Which is why it has both an HTML and an XML format.
It was not wrong for it not to be pure XML, it was absolutely necessary.
> You could write XHTML 1.0 documents that were backwards compatible to browsers that only understood HTML 4.01.
You could and a lot of people _tried_, or at least pretended to. But the vast majority of documents that tried to do this failed to actually be well-formed XML, for various reasons... In practice, even restricting parsing as XML to cases when the page was explicitly sent with the application/xhtml+xml MIME type would leave a browser with problems when sites sent non-well-formed XML with that MIME type. This was a pretty serious problem for Gecko back in the day when we attempted to push XHTML usage (e.g. by putting "application/xhtml+xml" ahead of "text/html" in the Accept header). So we stopped pushing that, since it was actively harming our users...
The point is that this hasn't happened, neither back in XML's heyday, and much less today. Now you can bemoan XML's demise until the end of time, or you can fall back to XML's big sister SGML. As I said, SGML has lots of features over XML that are in fact desirable for an authoring format, such as Wiki syntaxes, type-safe/injection-free templating, stylesheets, etc., on top of being able to parse HTML. Many of these features are being reinvented in modern file-based CMSs and static site generators, so there's definitely a use case for this. Whereas editing XML (a delivery rather than an authoring format) by hand is quite cumbersome, verbose and redundant, yet still doesn't help at all in how text content is actually created on the web.
Is SGML even still used? The only use case I remember besides HTML is DocBook, and of course that has also had an XML variant for a long time.
SGML is needlessly complex as an authoring format. Even HTML was considered too complex, and that's why we got lightweight markup languages like Markdown and AsciiDoc.
I would be very surprised if we ever turn back to something like SGML, especially as there are well-designed lightweight markup languages like AsciiDoc or reStructuredText.
To give you an idea of what SGML is capable of, see my tutorial at [1]. It implements a lightweight content app where markdown syntax is parsed and transformed into HTML via SGML short references, then gets HTML5 sectioning elements inferred (eg. the HTML5 outlining algorithm is implemented in SGML), then gets rendered as a page with a table of content nav-list linking to full body text, and with HTML boilerplate added, all without procedural code.
SGML was in fact designed to be typed by hand, as an evolution of earlier mainframe markup languages at IBM. The idiosyncratic shortcut features are supposed to reduce the number of keystrokes needed for entering text.
HTML was based on SGML, but HTML 5 is explicitly not SGML anymore, and the specification calls the format "inspired by SGML". So to be fully conformant you need a custom processor instead of being able to use standard tools.
HTML5 doesn't cease to be based on SGML by a browser cartel with the express intent to transform the web into JavaScript-heavy web apps declaring so. WHATWG isn't an accredited standards body so what they declare a "standard" or "conformant" means shit. Especially if they don't bother to actually publish a standard that doesn't change all the time. Their "living standard" thing is at best a collaborative Wiki space of sorts where (a closed group of) "browser vendors" attempt to agree on how to do things, and is falling apart lately. WHATWG's "standard" has witnessed the web becoming a Chrome monopoly, and Opera and MS to cease browser development altogether.
SGML is the only game in town able to parse (a significant part of) HTML based on an actual standard, and is also the only realistic perspective for folks interested in the web as a standardized communication medium going forward.
HTML was "based on" SGML only in the sense that it borrowed a lot from SGML. However in practice it was never an application of SGML. HTML4 tried its best to move developers to SGML based HTML but devs ignored it.
HTML5 recognises that there was this gulf between the specification and the actual usage and sided with real world usage.
Even if `<hr />` (or others, e.g., `<img />`) were required, XML processing would not work. HTML (prior to HTML5) was full of quirks (e.g., table handling, formatting elements, ...) which cannot be expressed by a DTD as used from XML. As a result, the DOM as seen from an XML POV could always be different from the real DOM, even if the source could be parsed.
HTML5 just standardized all these quirks, leading to a uniform parsing model instead of an even bigger x-browser mess.
I wouldn’t say “easily” avoided. You can’t ignore the billions of web pages that would have already existed in a format that was non-compliant with XML at the time. With so much “prior art”, there is simply no way that any browser will ever be able to throw out its fuzzy/imprecise parser, which means that support for well-formed XML requires them to maintain two readers: precise and imprecise.
As far as XML “tools”, I am shocked at how even now I encounter real XML parsers that don’t necessarily reject malformed data files but do atrocious things with them (like silently pretend that certain tags were not even in the file). Thus, I end up using extra steps like a linter as a front-end sanity check. And while this example is a pure-data application, a linter is also a sensible front-end sanity check for HTML. XML isn’t going to win over HTML if it requires the same steps to clean up imperfections in the process.
One benefit of XHTML not mentioned is that, as an XML spec, it can be embedded into other markups. OpenDocument / OASIS ODF for example, uses it extensively - the written documents are basically just HTML, which for many applications is much more accessible than the OpenXML equivalent.
A lot of people used to emit XHTML with invalid string builders rather than XML serialisers though. Much XHTML was ruinously broken.
> The pursuit of the semantic web has changed in the era of HTML5, which represented a rejection of XHTML — to me, a seemingly bizarre rejection of having to write well-formed XML as somehow being unreasonably burdensome.
This is a classic "worse is better" situation. HTML5 may be "worse" than XHTML, from the standpoint of extensibility, namespacing, code cleanliness, and so on. But HTML5 is simpler to write for people who knew HTML4, and easier to get right using the one ubiquitous web development practice: staring at the rendered result in your browser, which every web developer has installed. So it's "better", and ends up winning.
The horrible "be liberal in what you accept, be conservative in what you do" meme is the cause of this. At some point it was decided that it would be too rude to emit a compiler error on someone's junk HTML, so browsers just added hacks to make it work anyway. It has to be backwards compatible too; don't dare break behavior of an old tag that someone out there depends on. There's not much point in switching to XHTML2 with sane versioning, namespacing and extensibility if it's also going to be hacked around because no one wants their input rejected. It's hacks all the way down, by design, forever.
Had the XHTML2 standard been adopted by many, browsers would have still had to support all the other non-X HTML documents, which would have never disappeared.
HTML5, with the exception of the new elements, just formalized the existing web parsing strategies for better cross compatibility.
As a web user, I don’t miss the days of XHTML sites randomly completely breaking because some tag wasn’t closed.
As a developer, I only miss XHTML2’s support for `href` on any element.
> As a web user, I don’t miss the days of XHTML sites randomly completely breaking because some tag wasn’t closed.
Ironically that was a very rare event in practice. For it to even possibly happen, three things had to come together which were still uncommon even at XHTML's height:
1. The page had to be written as XHTML
2. The page had to be served as XHTML
3. The page had to be parsed as XHTML
Usually at least two of those things weren't happening.
The second didn't happen precisely because people wanted to avoid the sites completely breaking. So it didn't happen because people avoided XML, in practice.
It doesn't make much sense to state things I can't back up with references (I do not remember where I read it), but a few months ago I read that by 2006 (or 2008) 60% of all web pages were XHTML. Now, whether that means they were served with the right media type, I do not remember, but what I do know is that "nerdy" places, and that includes Steam, HumbleBundle.com and GOG.com, were all running on XHTML. So it was not just "the minds of a relatively small number of developers".
I'm not sure where your information comes from because it sounds very suspect to me. Sure XHTML was a thing in nerdy circles, I know because I was one of them, but we were hardly the majority.
The vast majority of sites were "HTML 4.01" others were at best "XHTML 1.0 Transitional" (which in practice meant the same thing). Those using pure XHTML were relatively few. And of those who did, no major site served it as such because it would have locked out IE users, IIRC.
I think both applications and media would benefit from a split of HTML into separate languages for defining app-like sites and traditional content pages. HTML and the DOM APIs are currently a strange mix of content-oriented semantic elements and app-oriented (often non-semantic) elements. Not to mention that the default flow layout does more harm than good for most applications.
I think this was the original vision of XML: eXtensible Markup Language. With globally unique names, XML was meant to produce a whole world of languages that could've been developed independently and mixed together in the same manner as Unix command-line tools. E.g. XSLT is a language for an XSLT processor; XSL-FO is a language for an XSL-FO processor; and so on. In this scheme XHTML would've been a language for the browser that would co-exist with the rest of the ecosystem.
> I think both applications and media would benefit from a split of HTML into separate languages for defining app-like sites and traditional content pages.
This! I would go a step further, even. The "web-applifier" community should just leave the classic web and do their own:
* protocol (I am sure, HTTP is not ideal for serving apps)
* GUI description language (document markup language for UI design, really?)
* runtime (let them have WebAssembly and whatever they need)
* each app then could have their own window, making it look like a traditional app
So... like Java Swing or Adobe AIR? I don't know all the reasons those technologies failed to deliver, but I do think that being able to deploy from the same platform that users use to discover the app (that is, the browser), as well as almost effortless instant updates, has a lot to be said for it.
The reason HTML5 became what it is is because many people wanted to see the open web thrive as a competitor to closed, controlled eco-systems like the mobile application development platforms.
Just 10 years ago this was a mainstream view - there were groups who were fighting to give browsers web cam access so they could be used as video-messaging platforms, groups fighting to give location access so we could write location aware documents and apps etc etc.
If you view this in the context of needing separate models for documents and application, it means that you wouldn't need all that substrate of APIs for documents (i.e, why would you need a webcam API or even AJAX to render a blog article?).
> i.e, why would you need a webcam API or even AJAX to render a blog article?
This seems like a lack of imagination. The modern scientific publication Distill.pub makes heavy use of AJAX (eg https://distill.pub/2019/activation-atlas/) and it's easy to imagine it using a webcam (eg, to demonstrate semantic segmentation)
One of the reasons a platform succeeds is because it can be many things to different people.
I completely reject the idea that there should be separate document and application models. Instead I think there are capabilities which should be layered to add abilities.
I once searched for a way to send PUT requests using HTML forms because I despise using JS for core functionality but still wanted to build a nice REST API. Somehow I stumbled across XForms which was part of XHTML2 and supports PUT and DELETE only to find out that no browser supports it.
Now I hope that the form extensions proposal [1] will gain some traction.
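For the record, an XForms submission doing a PUT would have looked roughly like this (element names are from XForms 1.1; no browser ever shipped it natively, so this is purely illustrative, and the resource URL is made up):

    <!-- assuming xmlns:xf="http://www.w3.org/2002/xforms" is declared on the root element -->
    <xf:model>
      <xf:instance>
        <data xmlns=""><title/></data>
      </xf:instance>
      <xf:submission id="save" resource="/articles/42" method="put" replace="none"/>
    </xf:model>
    <!-- ... later, in the body of the page ... -->
    <xf:input ref="title"><xf:label>Title</xf:label></xf:input>
    <xf:submit submission="save"><xf:label>Save</xf:label></xf:submit>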
The most common "solution" for this is to add middleware on the server-side which accepts a form parameter "_method" which supercedes the real HTTP method. I believe Rails uses this.
I'm aware of this workaround, but it's still a sad state of affairs, especially since POSTs have to be handled with more care because they aren't idempotent. So the browser will warn the user if they really want to resend the query if something goes wrong the first time.
I got on board the js train as a young developer just at the right time, when the term AJAX was something of a buzzword. Now my fellow developers give me dirty looks when I suggest our largely static site can do without redux, or that someone would conceive of building a simple accordion component in anything less than React.
i have been creating websites since the early 90s, and every single one of them was an application. html was always generated dynamically, and never static. any static html documents were embedded with a dynamically generated navigation.
the whole idea of a web of semantic hypertext was never a reality. that idea died with gopher. (anyone remember that?)
why? because gopher had a builtin navigation system that allowed you to manage directories and document structures that didn't belong to the documents themselves. semantic hypertext within gopher would have worked well. as would have applications. but without gopher we were forced to reinvent that navigation and squeeze it into our documents, overloading them with stuff that didn't belong inside.
think of a library with books. what the semantic hypertext promised was to make all those books into interactive texts where you can easily jump from one reference to another. but those books still need a library to live in.
what the web ended up doing was to remove the library completely, forcing me to reinvent the library within the book. suddenly the semantic document that i want to send you not only contains references relevant to its context, but it has to include the whole navigation for my library, because there is no way to do that externally. with that navigation included, you are no longer getting a semantic hypertext, but an application.
now, we get to write that application in javascript and actually run it on your device, instead of faking it on the server. but on the flip side, on the server i can now finally go back to serving static documents. i can finally serve semantic hypertext documents as they were meant to be served, because i can separate the application from the content, and i can treat the content as static as it was meant to be treated.
i am not reinventing navigation logic in javascript. browsers never had navigation logic in the first place. gopher had that. i always had to re-invent navigation logic for every site i built and was forced to embed that into html dynamically so that site visitors could find their way.
> The number of web browsers capable of consuming plain (X)HTML massively exceeds the number of web browser engines capable of consuming the modern application platform, a number which stands now at approximately two.
This piece gets many things wrong --- or at least fails to present both sides of the argument.
WHATWG did not "usurp" the W3C. The W3C abandoned HTML development by focusing on XHTML2, incompatible with HTML. This left an opening for someone to propose backwards-compatible extensions of HTML, since the W3C was explicitly not interested in that. WHATWG was formed to do this and produced HTML5. HTML5 was adopted by industry, XHTML2 was not. In an attempt to stay relevant, the W3C tried to stage a hostile takeover of HTML5. That attempt failed because the W3C had blown their credibility by that point. However there were still some advantages to having a W3C-approved HTML spec, so an agreement was reached where the W3C could approve the specs produced by WHATWG.
There are technical reasons why HTML <object> was not suitable for audio and video elements. For example, media elements need to expose media-specific JS APIs (e.g. seek()), but the MIME type of an <object> can change over time due to URL loading and DOM attribute changes, which would mean that the interface exposed by the element would need to change unpredictably over time, which would be a nightmare for developers. Also, there were very nasty legacy browser compatibility constraints around (mis)use of <object>.
The author misunderstands, or misrepresents, the WHATWG's spec design philosophy. Unlike the W3C, the WHATWG treated compatibility with existing Web content as essential. That means the WHATWG specifies existing browser behaviour where there is significant existing Web content that requires it. The W3C, on the other hand, tended to assume that Web developers pay attention to specs and that writing down conformance requirements would magically cause all Web content to be updated to satisfy them. Those assumptions are not true. (The idea that Web developers would migrate to XHTML2 because the W3C proclaimed it as the future was in the same vein.)
XML syntax for HTML failed for various reasons but not because of the WHATWG or browsers, which always supported XML syntax for HTML. One major problem is ensuring that dynamically generated XML pages are always valid XML. It is very easy to have bugs so that under some conditions (e.g. malicious user input) the server outputs invalid XML and produces a "yellow screen of death". Common examples of those bugs were bugs that allowed the Unicode 0xFFFE or 0xFFFF code points to slip into the XML output, which are not allowed in valid XML. A similar problem is when users interrupt a partial download of an XHTML file; the file has unclosed tags, so a conforming browser will replace the partially loaded and rendered document with a "yellow screen of death". This is not what users or developers actually want. (This assumes the browser bends the rules to allow partial rendering in completely loaded and validated XHTML documents, which is something users and developers do actually want.)
Exactly, browser manufacturers had a need for specifications that were of a higher quality than what W3C was producing in order to address (mostly) accidental implementation differences between different browsers. What was specified across existing specifications for html, css, and javascript was just not covering what was being shipped by browsers. And as these specifications were stuck in committees, browser developers had to just figure things out themselves.
HTML 5 started out as a W3C position paper to start the work of creating a backwards-compatible successor to HTML 4, backed by several browser developers. The proposal for this was rejected in favor of continuing work on the non-backwards-compatible XHTML 2. As these browser developers had a need for a spec, the WHATWG was formed to create what became HTML 5.
Several years later when it became clear that their specification was the closest thing to describing what browsers actually do, W3C basically endorsed it as a recommendation. However, the WHATWG continues to drive work as it has been highly successful in producing the high quality specifications required to achieve the high levels of interoperability between the remaining browser engines.
XHTML 2 gradually became redundant as most of the functionality that web site developers actually needed from browsers got absorbed by HTML 5. As browsers could implement spec changes as they were happening, within a few years of the working group forming, it had become a huge success in standardizing many new features.
At the same time, many mobile browsers simply disappeared as WebKit (a fork of KHTML by Apple) and Google's Chrome fork of WebKit became the norm on mobile. This was important because many W3C working groups were dominated by people from the mobile phone and telecom industry seeking to control the standard for long-forgotten things like WAP and the various mobile profiles of XHTML. Once decent browsers appeared on mobile (i.e. browsers that supported HTML 5), the need to continue work on XHTML 2 disappeared. Most of the companies that then dominated the mobile web were out-competed by Google and Apple, neither of whom was a major mobile player at the time the WHATWG was formed. By the time the W3C endorsed HTML 5, Android and iOS were dominating the mobile web to the point that even MS threw in the towel by first creating Edge (an HTML 5 browser with no backwards compatibility for IE-specific stuff), and then recently just switching to Chrome entirely.
I'm a bit conflicted. I like the idea of Web browsers as an application platform, though it borders on "reinventing the operating system." The File API, Notifications API, MediaSession, etc. all make it possible to essentially have cross-platform desktop applications that users don't need to manually install.
I also don't, and never really did, care much for "semantic hypertext," linked data, XHTML, XHTML2, or RDF, nor is my browser usage predominantly about just sharing/receiving information.
That being said, there's so much involved in writing a Web browser that writing one from scratch would take years. There are only a few active implementations, and I don't expect that change for a long time, if ever (at least not without a lot of funding). So I can understand the author's viewpoint.
If semantic markup had won out, there'd be less need for sophisticated search engines to make sense of all this stuff.
In fact, I'd go one further: If gopher had taken up multimedia quicker and then beat out the web, there would be no Google today.
These sort of things are not inherent to the web. People and entities want to transmit data in the format(s) that are easiest for them. It's up to the aggregators to make sense of it all.
A perfectly-marked-up HTML document only solves the problem for text expressed in HTML. It doesn’t solve any other format that people publish (even today we have links to PDF and Word and Excel, etc.). And of course there is meaningful information in formats that are not documents at all, like images. Some of those images are pictures of text that we might want to read.
It requires some investment to come up with tools for analyzing files. I’m wondering if the tools would have been as sophisticated if documents had lowered their barriers. For example, how do you justify developing a machine-learning model to look for more, if it seems all your documents are already semantically tagged with the details that were important to somebody?
And many times that choice is made for you: there’s a link to a Vulture.com post on HN right now (about Disney archiving the Fox catalogue) and every time I try to read it, Chrome crashes trying to keep up with the volumes of JavaScript they add, ostensibly to keep advertisers happy.
Are you running Chrome on an old phone, or perhaps with some extension which injects a lot of code? Looking at the page, it's certainly showing why ad blockers are popular, but it doesn't crash or use unusual amounts of memory on either desktop or mobile Chrome.
I guess no one would object if you want to write an SPA using HTML5 and ES2020 that retrieves an XML semantic Web document and displays it in all kinds of CSS glory.
The semantic web can only happen if paired with a new business model that is not advertising and e-commerce. Also, to get alignment from coders, they need to fall into the pit of success. So it needs to be a new thing, separate and distinct from HTML.