If it weren't for XML, endless (and therefore costly) discussions would ensue instead.
Also, on the more technical side, a barfing XML parser saves you from spending hours hunting for subtle bugs introduced by a parser that is permissive in ways no one fully understands.
Bob produces an invalid XML file and gives it to Avery. Avery's permissive parser, which he wrote in an afternoon, reads it MOSTLY fine, but gets a few things wrong. Avery goes on with his work not realizing some of the data is incorrect, and Bob doesn't need to pay a contractor.
But if you write syntactically incorrect HTML, like forgetting to close a quote (<div align="right>), you will have a problem. The parser will not show an error, but the following content may disappear. Worse, the result might be different in different browsers. In the one you test with, things render as you intended thanks to that browser's error-recovery rules. In the browser used by most of your audience, everything on the page until the next quote (in a different tag further down the page) disappears without a trace.
HTML has a more complex grammar than XML, which makes it harder to implement correctly. XML, OTOH, was designed with the KISS principle. Clearly some characters need to be quoted: e.g. equals signs or whitespace in an attribute value simply cannot be parsed unambiguously. Rather than having rules about which characters require quotes (does an underscore in an attribute require quotes? Does a dot?), XML just requires quotes always.
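To illustrate that "quotes always" rule (a small sketch using Python's standard xml.etree module, not from the original comment): a conforming XML parser rejects the unquoted form outright rather than guessing.

```python
import xml.etree.ElementTree as ET

# The quoted form is well-formed XML and parses fine.
ok = ET.fromstring('<div align="right">Hello</div>')
print(ok.get('align'))  # right

# The unquoted form is rejected outright; the parser refuses to guess.
try:
    ET.fromstring('<div align=right>Hello</div>')
except ET.ParseError as err:
    print('rejected:', err)
```

Every conforming XML parser behaves the same way here, which is exactly the interoperability the comment is arguing for.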
The poster argues that a parser should be able to parse malformed input, since there might not be an easy way to get the producer to conform to the spec. This is a good point, but there is no simple solution, because invalid XML is by definition ambiguous. In many cases it is possible to guess what the producer intended, but it basically has to be decided on a case-by-case basis. You can't write a parser that can parse arbitrary invalid XML and always reconstruct the data structure the producer intended. That would require the parser to read the mind of the producer, and if that were possible, XML would not be needed at all.
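For a concrete instance of that ambiguity (a Python sketch, my addition): given <a><b></a>, did the producer mean <a><b></b></a> or <a></a><b></b>? A strict parser refuses to pick one.

```python
import xml.etree.ElementTree as ET

# Did the producer mean <a><b></b></a> or <a></a><b></b>?
# A strict parser won't guess; it reports the mismatched tag instead.
try:
    ET.fromstring('<a><b></a>')
except ET.ParseError as err:
    print('rejected:', err)
```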
In HTML5 there is activity to define canonical parse rules for almost any kind of malformed input. This allows more interoperable parsers, but the price is an extremely large and complex specification. Note also that this does not automatically fix mistakes. An HTML author might forget one of the quotes around an attribute value; with HTML5 there might be an official, correct parse for this case. However, there is no guarantee that this parse is what the author intended. There is simply no fool-proof way to fix content bugs on the client side. Postel's Law has clear limits.
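To see the contrast (a sketch using Python's lenient html.parser module; the class name is made up): an HTML parser just keeps going on input a conforming XML parser would reject, recovering silently.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records every start tag the lenient parser recognizes."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

# Unclosed tags, no complaints: the parser recovers without any error.
p = TagCollector()
p.feed('<div><p>never closed')
p.close()
print(p.tags)  # ['div', 'p']
```

Whether that silent recovery matches the author's intent is exactly what the comment says no spec can guarantee.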
1. It is hell to write a new parser for convoluted standards.
2. Permissive parsers have to waste more CPU cycles trying to fix everything (admittedly this is not as big a problem in the current day and age).
3. Weak standards lead to badly coded pages. Badly coded pages generally have bad design. Badly designed pages don't help anyone.
4. It drives the people who care crazy (Ok, maybe that's not a real reason :))
5. I'm sure there are others I can't think of.
But it's a slippery slope. If one implementation accepts some obscure edge case, and someone relies on it, from then on every implementation must go out of its way to handle it in the same way. This leads to complex and bloated software.
You can talk about what to do with this not-XML, but you can't just pretend it is XML. It isn't, because XML is not just this fuzzy thing, it's a very, very, very specific thing.
Worse, in the case of browsers, each browser vendor wants to be sure every major site works in their browser, so they have to go out of their way to program in all these special cases. See tolmasky's comment above.
1) It's simpler for humans to read. The purpose of XML was to be somewhat human-readable, and quotes help with that.
2) It makes it possible for people to throw together a basic parser if for some reason they don't have access to the libraries (embedded, new language, etc.)
3) Life would be a lot simpler if everyone followed the law (or specs).
For starters, it is now prohibitively difficult to create a brand new browser from scratch because you have to check against the millions of existing pages on the web, instead of simply implementing a standard. This is also partially why the web moves so slowly: I can tell you first hand from working on a browser that a lot of time that could have been spent implementing new features is instead spent "making things work like IE". Thanks to this, we now have a web where valid things don't work (huge pieces of HTML 4 and CSS 3 are generally missing), while crazy invalid things do work. There is limited time to work on things, and it is generally considered more important to emulate someone else's bugs so that the user won't go back to IE than implement things that have been in the standard since forever. Essentially what we have is the Parser's Prisoner's Dilemma.
One of the key features in our product (280 Slides' PPT import/export) would probably not have been possible had Microsoft decided to accept any willy-nilly OOXML. But since they only read strict versions of files, as long as we conform to the spec we will be able to open PowerPoint's files (and compete with it).
Of course, all this misses the main point, which is that this is really about human-editable file formats vs. program-editable file formats. Most file formats (.doc, .rtf, .psd, .ppt, etc.) don't have these strange problems with accepting bad input, and that's because they aren't written by hand. HTML is very much a hand-written language and thus requires this. XML often does not have this same constraint: a lot of the time it is applications that are generating XML, so it is not so ridiculous to expect it to be correct. So a good rule of thumb is whether you expect the incoming data to be "machine generated" or "human generated".
The end "side note" that rendering on the web differs SOLELY due to implementation details and not at all due to parsing is also just plain false. Browsers are designed to accept all forms of tag soup, and when you have a web page with all open tags and absolutely no closing tags, guess what: your guesses as to what on earth it means are probably going to differ from some other browser's, and that will lead to different rendering.
This brings us to a problem that actually hurts users: part of the reason I have to check against every browser when I make a web page instead of just one is that it is not enough to see whether something is correct in one browser, because that browser may just be "being nice" to me. So, since correct rendering is not a sign of correct HTML, I must now test every browser to see whether it renders correctly as well. Thanks to this, we get the introduction of even MORE technologies, such as "strict" rendering, so the standard becomes even more convoluted.
<a href="http://news.ycombinator.com">News for <b>Hackers</b>!</a>
That snippet would look like:
(a (@ (href "http://news.ycombinator.com")) "News for " (b "Hackers") "!")
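A minimal sketch of how such an s-expression form could map back to XML (Python, with nested tuples standing in for sexprs; the helper name and attribute convention are made up, loosely following SXML's (@ ...) form):

```python
from xml.sax.saxutils import escape, quoteattr

def sexp_to_xml(node):
    # Bare strings are character data; escape them for XML.
    if isinstance(node, str):
        return escape(node)
    tag, rest = node[0], list(node[1:])
    attrs = ""
    # An optional ('@', (name, value), ...) child carries attributes, as in SXML.
    if rest and isinstance(rest[0], tuple) and rest[0] and rest[0][0] == '@':
        attrs = "".join(f" {name}={quoteattr(value)}" for name, value in rest[0][1:])
        rest = rest[1:]
    children = "".join(sexp_to_xml(child) for child in rest)
    return f"<{tag}{attrs}>{children}</{tag}>"

tree = ('a', ('@', ('href', 'http://news.ycombinator.com')),
        "News for ", ('b', "Hackers"), "!")
print(sexp_to_xml(tree))
# <a href="http://news.ycombinator.com">News for <b>Hackers</b>!</a>
```

Note that the converter never needs error recovery: a malformed tuple simply fails, which is the strictness being argued over in this thread.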
The snippet you showed is so small that XML's verbosity doesn't become a problem, so it doesn't really matter in this case.
Of course I don't doubt that you are able to write syntactically correct sexprs with the help of Emacs. A great number of tools (including Emacs, I'm pretty sure) can help you the same way with XML. But for whatever reason - buggy software, bad editors, uneducated developers - invalid XML still happens. Why do you think replacing pointy brackets with rounded ones will change that?
I already backed off on the strong version of the claim - that s-expressions would lead to fewer errors in general for the average person writing XML - but I still prefer s-expressions because as a Scheme programmer I am accustomed to using sexps and I find them cleaner and easier to use. People who don't use or don't like Scheme or Lisp probably wouldn't feel the same way. It would probably be harder to use sexprs in MS Notepad, like I said. Maybe even in vim.
>The syntax is just as complex, and there seem to be plenty of quote signs and nested parentheses to forget or mismatch.
Any decent text editor will highlight strings, so it will be difficult to mismatch quotes. In Emacs, I like paredit-mode, which inserts a closing paren every time you enter an opening paren, so you always have a balanced number of parentheses. Combined with editor features that highlight whichever s-expression the cursor is in, and auto-indentation which sticks out like a sore thumb if you make a mistake, it's rare to mismatch parentheses and easy to fix if you do.
I'm aware that there are Emacs modes for editing XML, and I tried them, but I still found it worthwhile to write in SXML and then convert to XML, because it's just cleaner and easier. Partly this is just because you use the same editor mode for editing an s-expression-based XML-alike as you would use to edit code, so the commands are already wired into your spinal reflexes, and bound to the most convenient keys. But it's also less verbose when you get rid of all the closing tags and half of the brackets, so you can see more of the data at a time.
Also, if you're using an s-expression based language (and I know most people aren't), it's convenient to use a data format which is simply part of your language itself.
Of course, YMMV. If you're using, say, Python, you might prefer something that looks like Python's dictionary syntax (JSON?). I don't really know what Python programmers do. But I do know that there's a reason there are a jillion XML knockoffs with different syntaxes - XML is friggin ugly.
1. <div align=right>Hello, world!</div> is not easier for end users than <div align="right">Hello, world!</div>; I would say it's harder.
2. A tool can tell the end user that the former is wrong; if you depend on a variation of permissive parsers, you can't be told whether it will be accepted.
3. End users shouldn't be writing any of these by hand; they should really be tool generated.
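That last point is easy to honor in practice (a sketch with Python's standard library; any XML library works similarly): let the tool do the quoting and escaping, and the output is correct by construction.

```python
import xml.etree.ElementTree as ET

# Build the element through an API; quoting and escaping are automatic,
# so it's impossible to emit the unquoted <div align=right> form.
div = ET.Element('div', align='right')
div.text = 'Hello, world!'
print(ET.tostring(div, encoding='unicode'))
# <div align="right">Hello, world!</div>
```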
MBAs think in terms of PRIORITIES.
Really, who in their right mind uses a homebrew XML parser for anything that matters? And that point just blows away the article's entire argument, at least with respect to XML.
Then maybe we wouldn't have this problem.
YAML is great for exchanging data but SGML and YAML target different problems.