Programmers and Sadomasochism (alumnit.ca)
43 points by timf on Feb 22, 2009 | 35 comments

I'm not sure who this author is, but it seems to me he's overlooking the enormous benefit that when a strict XML parser barfs on a file, it's immediately and unequivocally clear which party in a multi-party business process is causing a problem and should fix things at their end.

If it weren't for XML, endless (and therefore costly) discussions would ensue instead.

Also, on the more technical side, a barfing XML parser saves you from having to spend hours hunting for subtle bugs introduced by a permissive parser that is permissive in ways no one really understands fully.
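A minimal Python sketch of that point, assuming the standard library's `xml.etree.ElementTree` stands in for the strict parser (the thread names no particular parser, and the sample payload is made up). The parser refuses the whole file and reports where it stopped, so there is no half-read data to argue about.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload from the producing party: <item> is never closed.
payload = "<order><item>1</order>"

def check_well_formed(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return line, column, str(err)
    return None

problem = check_well_formed(payload)
if problem is not None:
    line, column, message = problem
    print(f"rejected at line {line}, column {column}: {message}")
```

The consumer can forward that message to the producer as-is; nothing was silently guessed.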

Option 4

Bob produces an invalid XML file and gives it to Avery. Avery's permissive parser that he wrote in an afternoon reads it MOSTLY fine, but gets a few things wrong. Avery goes on with his work not realizing some of the data is incorrect, and Bob doesn't need to pay a contractor.

The premise is a basic misunderstanding of HTML. <div align=right> does not parse as HTML because HTML parsers are pragmatic and obey Postel's Law. It parses because it is syntactically correct HTML. HTML does not require quotes around attribute values if the values are alphanumeric.

But if you write syntactically incorrect HTML, like forgetting to match a quote (<div align="right>) you will have a problem. The parser will not show an error, but the following content may disappear. Worse, the result might be different in different browsers. In the one you test with, things render like you intended due to the browser-specific error recovery rules. In the browser used by most of your audience, everything on the page until the next quote (in a different tag further down the page) disappears without a trace.

HTML is a more complex grammar than XML, which makes it harder to implement correctly. XML, OTOH, was designed around the KISS principle. Clearly some characters need to be quoted - e.g. equals signs or whitespace in an attribute value simply cannot be parsed unambiguously. Rather than having rules about which characters require quotes (does an underscore in an attribute require quotes? Does a dot?) XML just requires quotes always.
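The difference between the two grammars on an unquoted attribute value can be checked with Python's standard library. This is only a sketch: `html.parser` is a lenient tokenizer, not an HTML validator, so it shows that the value is read as a single attribute rather than proving validity, and the helper names here are made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class StartTagAttrs(HTMLParser):
    """Collect the attributes of every start tag seen."""
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

def html_attrs(markup):
    parser = StartTagAttrs()
    parser.feed(markup)
    parser.close()
    return parser.attrs

def xml_accepts(markup):
    """True if the strict XML parser takes the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

snippet = "<div align=right>Hello, world!</div>"
print(html_attrs(snippet))   # [('align', 'right')]
print(xml_accepts(snippet))  # False
print(xml_accepts('<div align="right">Hello, world!</div>'))  # True
```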

The poster argues that a parser should be able to parse malformed input, since there might not be an easy way to get the producer to conform to the spec. This is a good point, but there is no simple solution, because invalid XML is by definition ambiguous. In many cases it is possible to guess what the producer intended, but it basically has to be decided on a case-by-case basis. You can't write a parser that can parse random invalid XML and always reconstruct the data structure that the producer intended. This would require that the parser was able to read the mind of the producer, and if that was the case, XML would not be needed at all.
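To make the ambiguity concrete, here is a small Python sketch with a made-up broken document and two hand-written repairs. Both repairs are well-formed, yet they describe different data, and nothing in the broken input says which one the producer meant.

```python
import xml.etree.ElementTree as ET

# Made-up input: both </item> end tags are missing.
broken = "<order><item>1<item>2</order>"

# Two plausible repairs; each is well-formed XML.
as_siblings = "<order><item>1</item><item>2</item></order>"
as_nested = "<order><item>1<item>2</item></item></order>"

def is_well_formed(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def top_level_items(text):
    """Count the <item> elements directly under the root."""
    return len(ET.fromstring(text).findall("item"))

print(is_well_formed(broken))        # False
print(top_level_items(as_siblings))  # 2
print(top_level_items(as_nested))    # 1
```

An order with two items or an order with one item containing another: a permissive parser has to pick one, and the consumer never learns that a choice was made.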

In HTML5 there is activity to define canonical parse rules for almost any kind of malformed input. This allows more interoperable parsers, but the price is an extremely large and complex specification. Note also that this does not automatically fix mistakes. An HTML author might forget one of the quotes around an attribute value; with HTML5 there might be an official correct parse for this case. However there is no guarantee that this parse is what the author intended. There is simply no fool-proof way to fix content bugs on the client side. Postel's Law has clear limits.

While the author does bring up a good point, I feel that he just skims the surface of the major drawbacks of a permissive parser:

1. It is hell to write a new parser for convoluted standards.

2. Permissive parsers have to waste more CPU cycles trying to fix everything (admittedly this is not as big of a problem in the current day and age).

3. Weak standards lead to badly coded pages. Badly coded pages generally have bad design. Badly designed pages don't help anyone.

4. It drives the people who care crazy (Ok, maybe that's not a real reason :))

5. I'm sure there are others I can't think of.

Let's face it, it's always best to accept poor formats. If you write the program to handle it and clean it, then you're fine. If one program before it has a bug and forgets a quotation mark, then instead of your whole system failing and you needing to go through a long process of testing, debugging, documenting and releasing, you save everyone time and money.

I disagree. Postel's Law can be harmful. It's nice in theory, and if everyone was "conservative in what [they] send" or "liberal in what [they] accept" in the same way, it would work well.

But it's a slippery slope. If one implementation accepts some obscure edge case, and someone relies on it, from then on every implementation must go out of its way to handle it in the same way. This leads to complex and bloated software.

You provide a good argument, but what I was trying to communicate was that the sender is still sending a standard format. The coder just makes a bug in his XML, and instead of bringing down the whole house, it's handled.

XML isn't "this sort of thing with tags and attributes". It's a rigidly-specified format, and for any given series of bytes and an encoding, it either is or is not XML. If there's a "bug in his XML", it isn't XML anymore.

You can talk about what to do with this not-XML, but you can't just pretend it is XML. It isn't, because XML is not just this fuzzy thing, it's a very, very, very specific thing.

If the parser allows the programmer to be sloppy they may not know they're making a mistake. If they try to use another parser that doesn't handle their mistake the same way, it will break.

Worse, in the case of browsers, each browser vendor wants to be sure every major site works in their browser, so they have to go out of their way to program in all these special cases. See tolmasky's comment above.

Why do we need to spell correctly? Why can't we use "u" for "you"? Why do we need to end sentences with periods? Why do we have to use the correct grammar?

1) It's simpler for humans to read. The purpose of XML was to be somewhat human readable and quotes help with that. 2) It makes it possible for people to throw together a basic parser if for some reason they don't have access to the libraries (embedded, new language, etc). 3) Life would be a lot simpler if everyone followed the law (or specs).

Well, if you're going to play trump cards, http://news.ycombinator.com/item?id=447086.

There's another missing point here. Where is the line between "slightly broken xml" and "horribly broken xml, since they actually sent a csv file"? I could argue that a csv file is simply really bad xml, so why can't you parse it?

Well, it's exactly this mentality that got us the web as we have it today (for better or worse). Clearly there are upsides to all this (some of which are covered in this article), but there are also many real downsides that the author completely misses.

For starters, it is now prohibitively difficult to create a brand new browser from scratch because you have to check against the millions of existing pages on the web, instead of simply implementing a standard. This is also partially why the web moves so slowly: I can tell you first hand from working on a browser that a lot of time that could have been spent implementing new features is instead spent "making things work like IE". Thanks to this, we now have a web where valid things don't work (huge pieces of HTML 4 and CSS 3 are generally missing), while crazy invalid things do work. There is limited time to work on things, and it is generally considered more important to emulate someone else's bugs so that the user won't go back to IE than implement things that have been in the standard since forever. Essentially what we have is the Parser's Prisoner's Dilemma.

One of the key features in our product (280Slides's PPT import/export) would have probably not been possible had Microsoft decided to accept any willy nilly OOXML. But, since they only read strict versions of files, as long as we conform to the spec we will be able to open (and compete) with PowerPoint.

Of course, all this misses the main point, which is that this is really about human-editable file formats vs program-editable file formats. Most file formats (.doc .rtf .psd .ppt etc etc) don't have these strange problems with accepting bad input, and that's because they aren't written by hand. HTML is very much a hand-written language and thus requires this. Many times XML does not have this same constraint. A lot of the time it is applications that are generating XML, and so it is not so ridiculous to expect it to be correct. So a good rule of thumb should be whether you expect the incoming data to be "machine generated" or "human generated".

The end "side note" that rendering on the web is different SOLELY due to implementation details and not at all due to parsing is also just plain false. Browsers are designed to accept all forms of tag soup, and when you have a web page with all open tags and absolutely no closing tags, guess what: your guess as to what on earth it means is probably going to be different from some other browser's, and that will lead to different rendering.

This brings us to a problem that actually hurts users: part of the reason I have to check against every browser when I make a web page instead of just one is that it is not enough to see whether something is correct in one browser, because that browser may just be "being nice" to me. So, since correct rendering is not a sign of correct HTML, I must now test every browser to see whether it renders correctly as well. Thanks to this, we get the introduction of even MORE technologies, such as "strict" rendering, so the standard becomes even more convoluted.

Agreed. Furthermore, there's a big difference between what HTML and XML are typically used for. HTML is usually used to present some data to a human. It doesn't matter too much if it looks slightly different in different browsers as long as it conveys the same information. But XML is used for storing and exchanging data. If different consumers interpret ambiguous cases (such as missing closing tags) differently, that could be a huge problem.

My favorite thing about browsers is that they sniff the content sent to them to see what it is. They do this in several different cases, such as when the web server does not send the content type, or to see whether what was sent is probably a GIF or JavaScript or maybe XML even if you say it is something else. The amount of sniffing and guessing is really amazing. The spec requires that you specify a content type. Guessing what a file is just seems like a world of hurt and pain in the long run.
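For a feel of what that guessing looks like, here is a deliberately tiny Python sketch. The signature table and the function name are hypothetical and cover only a few well-known leading byte sequences; real browsers follow far more elaborate rules.

```python
from typing import Optional

# Hypothetical signature table: (leading bytes, guessed content type).
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"<?xml", "text/xml"),
]

def sniff_content_type(body: bytes, declared: Optional[str] = None) -> str:
    """Guess a content type from the leading bytes, overriding the declared one."""
    for magic, guessed in SIGNATURES:
        if body.startswith(magic):
            return guessed
    # No signature matched: fall back to what the server said, if anything.
    return declared or "application/octet-stream"

# The server says text/plain, the bytes say GIF, and the sniffer sides with the bytes.
print(sniff_content_type(b"GIF89a\x01\x00", declared="text/plain"))  # image/gif
```

Every entry in that table is a case where the consumer overrides what the producer declared, which is where the long-run hurt comes from.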

There's also a security reason for attributes to be enclosed in quotes when we're talking about dynamic webpages. Avery's permissive parser might parse the page "correctly," including an additional maliciously injected attribute like onclick=sendcookie(), which wouldn't have been possible otherwise.
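A sketch of that injection in Python, using the standard library's lenient `html.parser` as a stand-in for a permissive parser. The template, the `title` attribute and the helper names are made up for illustration; `sendcookie()` is the payload from the comment above.

```python
from html import escape
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Record every attribute the parser reports on start tags."""
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

def parsed_attrs(markup):
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs

user_input = "x onclick=sendcookie()"  # attacker-controlled value

# Unquoted template: the space in the value ends the attribute early.
unquoted = f"<div title={user_input}>"
# Quoted and escaped template: the whole value stays inside one attribute.
quoted = f'<div title="{escape(user_input, quote=True)}">'

print(parsed_attrs(unquoted))  # [('title', 'x'), ('onclick', 'sendcookie()')]
print(parsed_attrs(quoted))    # [('title', 'x onclick=sendcookie()')]
```

Quoting alone is not sufficient in general; the `escape(..., quote=True)` call is what keeps a value containing a `"` from breaking out of the quoted form.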

Too bad we don't use s-expressions.

How would that fix the problem? If you receive invalid s-expression data - e.g. with unmatched parentheses or an unmatched quotation mark - how would you parse it into the structure that the author intended?
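The question can be made concrete with a small strict reader. This is a hypothetical Python sketch, not any real Lisp reader: it handles only parentheses, double-quoted strings and bare atoms, and like a strict XML parser it refuses malformed input instead of guessing.

```python
import re

# One token: "(", ")", a double-quoted string, or a bare atom.
TOKEN = re.compile(r'\s*(\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+)')

def read_sexp(text):
    """Parse all top-level forms strictly; raise ValueError on malformed input."""
    text = text.rstrip()
    pos = 0
    stack = [[]]  # stack[0] collects the top-level forms
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:  # e.g. a string whose closing quote is missing
            raise ValueError(f"bad token at offset {pos}")
        token = match.group(1)
        pos = match.end()
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at offset {match.start(1)}")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(token)
    if len(stack) > 1:
        raise ValueError(f"{len(stack) - 1} unclosed '('")
    return stack[0]

print(read_sexp('(b "Hackers")'))  # [['b', '"Hackers"']]
```

Calling `read_sexp('(b "Hackers"')` or `read_sexp('(b "Hackers)')` raises `ValueError`; deciding what the author meant would take the same guesswork as with XML.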

It wouldn't fix the problem, but since sexps aren't so damned ugly you probably wouldn't see as many errors in the first place.

Here is xhtml for a link with mixed content:

   <a href="http://news.ycombinator.com">News for <b>Hackers</b>!</a>
How would you suggest equivalent s-expression syntax that leads to fewer errors due to its beauty?

Every Lisp hacker has written their own version of XML as s-expressions. Here's one that actually has users other than its implementor: http://okmij.org/ftp/Scheme/SXML.html

That snippet would look like:

    (a (@ (href "http://news.ycombinator.com")) "News for " (b "Hackers") "!")

I'm an Emacs user, so paren-based navigation, highlighting of areas within matching parens, and concision because of the lack of closing tags wins for me. Maybe for somebody using MS Notepad it wouldn't be so great, since they don't have paren-based editing features so closing tags may actually be useful.

The snippet you showed is so small that XML's verbosity doesn't become a problem, so it doesn't really matter in this case.

I honestly don't understand why you think the sexpr syntax will lead to fewer errors? The syntax is just as complex, and there seem to be plenty of quote signs and nested parentheses to forget or mismatch.

Of course I don't doubt that you are able to write syntactically correct sexprs with the help of Emacs. A great number of tools (including Emacs I'm pretty sure) can help you the same way with XML. But for whatever reason - buggy software, bad editors, uneducated developers - invalid XML still happens. Why do you think substituting pointy brackets with rounded ones will change that?

>I honestly don't understand why you think the sexpr syntax will lead to fewer errors?

I already backed off on the strong version of the claim - that s-expressions would lead to fewer errors in general for the average person writing XML - but I still prefer s-expressions because as a Scheme programmer I am accustomed to using sexps and I find them cleaner and easier to use. People who don't use or don't like Scheme or Lisp probably wouldn't feel the same way. It would probably be harder to use sexprs in MS Notepad, like I said. Maybe even in vim.

>The syntax is just as complex, and there seem to be plenty of quote signs and nested parentheses to forget or mismatch.

Any decent text editor will highlight strings, so it will be difficult to mismatch quotes. In Emacs, I like paredit-mode, which inserts a closing paren every time you enter an opening paren, so you always have a balanced number of parentheses. Combined with editor features that highlight whichever s-expression the cursor is in, and auto-indentation which sticks out like a sore thumb if you make a mistake, it's rare to mismatch parentheses and easy to fix if you do.

I'm aware that there are Emacs modes for editing XML, and I tried them, but I still found it worthwhile to write in SXML and then convert to XML, because it's just cleaner and easier. Partly this is just because you use the same editor mode for editing an s-expression-based XML-alike as you would use to edit code, so the commands are already wired into your spinal reflexes, and bound to the most convenient keys. But it's also less verbose when you get rid of all the closing tags and half of the brackets, so you can see more of the data at a time.
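The write-in-SXML-then-convert workflow can be sketched in a few lines of Python. Assumptions: nested tuples stand in for s-expressions, an optional `("@", ...)` tuple in second position carries the attributes as in the SXML snippet earlier in the thread, and `sxml_to_xml` is a made-up name rather than part of any SXML toolkit.

```python
from xml.sax.saxutils import escape, quoteattr

def sxml_to_xml(node):
    """Serialize a nested-tuple stand-in for SXML to an XML string."""
    if isinstance(node, str):
        return escape(node)  # text node: escape &, < and >
    tag, *rest = node
    attrs = ""
    # An optional ("@", (name, value), ...) tuple right after the tag holds attributes.
    if rest and isinstance(rest[0], tuple) and rest[0] and rest[0][0] == "@":
        attrs = "".join(f" {name}={quoteattr(value)}" for name, value in rest[0][1:])
        rest = rest[1:]
    children = "".join(sxml_to_xml(child) for child in rest)
    return f"<{tag}{attrs}>{children}</{tag}>"

link = ("a", ("@", ("href", "http://news.ycombinator.com")),
        "News for ", ("b", "Hackers"), "!")

print(sxml_to_xml(link))
# <a href="http://news.ycombinator.com">News for <b>Hackers</b>!</a>
```

Because the serializer emits every closing tag and quotes every attribute itself, the hand-written side only has to get the parentheses right.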

Also, if you're using an s-expression based language (and I know most people aren't), it's convenient to use a data format which is simply part of your language itself.

Of course, YMMV. If you're using, say, Python, you might prefer to use something that looks like Python's dictionary syntax (JSON?). I don't really know what Python programmers do. But I do know that there's a reason that there are a jillion XML-knockoffs with different syntaxes - XML is friggin ugly.

Don't you know? If you put a bunch of parentheses near each other, they form an AI that can parse invalid S-Expressions. I thought everyone knew that!

The syntactic bloat of XML brings out many of these problems.

But we do. Today they're called XML. :-)

I don't really understand this argument:

1. <div align=right>Hello, world!</div> is not easier for end users than <div align="right">Hello, world!</div>; I would say it's harder.

2. A tool can tell the end user that the former is right. If you depend on a variation of permissive parsers, you can't be told whether it will be accepted.

3. End users shouldn't be using any of these; they should really be tool generated.

Programmers think in terms of what is RIGHT and WRONG.

MBAs think in terms of PRIORITIES.

MBAs should think in terms of firing the programmer who writes his own XML parser or generator, or uses a language lacking a mature XML library for production work.

Really, who in their right mind uses a homebrew XML parser for anything that matters? And that point just blows away the article's entire argument, at least with respect to XML.

Wrong. :-\

I wish someone had invented YAML before SGML existed.

Then maybe we wouldn't have this problem.

SGML actually has some really nice properties explicitly designed to better handle parsing invalid files. The redundant close tags, for example, allow <b>this is bold <i>and italic</b> but not bold</i> -- these sorts of errors are made by people and it was a goal to handle those sorts of situations. Whether or not that was a good goal? Well, I guess that's part of this debate.

So, how do you parse invalid YAML unambiguously?

Don't insult SGML!! SGML is actually quite nice. With omissible and case-insensitive tags, it's pretty easy to work with if it is used in its intended domain: document markup.

YAML is great for exchanging data but SGML and YAML target different problems.
