
Programmers and Sadomasochism - timf
http://alumnit.ca/~apenwarr/log/?m=200902#22
======
mtts
I'm not sure who this author is, but it seems to me he's overlooking the
enormous benefit that when a strict xml parser barfs on a file, it's
immediately and unequivocally clear which party in a multi party business
process is causing a problem and should fix things at their end.

If it weren't for xml, endless (and therefore costly) discussions would ensue
instead.

Also, on the more technical side, a barfing xml parser saves you from having
to spend hours hunting for subtle bugs introduced by a permissive parser that
is permissive in ways no one really understands fully.
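
To make the "immediately and unequivocally clear" part concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The order payload and the `check` helper are made up for illustration; the point is only that a strict parser names the exact spot where the input stops being XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the sku attribute value is missing its quotes.
BROKEN = '<order id="17">\n  <item sku=A-100 qty="2"/>\n</order>'
FIXED = '<order id="17">\n  <item sku="A-100" qty="2"/>\n</order>'


def check(document: str) -> str:
    """Return 'ok', or a message naming where the document stops being XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        return f"not well-formed at line {line}, column {column}"
    return "ok"


print(check(BROKEN))
print(check(FIXED))
```

Whoever produced the file gets a line and column to look at, and nobody has to argue about whose parser is at fault.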

------
icefox
Option 4

Bob produces an invalid XML file and gives it to Avery. Avery's permissive parser
that he wrote in an afternoon reads it MOSTLY fine, but gets a few things
wrong. Avery goes on with his work not realizing some of the data is
incorrect, and Bob doesn't need to pay a contractor.

------
olavk
The premise is a basic misunderstanding of HTML. <div align=right> does not
parse as HTML because HTML parsers are pragmatic and obey Postel's law. It
parses because it is syntactically correct HTML. HTML does not require quotes
around attributes if the attribute values are alphanumeric.

But if you write syntactically incorrect HTML, like forgetting to match a
quote (<div align="right>) you _will_ have a problem. The parser will not show
an error, but the following content may disappear. Worse, the result might be
different in different browsers. In the one you test with, things render like
you intended due to the browser-specific error recovery rules. In the browser
used by most of your audience, everything on the page until the next quote (in
a different tag further down the page) disappears without a trace.

HTML is a more complex grammar than XML which makes it harder to implement
correctly. XML, OTOH, was designed with the KISS principle in mind. Clearly some
characters need to be quoted, e.g. equals signs or whitespace in an attribute
value simply cannot be parsed unambiguously. Rather than having rules about
which characters require quotes (does an underscore in an attribute require
quotes? Does a dot?) XML just requires quotes always.
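
A rough sketch of the two rules side by side. As I read HTML 4.01 (section 3.2.2), an unquoted value may contain letters, digits, hyphens, periods, underscores and colons, which is slightly more than plain alphanumerics. The function names here are made up for illustration.

```python
import re

# Characters HTML 4.01 allows in an unquoted attribute value:
# letters, digits, hyphens, periods, underscores and colons.
_UNQUOTED_OK = re.compile(r"[A-Za-z0-9._:-]+")


def html4_needs_quotes(value: str) -> bool:
    """True if the value contains anything outside the unquoted character set."""
    return _UNQUOTED_OK.fullmatch(value) is None


def xml_needs_quotes(value: str) -> bool:
    """XML has no character rule to remember: every value is quoted."""
    return True


for sample in ["right", "main_nav", "v1.2", "a b", "x=y"]:
    print(repr(sample), html4_needs_quotes(sample), xml_needs_quotes(sample))
```

So under that reading an underscore or a dot does not force quotes in HTML 4, but you have to know the rule to be sure. The XML version of the function has nothing to get wrong.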

The poster argues that a parser should be able to parse malformed input, since
there might not be an easy way to get the producer to conform to the spec.
This is a good point, but there is no simple solution, because invalid XML is
by definition ambiguous. In many cases it is possible to guess what the
producer intended, but it basically has to be decided on a case-by-case
basis. You can't write a parser that can parse random invalid XML and always
reconstruct the data structure that the producer intended. This would require
the parser to read the mind of the producer, and if that were possible,
XML would not be needed at all.
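
A small illustration of the ambiguity, with a made-up input in which neither `<item>` nor `<qty>` is closed. Both repairs below are well-formed and both look plausible, yet they produce different trees. The `shape` helper is hypothetical and exists only to compare structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical broken input: <item> and <qty> are never closed.
BROKEN = "<order><item>widget<qty>2</order>"

# Two equally plausible repairs a lenient parser could choose.
REPAIR_NESTED = "<order><item>widget<qty>2</qty></item></order>"
REPAIR_SIBLINGS = "<order><item>widget</item><qty>2</qty></order>"


def shape(element):
    """Reduce an element to (tag, [child shapes]) so trees can be compared."""
    return (element.tag, [shape(child) for child in element])


nested = shape(ET.fromstring(REPAIR_NESTED))
siblings = shape(ET.fromstring(REPAIR_SIBLINGS))

print(nested)
print(siblings)
print("same structure:", nested == siblings)
```

Nothing in the broken bytes says which of the two the producer meant, so any automatic choice is a guess.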

In HTML5 there is activity to define canonical parse rules for almost any kind
of malformed input. This allows more interoperable parsers, but the price is
an extremely large and complex specification. Note also that this does not
automatically fix mistakes. An HTML author might forget one of the quotes
around an attribute value; with HTML5 there might be an official correct parse
for this case. However there is no guarantee that this parse is what the
author intended. There is simply no fool-proof way to fix content bugs on the
client side. Postel's Law has clear limits.

------
johngunderman
While the author does bring up a good point, I feel that he just skims the
surface of the major drawbacks of a permissive parser:

1\. It is hell to write a new parser for convoluted standards.

2\. Permissive parsers have to waste more CPU cycles trying to fix everything
(admittedly this is not as big of a problem in the current day and age)

3\. Weak standards lead to badly coded pages. Badly coded pages generally
have bad design. Badly designed pages don't help anyone.

4\. It drives the people who care crazy (Ok, maybe that's not a real reason
:))

5\. I'm sure there are others I can't think of.

~~~
peregrine
Let's face it, it's always best to accept poor formats. If you write the program
to handle it and clean it, then you're fine. If one program before it has a bug
and forgets a quotation mark, then instead of your whole system failing and you
needing to go through a long process of testing, debugging, documenting and
releasing, you save everyone time and money.

~~~
tlrobinson
I disagree. Postel's Law can be harmful. It's nice in theory, and if everyone
was "conservative in what [they] send" or "liberal in what [they] accept" in
the _same_ way, it would work well.

But it's a slippery slope. If one implementation accepts some obscure edge
case, and someone relies on it, from then on every implementation must go out
of its way to handle it in the same way. This leads to complex and bloated
software.

~~~
peregrine
You provide a good argument, but what I was trying to communicate was that the
sender is still sending a standard format. The coder just makes a bug in his
XML and instead of bringing down the whole house it's handled.

~~~
jerf
XML isn't "this sort of thing with tags and attributes". It's a rigidly-
specified format, and for any given series of bytes and an encoding, it either
is or is not XML. If there's a "bug in his XML", it isn't XML anymore.

You can talk about what to do with this not-XML, but you can't just pretend it
is XML. It isn't, because XML is not just this fuzzy thing, it's a very, very,
_very_ specific thing.

------
codeismightier
Why do we need to spell correctly? Why can't we use "u" for "you"? Why do we
need to end sentences with periods? Why do we have to use the correct grammar?

1) It's simpler for humans to read. The purpose of XML was to be somewhat
human readable and quotes help with that.

2) It makes it possible for people to throw together a basic parser if for some
reason they don't have access to the libraries (embedded, new language, etc).

3) Life would be a lot simpler if everyone followed the law (or specs).

------
llimllib
Mark Pilgrim did it better:

[1]: <http://diveintomark.org/archives/2004/01/08/postels-law>

[2]:
[http://diveintomark.org/archives/2004/01/14/thought_experime...](http://diveintomark.org/archives/2004/01/14/thought_experiment)

~~~
gruseom
Well, if you're going to play trump cards,
<http://news.ycombinator.com/item?id=447086>.

------
cschneid
There's another missing point here. Where is the line between "slightly broken
xml" and "horribly broken xml, since they actually sent a csv file". I could
argue that a csv file is simply really bad xml, why can't you parse it?

------
tolmasky
Well, it's exactly this mentality that got us the web as we have it today (for
better or worse). Clearly there are upsides to all this (some of which are
covered in this article), but there are also many real downsides that the
author completely misses.

For starters, it is now prohibitively difficult to create a brand new browser
from scratch because you have to check against the millions of existing pages
on the web, instead of simply implementing a standard. This is also partially
why the web moves so slowly: I can tell you first hand from working on a
browser that a lot of time that could have been spent implementing new
features is instead spent "making things work like IE". Thanks to this, we now
have a web where valid things don't work (huge pieces of HTML 4 and CSS 3 are
generally missing), while crazy invalid things do work. There is limited time
to work on things, and it is generally considered more important to emulate
someone else's bugs so that the user won't go back to IE than implement things
that have been in the standard since forever. Essentially what we have is the
Parser's Prisoner's Dilemma.

One of the key features in our product (280Slides's PPT import/export) would
have probably not been possible had Microsoft decided to accept any willy
nilly OOXML. But, since they only read strict versions of files, as long as we
conform to the spec we will be able to open (and compete) with PowerPoint.

Of course, all this misses the main point, which is that all this is really
about human-editable file formats vs program-editable file formats. Most
file formats (.doc .rtf .psd .ppt etc etc) don't have these strange problems
with accepting bad input, and that's because they aren't written by hand. HTML
is very much a hand written language and thus requires this. Many times XML
does not have this same constraint. A lot of time it is applications that are
generating XML, and so it is not so ridiculous to expect for it to be correct.
So a good rule of thumb should be whether you expect the incoming data to be
"machine generated" or "human generated".

The end "side note" that rendering on the web is different SOLELY due to
implementation details and not at all due to parsing is also just plain false.
Browsers are designed to accept all forms of tag soup, and when you have a web
page with all open tags and absolutely no closing tags, guess what, your
guess as to what on earth it means is probably going to be different from
that of some other browser, and it will lead to different rendering.

This brings us to a problem that actually hurts users: part of the reason I
have to check against every browser when I make a web page instead of just one
is that it is not enough to see whether something is correct in one browser,
because that browser may just be "being nice" to me. So, since correct
rendering is not a sign of correct HTML, I must now test every browser to see
whether it renders correctly as well. Thanks to this, we get the introduction
of even MORE technologies, such as "strict" rendering, so the standard becomes
even more convoluted.

~~~
tlrobinson
Agreed. Furthermore, there's a big difference between what HTML and XML are
typically used for. HTML is usually used to present some data to a human. It
doesn't matter _too_ much if it looks slightly different in different browsers
as long as it conveys the same information. But XML is used for storing and
exchanging data. If different consumers interpret ambiguous cases (such as
missing closing tags) differently, that could be a huge problem.

------
medearis
There's also a security reason for attributes to be enclosed in quotes when
we're talking about dynamic webpages. Avery's permissive parser might parse
the page "correctly," including an additional maliciously injected attribute
like onclick=sendcookie(), which wouldn't have been possible otherwise.
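
A minimal sketch of that injection, using Python's lenient `html.parser` as a stand-in for a permissive parser. The `AttrCollector` class, the `parsed_attrs` helper and the attacker string are all made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the attributes of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parsed_attrs(markup: str):
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs


# Hypothetical attacker-controlled value dropped into a page template.
user_input = "x onclick=sendcookie()"

unquoted = f"<div title={user_input}>hi</div>"
quoted = f'<div title="{escape(user_input, quote=True)}">hi</div>'

print(parsed_attrs(unquoted))  # the whitespace splits off a second attribute
print(parsed_attrs(quoted))    # the whole string stays inside title
```

Note that quoting only helps if quote characters inside the value are escaped too, which is what `escape(..., quote=True)` is for.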

------
DavidSJ
Too bad we don't use s-expressions.

~~~
olavk
How would that fix the problem? If you receive invalid s-expression data -
e.g. with unmatched parentheses or an unmatched quotation mark - how would you
parse it into the structure that the author intended?

~~~
rsheridan6
It wouldn't fix the problem, but since sexps aren't so damned ugly you
probably wouldn't see as many errors in the first place.

~~~
olavk
Here is xhtml for a link with mixed content:

    
    
       <a href="http://news.ycombinator.com">News for <b>Hackers</b>!</a>
    

How would you suggest equivalent s-expression syntax that leads to fewer
errors due to its beauty?

~~~
rsheridan6
Every Lisp hacker has written their own version of XML as s-expressions.
Here's one that actually has users other than its implementor:
<http://okmij.org/ftp/Scheme/SXML.html>

That snippet would look like:

    
    
        (a (@ (href "http://news.ycombinator.com")) "News for " (b "Hackers") "!")
    

I'm an Emacs user, so paren-based navigation, highlighting of areas within
matching parens, and concision because of the lack of closing tags wins for
me. Maybe for somebody using MS Notepad it wouldn't be so great, since they
don't have paren-based editing features so closing tags may actually be
useful.

The snippet you showed is so small that XML's verbosity doesn't become a
problem, so it doesn't really matter in this case.
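
For anyone who wants to try the "write the nested form, convert to XML" workflow without Scheme, here is a rough Python analogue. This is not SXML's API: the tuple layout `(tag, optional attrs dict, *children)` and the `to_xml` function are assumptions made up for this sketch.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(node) -> str:
    """Serialize a string or a (tag, [attrs dict], *children) tuple as XML."""
    if isinstance(node, str):
        return escape(node)
    tag, *rest = node
    attrs = {}
    if rest and isinstance(rest[0], dict):
        attrs, *rest = rest
    attr_text = "".join(
        f" {name}={quoteattr(value)}" for name, value in attrs.items()
    )
    body = "".join(to_xml(child) for child in rest)
    return f"<{tag}{attr_text}>{body}</{tag}>"


link = ("a", {"href": "http://news.ycombinator.com"},
        "News for ", ("b", "Hackers"), "!")
print(to_xml(link))
```

The closing tags and the attribute quoting are produced by the serializer, so the nested form has fewer places to make a typo, though you can still mismatch a paren or a quote.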

~~~
olavk
I honestly don't understand why you think the sexpr syntax will lead to fewer
errors? The syntax is just as complex, and there seem to be plenty of quote
signs and nested parentheses to forget or mismatch.

Of course I don't doubt that _you_ are able to write syntactically correct
sexprs with the help of Emacs. A great number of tools (including Emacs I'm
pretty sure) can help you the same way with XML. But for whatever reason -
buggy software, bad editors, uneducated developers - invalid XML still happens.
Why do you think substituting pointy brackets with rounded ones will change
that?

~~~
rsheridan6
>I honestly dont understand why you think the sexpr syntax will lead to fewer
errors?

I already backed off on the strong version of the claim - that s-expressions
would lead to fewer errors in general for the average person writing XML - but
I still prefer s-expressions because as a Scheme programmer I am accustomed to
using sexps and I find them cleaner and easier to use. People who don't use or
don't like Scheme or Lisp probably wouldn't feel the same way. It would
probably be harder to use sexprs in MS Notepad, like I said. Maybe even in
vim.

>The syntax is just as complex, and there seem to be plenty of quote signs and
nested parentheses to forget or mismatch.

Any decent text editor will highlight strings, so it will be difficult to
mismatch quotes. In Emacs, I like paredit-mode, which inserts a closing paren
every time you enter an opening paren, so you always have a balanced number of
parentheses. Combined with editor features that highlight whichever
s-expression the cursor is in, and auto-indentation which sticks out like a
sore thumb if you make a mistake, it's rare to mismatch parentheses and easy
to fix if you do.

I'm aware that there are Emacs modes for editing XML, and I tried them, but I
still found it worthwhile to write in SXML and then convert to XML, because
it's just cleaner and easier. Partly this is just because you use the same
editor mode for editing an s-expression-based XML-alike as you would use to
edit code, so the commands are already wired into your spinal reflexes, and
bound to the most convenient keys. But it's also less verbose when you get rid
of all the closing tags and half of the brackets, so you can see more of the
data at a time.

Also, if you're using an s-expression based language (and I know most people
aren't), it's convenient to use a data format which is simply part of your
language itself.

Of course, YMMV. If you're using, say, Python, you might prefer to use
something that looks like Python's dictionary syntax (JSON?). I don't really
know what Python programmers do. But I do know that there's a reason that
there are a jillion XML-knockoffs with different syntaxes - XML is friggin
ugly.

------
daleharvey
I don't really understand this argument:

1\. <div align=right>Hello, world!</div> is not easier for end users than <div
align="right">Hello, world!</div>, I would say it's harder.

2\. A tool can tell the end user that the former is right; if you depend on a
variation of permissive parsers, you can't be told whether it will be accepted.

3\. End users shouldn't be using any of these, they should really be tool
generated.

------
known
Programmers think in terms of what is RIGHT and WRONG.

MBAs think in terms of PRIORITIES.

~~~
dkarl
MBAs should think in terms of firing the programmer who writes his own XML
parser or generator, or uses a language _lacking_ a mature XML library for
production work.

Really, who in their right mind uses a homebrew XML parser for anything that
matters? And that point just blows away the article's entire argument, at
least with respect to XML.
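
A quick sketch of what a stock library buys you on the generating side, using Python's standard `xml.etree.ElementTree`. The element names and values are made up; the awkward characters are there on purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical data containing characters that hand-rolled generators
# tend to forget to escape.
order = ET.Element("order", {"note": 'say "hi" & <wave>'})
item = ET.SubElement(order, "item", {"sku": "A-100"})
item.text = "Fish & chips <large>"

# The library quotes every attribute and escapes &, < and " as needed.
document = ET.tostring(order, encoding="unicode")
print(document)

# A strict parser reads it back without complaint and the data is unchanged.
round_trip = ET.fromstring(document)
print(round_trip.get("note"))
print(round_trip[0].text)
```

None of the escaping is written by hand, which is the whole argument for not rolling your own generator.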

------
DannoHung
I wish someone had invented YAML before SGML existed.

Then maybe we wouldn't have this problem.

~~~
snprbob86
SGML actually has some really nice properties explicitly designed to better
handle parsing invalid files. The redundant close tags, for example, allow
<b>this is bold <i>and italic</b> but not bold</i> \-- these sorts of errors
are made by people and it was a goal to handle those sorts of situations.
Whether or not that was a good goal? Well, I guess that's part of this debate.

