
Harmful Consequences of Postel's Maxim - nabla9
https://tools.ietf.org/html/draft-thomson-postel-was-wrong-01
======
Animats
HTML suffers badly from this. Even basic lexical syntax, as in tag and comment
syntax, is not strictly enforced. Browsers became very forgiving, and in HTML
5, recovery from a long list of common errors, including comments with bad
syntax, was formalized in the spec. There are pages of error recovery gimmicks
defined in the spec, and HTML 5 parsers have to do all that stuff. (This is
one reason why html5parser for Python is so slow.) There used to be tricks for
doing different things in IE and Firefox based on how they handled errors.

On the first error, a browser should display an error bar in the middle of the
page, then proceed on a best-effort basis, ignoring most styling. Bad pages
would still be readable in emergencies, but they'd be annoying enough to get
fixed.

(Until you write a web crawler, you don't realize how bad HTML is in the wild.
There are low-level syntax errors. There are pages with more than one <html>,
<head>, or <body> section, or where those sections are out of order. There are
nesting errors which result in page trees a thousand levels deep. There are
tags that don't belong in HTML. There's syntactically incorrect ad code from
major vendors.)
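
(As an illustration of how forgiving this is, assuming html5lib is installed:
feed it markup with duplicate and out-of-order sections and, per the HTML5
recovery rules, it still hands back a single well-formed tree without
complaint.)

    import html5lib
    import xml.etree.ElementTree as ET

    # duplicate <body>, <html> appearing mid-document, unclosed tags
    broken = "<body><p>one</body><html><head></head><body><p>two"
    tree = html5lib.parse(broken)        # no exception: recovery is mandatory
    print(ET.tostring(tree).decode())    # one well-formed document tree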

~~~
gsnedders
> This is one reason why html5parser for Python is so slow.

I presume you mean html5lib and not the new html5-parser; html5lib isn't slow
because of error recovery: html5lib is slow because it's a parser written in
pure Python. Last I benchmarked the VM, parsing time was largely taken up by
allocation overhead (heck, reading s[0] of a string s _causes an allocation_)
and by super-simple VM instructions (primarily dispatch overhead).

What I'd like to see is how much faster an HTML parser would be if it rejected
any document at its first parse error; I doubt it'd be much faster (you'd
still need the branches to detect the parse error, though the code might be
smaller and you'd get better cache locality).
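
(A rough sketch, not html5lib's actual code, of why per-character work in pure
Python is expensive regardless of error handling: every s[i] goes through
bytecode dispatch and, depending on the character, may hand back a fresh
one-character string, whereas the equivalent builtin does the whole scan in C.)

    import timeit

    s = "x" * 1_000_000

    def scan_chars(s):
        # touch every character via indexing: one bytecode dispatch (and
        # possibly a fresh 1-char str) per character
        n = 0
        for i in range(len(s)):
            if s[i] == "<":
                n += 1
        return n

    def scan_builtin(s):
        # the same count done entirely inside the C implementation of str
        return s.count("<")

    print(timeit.timeit(lambda: scan_chars(s), number=10))
    print(timeit.timeit(lambda: scan_builtin(s), number=10))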

~~~
kbenson
Have a super-strict engine that parses as fast as possible; on a parse error,
fail completely, fall back to a more lax engine, and start parsing from
scratch. People who care about speed will notice that they can get a speedy
page by being strict, and those who don't care will see a slight slowdown in
their page times. Automatic incentivization of good future behavior.
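
(A minimal sketch of the idea in Python, assuming lxml and html5lib are
available: try a strict, fast parser first and only fall back to full HTML5
error recovery when it throws.)

    from lxml import etree
    import html5lib

    def parse_page(text):
        try:
            # strict path: bails out at the first syntax error
            strict = etree.XMLParser(recover=False)
            return etree.fromstring(text, parser=strict)
        except etree.XMLSyntaxError:
            # lenient path: full spec-mandated HTML5 error recovery
            return html5lib.parse(text)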

~~~
inimino
This idea comes up a lot, but it is based on faulty assumptions. First, the
time spent parsing HTML is irrelevant compared to network latency and even to
the DOM and CSS work that browsers do. Second, a parser that handled only,
say, strict XML would certainly be easier for its author to write, but it
would be slower than the optimized parsers browsers already have for HTML;
even if optimized, it still would not be significantly faster than a full
HTML5 parser.

Your idea wouldn't work because it would never have any noticeable impact on
page-loading speed, even in the best-case scenario. And that's assuming
browsers want to maintain and ship two totally different parsers, which (cf.
XHTML) they don't.

~~~
kbenson
> time parsing HTML is irrelevant compared to network latency and even DOM and
> CSS work that browsers do.

This is likely true for most sites on the internet. It is worth noting that on
an overly large HTML file of ~1.4 MB, stuffing that content into an HTML
element object by assigning it to innerHTML takes approximately 75 ms on my
i7-6700HQ.[1] A much larger file of 10 MB took 656 ms to parse, which suggests
it scales linearly and processes ~16 KB/ms (16 MB/s). For the heaviest real
sites I could find, it generally took no longer than 25 ms.

It's worth noting that 16 MB/s is only slightly faster than 100 Mb/s, so some
people may actually spend longer parsing the HTML than receiving it. That
said, the parsing/processing tested here may be doing more than what we
strictly care about for this discussion, so I'm not offering it as a rebuttal,
just as a point of reference I thought was interesting.

1:

    var el = document.createElement('html');
    console.log((new Date()).getTime());
    el.innerHTML = htmlstr;  // htmlstr holds the test page's HTML
    console.log((new Date()).getTime());

------
jasode
The IETF title makes it sound like developers everywhere were using Postel's
words as guidance.

Instead, relaxed or permissive parsing of files/protocols/APIs _even though
they have well-documented specifications_ is an unavoidable emergent
phenomenon. The various examples across domains are fascinating:

- HTML (missing tags, unpaired tags, etc.). Developers may prefer to parse
HTML strictly, with any non-compliance resulting in a "page error". But the
_websurfers want that page data_, so developers end up guessing the HTML
author's intentions and rendering the page.

- PDF (malformed pdf files that Adobe can read). Lots of utilities out there
create bad pdf files, and Adobe Acrobat has lots of workarounds to read them.
Even though Adobe Inc. _controls the pdf specification_, they still succumb to
the pressure of adding workarounds to their parser to render broken pdf files.
To add to the insanity, _all 3rd-party industrial-strength pdf parsers_ end up
copying Adobe Acrobat's behavior when parsing broken pdf files!

- Win32 API. Programmers out in the wild will notice an undocumented behavior
of an API that's not in the contract and _start to depend on it._ When a new
version of Windows is released that removes the undocumented behavior and
breaks the 3rd-party app, the customer blames the "Microsoft Windows upgrade"
and not Quicken or the videogame company. Raymond Chen has written several
articles on Microsoft adding "shims" to the Windows codebase to help
"guarantee compatibility" for misbehaving apps incorrectly using the Win32
API. If the 3rd-party app is important enough to consumers that it prevents
them from upgrading Windows (in other words, from paying Microsoft money),
Microsoft will bend over backwards to accommodate the vendor's badly written
code.

The external forces are too great to enforce perfect discipline of strict
parsing across all parties. Even big proprietary companies like Microsoft and
Adobe can't enforce standards-compliant parsing of their own formats. An open
standard from the IETF would have no chance at all.

We'd like to think of a "computing standard" as some inviolable contract but
history has shown it is actually an organic (and unspoken) "social" agreement.
This is unavoidable unless we all agree to have all software "approved" by a
central authority before anyone can download or use it.

~~~
dozzie
> Developers may prefer to strictly parse HTML and any non-compliance results
> in a "page error". But the _websurfers want that page data_ [...]

Note that a similar thing happened with SSL/TLS connections for a long time:
developers may prefer to strictly verify X.509 certificates, but websurfers
want that page data. Developers bowing to that pressure and accepting invalid
certificates ended up hurting everybody.

Nowadays the situation is much better, because browser developers made
visiting sites with self-signed, expired, or incorrectly named certificates
significantly harder.

~~~
jasode
_> developers may prefer to strictly verify X.509 certificates, but websurfers
want that page data._

I categorize that scenario in a different bucket because novice users don't
understand the security implications of broken TLS. So, the users don't want
that data _but they don't know it_. The developers in this case are looking
out for the user. (Same security situation as developers helping the user by
restricting cross-site scripting or address bar hijacking.)

To me, that's not the same as the forgiving parsers of HTML and PDF. Whether
the HTML is missing </p> closing tags like this:

    
    
      <p>This is paragraph 1.
      <p>This is paragraph 2.
    

Or is 100% compliant like this:

    
    
      <p>This is paragraph 1.</p>
      <p>This is paragraph 2.</p>
    
    

... the websurfer just wants the page rendered in both cases.

~~~
dozzie
> To me, that's not the same as the forgiving parsers of HTML and PDF.

Oh, of course it's not the same. The point is, in both cases users want
strictness, but don't know that they want it. Users want web pages that are
easy for their browsers to parse; they "want" it to a different degree, yes,
even as an indirect "want", but still.

If browsers fail to display a web page with invalid markup, developers are
forced to fix their sh&t instead of adding another exception to an already big
pile of "steaming let's not" in the standard. And we have a precedent for
going down this road with X.509 certificates. I think it's a pity that the
road is too hard to take for something not as critical as SSL/TLS.

------
jimrandomh
The problem is that there are two very different contexts: users and
developers. If a developer is testing their new program by having it interface
with your program, you want to be strict in what you accept; this enables them
to make their program strict in what it emits, and helps bugs surface early.
But if an end-user is hooking your program up to some other program, then you
want to be generous in what you'll accept, so that it'll work.

Compilers solve this by having an intermediate level of strictness: warnings,
which a developer providing input to the compiler is expected to fix, but
which an end-user doesn't have to. This captures the benefits of both strict
and lax input checking.

It helps to make similar distinctions in other contexts, whether that's having
an explicit notion of "warnings", or just putting messages in log-file output.
It also helps if a format has a well-known extra-strict checker around, which
developers can test their programs' output on.
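
(A small sketch of that warnings middle ground in Python; the parse_record
function and its comma-separated record format are invented here purely for
illustration. Lenient by default, but a developer can promote the warnings to
hard errors.)

    import warnings

    def parse_record(line):
        fields = line.split(",")
        if len(fields) > 2:
            # accept the input, but flag the deviation
            warnings.warn("extra fields ignored: %r" % fields[2:])
        return fields[:2]

    # end-user mode: works, with a note on stderr
    print(parse_record("a,b,c"))

    # developer mode: the same input becomes a hard failure
    warnings.simplefilter("error")
    parse_record("a,b,c")   # raises UserWarning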

~~~
zAy0LfpBZLC8mAC
> But if an end-user is hooking your program up to some other program, then
> you want to be generous in what you'll accept, so that it'll work.

No, you absolutely don't. The only reason why you would possibly need that in
the first place is because software tends to be far too forgiving. Where
software enforces protocol/format compliance, a normal end user essentially
never gets to see any deviating input, because generating software is written
to the standard right from the start, and in the few cases where that fails,
the actual bug gets fixed.

Also, you actually can't. When the standard does not specify the meaning of
some input, then there is no meaning. Whatever meaning you invent for it is
just that: your invention. The next piece of software will likely interpret
things differently. You do not actually have a standardized format just
because multiple implementations speak the same syntax; they also need to have
the same semantics. Just doing something in response to broken input does not
produce interoperability, only undefined behaviour.

------
nabla9
I like /u/netsettelr's comments on reddit.

[https://www.reddit.com/r/programming/comments/6u1jq2/the_har...](https://www.reddit.com/r/programming/comments/6u1jq2/the_harmful_consequences_of_postels_maxim_an/dlpe1hx/)

------
Bartweiss
Interestingly, I seem to have the opposite complaint about XML parsers. People
generate crappy, broken XML all the time, but many of the services that digest
it are horribly picky and unwilling to take even valid inputs.

Some refuse empty fields and demand they not be included as tags. Some demand
a specific list of tags and insist all be included even if empty. Some clearly
have non-recursive parsing and only honor certain tree depths. Some have
completely hard-coded parsers and only accept certain tags _in certain
orders_. The list goes on.

Is this just Sturgeon's Law, where things are broken in both directions? Or is
there some predictable reason that some tools become overly permissive, and
others overly restrictive?

~~~
captainmuon
I have the same problem with JSON. Sure, when I'm using JSON/REST APIs everything
works fine, but I deal a lot with JSON on disk, and it happens a lot that you
have sloppy JSON because people hand-edit it. Trailing commas and comments are
the most common, but sometimes you just have js/python object literals dumped
into a file.

In every language I work in, I find myself writing a lenient JSON parser. So
far, Python, C++, and Nim (for JS, there is already JSON5).
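
(A toy version of the kind of leniency I mean, not any of the parsers above:
strip comments and trailing commas, then hand the rest to the stock json
module. It's naive about comment markers or commas inside string literals,
which a real lenient parser has to handle.)

    import json, re

    def loads_lenient(text):
        text = re.sub(r'//[^\n]*', '', text)               # line comments
        text = re.sub(r'/\*.*?\*/', '', text, flags=re.S)  # block comments
        text = re.sub(r',\s*([}\]])', r'\1', text)         # trailing commas
        return json.loads(text)

    print(loads_lenient('{"a": 1, /* hand-edited */ "b": [2, 3,],}'))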

I am also working on a whitespace- and comment-preserving parser that round-
trips, so you can change a value in a configuration file without losing
comments. (Currently, I always produce standard JSON, so compatibility is not
an issue.)

I have been thinking about publishing these parsers, but after seeing how much
abuse and nasty comments the JSON5 guy got, I have been holding off...

~~~
nly
> I am also working on a whitespace- and comment-preserving parser that
> round-trips, so you can change a value in a configuration file without
> losing comments.

I am 80% done writing such a parser for an over-engineered legacy config file
format. It isn't that hard really; you just need a parser framework/library
that can output a complete parse tree (a tree where a preorder traversal
covers every byte in the input contiguously, exactly once).

The remaining 20% of the work is keeping track of the line and column numbers
(almost there), cleaning up the code, and being able to reserialize the AST to
a binary format (for reloading in another process where I don't want to have
to rerun the original PEG).
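
(A toy illustration of that property, not the parser described above: if every
leaf token carries its exact source text, whitespace and comments included,
then joining the leaves in order reproduces the input byte for byte, so edits
elsewhere in the tree can't disturb them.)

    import re

    # every byte of the input lands in exactly one token, in order
    TOKEN = re.compile(r'\s+|#[^\n]*|\w+|.')

    def tokenize(text):
        return TOKEN.findall(text)

    src = "timeout = 30   # seconds\n"
    assert "".join(tokenize(src)) == src   # lossless round trip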

------
amelius
If developers can't even be bothered to get protocols right, how the hell are
they going to get security right?

------
akkartik
[https://www.joelonsoftware.com/2008/03/17/martian-
headsets](https://www.joelonsoftware.com/2008/03/17/martian-headsets)

------
RcouF1uZ4gsC
I think the benefits of Postel's Maxim have far outweighed the negatives. One
of the reasons the web took off is that normal, everyday, non-geeky people
could in a few minutes crank out a webpage that could be viewed by the whole
world. The browsers just made it work and generally look OK.

If we did not have Postel's Maxim, I think the Internet would have just
remained the domain of geeks and not have become what it is today.

~~~
zAy0LfpBZLC8mAC
That's bullshit. It's not hard to crank out a standards-compliant web page if
your browser tells you where what you wrote is broken. Certainly a lot easier
than cranking out a web page that looks good in a bunch of browsers that all
have their own idea of what HTML means. If you think that browsers just made
things work, you never tried to build a web page that actually displayed
correctly in both IE and Netscape at the time.

------
MichaelBurge
I've worked with a few APIs that are "liberal in what they emit, and
conservative in what they accept", which gets you the worst of both worlds.

------
js2
Why do the text versions of this draft have the header "Elephants Out, Donkeys
In"? Is that political commentary unrelated to the draft?

------
fizixer
I like the webpage format. They even made sure (apparently) that printing
preserves the layout and page numbering.

~~~
0xffff2
Interesting. I find it to be one of the most egregious examples of bad
skeuomorphic design. It makes no sense to me to break documents up into pages
when viewing them on the web, and in Firefox's print preview it looks like the
"pages" won't even print properly (the last line or two gets bumped to the
next page).

~~~
zAy0LfpBZLC8mAC
Except that's not really what it is. RFCs predate the web and were originally
written in this format. With existing standards you don't want to change
anything unnecessarily, because you never know whether you might change the
meaning somewhere as a side effect, and you also want consistency between
standards from the same standards body, so the format was kept. Also, a plain
text format was long seen as having the big advantage that it is not easily
corrupted by bad rendering software, so some HTML markup was simply added
after the fact to help with navigation and such, while otherwise preserving
the presentation.

However, the IETF is moving to markup as the authoritative source, so the
pseudo plain text version might be losing relevance.

~~~
0xffff2
I see what you're saying, but I disagree that it's a substantial rebuttal to
my comment. For any IETF document that wasn't originally drafted on a
typewriter, this style is absolutely skeuomorphic. I don't necessarily mind it
as a default, but I really wish the pure plain-text document omitted the
headers and footers.

~~~
zAy0LfpBZLC8mAC
Nah, it's not really skeuomorphic, because it doesn't really imitate anything,
at least until recently. The authoritative format was plain text, and that was
intended to be printable (originally probably on what were in effect
computer-driven typewriters). If you want printable plain text, you kind of
have to have headers and footers on every "page" (and page-break characters
between pages to accommodate different paper sizes, which are actually there
if you look closely at the plain-text version). The HTML version of those RFCs
does not so much imitate the format as render the authoritative document as
true to the original as possible, so as to not risk any unintended changes in
meaning, while adding some of the benefits of hypertext.

------
kutkloon7
From the page that the top comment from two years ago [1] linked to: "This
statement is based upon a terrible misunderstanding of Postel's robustness
principle. I knew Jon Postel. He was quite unhappy with how his robustness
principle was abused to cover up non-compliant behavior, and to criticize
compliant software.

Jon's principle could perhaps be more accurately stated as 'In general, only a
subset of a protocol is actually used in real life. So, you should be
conservative and only generate that subset. However, you should also be
liberal and accept everything that the protocol permits, even if it appears
that nobody will ever use it.'"

[1]
[https://news.ycombinator.com/item?id=9827669](https://news.ycombinator.com/item?id=9827669)

~~~
whipoodle
Interesting! That is definitely different from how the maxim is commonly
understood.

Edit: actually, some of the replies in the thread you linked disagree with
this remembrance, and state that it really did mean what we take it to mean
today.

~~~
jcranmer
The citations for the replies don't really back it up.

The context for the various iterations of Postel's Law generally suggests that
they're referring to the possibility that you might be seeing the result of
mismatched versions of a specification. The later iterations also give an
explicit example of what they mean: don't assume that enumerations in the
specification are closed (i.e., assume that future revisions may add
additional enumeration values). There is absolutely no evidence that he was
advocating trying to parse slop at all.
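
(To make the open-enumeration point concrete, a tiny invented sketch: treat
values the current spec revision hasn't defined as "unknown" to be carried
along or ignored, rather than rejecting the whole message because the set was
assumed closed.)

    # status codes defined by the spec revision this software was written to
    KNOWN_STATUSES = {"ok", "retry", "failed"}

    def classify(status):
        # a later revision may add new values; accept them as "unknown"
        # instead of treating the whole message as malformed
        return status if status in KNOWN_STATUSES else "unknown"

    print(classify("ok"))        # ok
    print(classify("deferred"))  # unknown (perhaps added in a future revision)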

