

HTTP as Imagined versus HTTP as Found - denzil_correa
http://programmingisterrible.com/post/50237666844/http-as-imagined-versus-http-as-found

======
kabdib
In the late 1990s I was at a start-up doing a hardware-based HTTP proxy. The
cases of people screwing up the protocol that you have to handle are
mind-numbing. For a while, every day was a new special case. Every stack out
there has its own special kink.

I would love to see standards come with reference code and executable tests.
We do this for crypto because "crypto is hard" and nobody expects implementors
to get RSA or SHA correct in a vacuum -- there are published test vectors.

Why don't we do the same with other protocols? I think we've proven that it's
possible to get even really simple standards (... which HTTP isn't, by the
way) incorrect. You won't find stuff like race conditions, but you'll catch
that CRLF versus LF nonsense.

[HTTP is also a horrible standard, but I don't think we have a good handle on
not letting that kind of thing get out in the wild.]
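
To make that concrete, here's roughly what published test vectors for an HTTP
parser could look like, as a Go table-driven test (parseRequestLine is a
made-up stand-in for whatever implementation is under test; the vectors are
illustrative, not normative):

    package httpvectors

    import (
        "strings"
        "testing"
    )

    // parseRequestLine is a hypothetical parser under test; a real
    // implementation would be dropped in here.
    func parseRequestLine(raw string) (method, target, version string, ok bool) {
        line := strings.TrimRight(raw, "\r\n") // tolerate CRLF or bare LF
        parts := strings.Split(line, " ")
        if len(parts) != 3 {
            return "", "", "", false
        }
        return parts[0], parts[1], parts[2], true
    }

    // Each vector pins down one piece of line-ending behaviour, the way
    // crypto test vectors pin down one input/output pair.
    func TestRequestLineVectors(t *testing.T) {
        vectors := []struct {
            in     string
            wantOK bool
        }{
            {"GET / HTTP/1.1\r\n", true},   // canonical CRLF
            {"GET / HTTP/1.1\n", true},     // bare LF, widely tolerated
            {"GET  / HTTP/1.1\r\n", false}, // doubled space: reject
        }
        for _, v := range vectors {
            if _, _, _, ok := parseRequestLine(v.in); ok != v.wantOK {
                t.Errorf("parseRequestLine(%q): ok=%v, want %v", v.in, ok, v.wantOK)
            }
        }
    }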

~~~
GhotiFish
Reminds me of TOML.

Someone made a battery of tests that TOML parsers have to pass.

<https://github.com/BurntSushi/toml-test>

It tests a ton of edge cases and communicates clear expectations. Test suites
are a great way to enforce an implementation's adherence to the specification.
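
As far as I can tell, toml-test treats your decoder as a black box: it feeds
TOML on stdin and compares the JSON your decoder prints against a known-good
fixture. A minimal harness in that spirit (binary name and paths made up)
could look like:

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
        "strings"
    )

    func main() {
        // Error handling elided for brevity; this is a sketch, not a tool.
        fixtures, _ := filepath.Glob("tests/valid/*.toml")
        for _, f := range fixtures {
            input, _ := os.ReadFile(f)
            want, _ := os.ReadFile(strings.TrimSuffix(f, ".toml") + ".json")

            cmd := exec.Command("./my-toml-decoder") // decoder under test
            cmd.Stdin = bytes.NewReader(input)
            got, err := cmd.Output()
            if err != nil || !bytes.Equal(bytes.TrimSpace(got), bytes.TrimSpace(want)) {
                fmt.Printf("FAIL %s\n", f)
                continue
            }
            fmt.Printf("ok   %s\n", f)
        }
    }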

~~~
to3m
Sadly there's nothing official for HTTP. This strikes me as a bit daft,
because the spec allows a surprising amount of leeway, and there's a
surprising amount of annoying, fiddly syntax to it. (I have no idea whose
interests this is supposed to serve, because both client and server have to
both read and write this stuff. If they'd decided to make one side complicated
so the other could be simple, that would be understandable, if barely less
annoying.)

And-HTTPD comes with some test data: <http://www.and.org/and-httpd/> (I've not
looked into this in any depth, except to note its existence, but I wish I'd
seen it when I was working on my server. The chap also has an amusing page
about the HTTP spec, which I basically agree with:
<http://www.and.org/texts/server-http>)

For Web Sockets, there's a good test suite here, but again, it's not official:
<http://autobahn.ws/testsuite>

------
zimbatm
It's really hard to implement an HTTP client or server properly. It takes
years to stabilize. That's why, if you can, you should use an existing
solution. cURL is your friend.

~~~
zzzcpan
I disagree. It's actually pretty easy. It might seem hard at first, but in
reality there are just a few things one needs to know to support all clients
correctly. Like disabling keepalive in some scenarios for some user-agents,
choosing proper compression, padding small non-200 responses, etc. And you can
look up all of those things in nginx or apache.
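
To make two of those concrete, a rough Go sketch (the MSIE match and the
512-byte threshold are my loose recollection of nginx's keepalive_disable and
msie_padding behaviour, not anything authoritative):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // quirks wraps a handler with the keepalive workaround mentioned above.
    func quirks(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Old MSIE mishandles keep-alive in some cases; just close.
            if strings.Contains(r.UserAgent(), "MSIE 6") {
                w.Header().Set("Connection", "close")
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Pad small error bodies past ~512 bytes so old IE shows them
            // instead of its own "friendly" error page.
            w.WriteHeader(http.StatusNotFound)
            body := "not found\n"
            if pad := 512 - len(body); pad > 0 {
                body += strings.Repeat(" ", pad)
            }
            fmt.Fprint(w, body)
        })
        http.ListenAndServe(":8080", quirks(h))
    }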

~~~
zimbatm
That's exactly the kind of attitude I am trying to warn against. I have
encountered, time and time again, people with that exact same attitude, but I
have yet to see an HTTP implementation that doesn't get bug fixes years after
its start.

From my experience, even an implementation with limited scope will need at
least a year to fully stabilize. Just point me to one and I'll be happy to
revise my judgement.

EDIT: As an example, take the http_parser in nodejs. It's just a small part of
the whole, but it was still getting bug fixes a year after nodejs started. I
don't have access to the repo right now (GH is down) but here is the source:
<https://github.com/joyent/http-parser>

~~~
zzzcpan
No, you are still wrong. I have written HTTP parsers a few times in different
languages and can probably do it in my sleep. It was never hard. If you work
on web servers a lot you tend to remember all the quirks, but if you don't you
can always look into nginx. And you are probably thinking about different
kinds of bugs when talking about stabilization. The reason it takes a long
time for some implementations to stabilize is usually a poor language choice,
not a variety of incorrect protocol implementations. Sure, there are some
people who have no clue and try to make their own implementation without
looking into nginx or apache, so your concerns are right, but they don't apply
to everyone.

~~~
zimbatm
What kind of language are you using?

The most successful implementation of an HTTP parser that I have seen in terms
of correctness/time-to-implement was done using Ragel for the mongrel ruby web
server (see <http://rubygems.org/gems/mongrel>).

I'd have to look back again, but if I remember correctly, the parser was
written and validated with fuzz testing in a week or so, and was pretty much
stable after that. That was just the parsing; the rest of the web server took
a couple more months to stabilize.
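
For what it's worth, fuzzing a parser like that is cheap nowadays. With Go's
built-in fuzzer, for example, something like this throws arbitrary bytes at a
request parser (here the stdlib one, not Ragel/mongrel) and just checks it
never panics or returns inconsistent results:

    package httpfuzz

    import (
        "bufio"
        "net/http"
        "strings"
        "testing"
    )

    // Run with: go test -fuzz=FuzzReadRequest
    func FuzzReadRequest(f *testing.F) {
        // Seed corpus: one canonical CRLF request, one bare-LF request.
        f.Add("GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        f.Add("GET / HTTP/1.1\nHost: example.com\n\n")
        f.Fuzz(func(t *testing.T, raw string) {
            req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
            if err == nil && req == nil {
                t.Fatal("nil request with nil error")
            }
        })
    }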

~~~
zzzcpan
> What kind of language are you using?

Go, C, Perl and others.

------
chacham15
As someone who has written his own HTTP client, I think that the author
greatly exaggerates the difficulty of doing so, because he assumes that all
clients are meant to be completely fault tolerant. The client that I wrote was
used for only a handful of websites, all of which have had zero quirks found
to date.

However, I still don't get his examples:

> For example: Headers can span multiple lines.

This is by design: "Header fields can be extended over multiple lines by
preceding each extra line with at least one SP or HT" [1]. The example he
gives does not have the whitespace in front, but that is a particular
variation that isn't too hard to deal with.
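
Handling it is only a few lines. A sketch of the unfolding step in Go
(assuming the header block has already been split into lines, and "join with a
single space" as the fold rule):

    package headers

    import "strings"

    // unfold joins continuation lines (lines starting with SP or HT)
    // onto the previous header line, per the rule quoted above.
    func unfold(lines []string) []string {
        var out []string
        for _, line := range lines {
            if len(line) > 0 && (line[0] == ' ' || line[0] == '\t') && len(out) > 0 {
                // Continuation: fold into the previous line with one space.
                out[len(out)-1] += " " + strings.TrimLeft(line, " \t")
                continue
            }
            out = append(out, line)
        }
        return out
    }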

> The length of a response body is indicated by a mixture of the response
> code, the Transfer-Encoding header, Content-Length header, Connection header
> (and the request method).

It has been my experience so far that if there is a Content-Length header,
then that is the length of the body, end of story. The only case in which I
resort to using the Connection header is the one in which there is no
Transfer-Encoding or Content-Length. I still have never come across a case in
which I needed to know the request method to be able to parse the result.
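
(For what it's worth, I believe HEAD is where the request method matters: a
response to a HEAD can carry a Content-Length yet no body.) Here's the
decision order as I understand it, sketched in Go; it's a paraphrase of the
RFC rules, not production code:

    package bodylen

    import (
        "net/textproto"
        "strconv"
        "strings"
    )

    // bodyLength returns the response body length in bytes, with -1 meaning
    // "read until the connection closes" (the Connection-header case), and
    // chunked=true meaning "decode chunks" (the length comes from them).
    func bodyLength(method string, status int, h textproto.MIMEHeader) (n int64, chunked bool) {
        // 1. Some responses never have a body, whatever the headers say.
        if method == "HEAD" || status == 204 || status == 304 ||
            (status >= 100 && status < 200) {
            return 0, false
        }
        // 2. Transfer-Encoding: chunked wins over Content-Length.
        if strings.EqualFold(h.Get("Transfer-Encoding"), "chunked") {
            return -1, true
        }
        // 3. Content-Length, if present and parseable.
        if cl := h.Get("Content-Length"); cl != "" {
            if v, err := strconv.ParseInt(cl, 10, 64); err == nil && v >= 0 {
                return v, false
            }
        }
        // 4. Fall back to reading until the server closes the connection.
        return -1, false
    }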

> Get messages can have a body, but not every server knows this.

I didn't know that. I would like to see more information about this too.
Still, this is irrelevant when writing a client.

In conclusion, this article points out that there may be a lot of variation
between implementations of the HTTP protocol, but I would have liked to know
the actual servers and conditions that cause them to behave that way.

[1] <http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html>

~~~
__david__
> > Get messages can have a body, but not every server knows this.

> I didn't know that. I would like to see more information about this too.

I had an argument with a co-worker about this. The spec is a bit ambiguous
about GET requests with bodies.

Section 4.3 of RFC 2616 [1] says:

       A message-body MUST NOT be included in
       a request if the specification of the request method (section 5.1.1)
       does not allow sending an entity-body in requests. A server SHOULD
       read and forward a message-body on any request; if the request method
       does not include defined semantics for an entity-body, then the
       message-body SHOULD be ignored when handling the request.

Section 9.3 [2] is the section about GET requests and it never mentions
request bodies (they call them entities in section 9) at all.

Section 9.8 [3], "TRACE", says:

       A TRACE request MUST NOT include an entity.

So, 4.3 can be interpreted as: the spec does not disallow it, so it's allowed.
But it can also be interpreted as: it isn't explicitly allowed in the GET
section, so it isn't allowed. I think the wording is bad enough that it can be
interpreted either way, especially because some methods explicitly allow
request bodies (PUT, POST) and some explicitly disallow them (TRACE).

My personal opinion is that if they meant to disallow bodies in GETs then they
would have said "...does not _explicitly_ allow sending..." in section 4.3.
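
It's easy to poke at this from the client side, too; Go's client API, for
one, will happily send a GET with a body (httpbin.org here is just a
convenient target, and what the server does with the body is exactly the
underspecified part):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        // Build a GET carrying a body; the spec's ambiguity is about what
        // the server should do with it, not whether a client can send it.
        req, err := http.NewRequest("GET", "https://httpbin.org/get",
            strings.NewReader(`{"hello":"world"}`))
        if err != nil {
            panic(err)
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }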

[1] <https://tools.ietf.org/html/rfc2616#section-4.3>

[2] <https://tools.ietf.org/html/rfc2616#section-9.3>

[3] <https://tools.ietf.org/html/rfc2616#section-9.8>

