
Node.js incorrectly parses HTTP methods - willlll
http://www.chmod777self.com/2013/08/sigh.html
======
chrismorgan
I've been working on implementing a solid HTTP library in Rust, currently at
[http://github.com/chris-morgan/rust-http](http://github.com/chris-morgan/rust-http). Servo has been using Joyent's HTTP parser and this is a
problem that I had observed with it. Yes, it is pretty badly implemented, but
it's not a mess because of the style of code—that's for performance. It's only
a mess because it's inconsistent and incorrect.

Reading the HTTP method in a high-performance way does lead to superficially
ugly code. That's why code generation is good. Ragel has been mentioned and I
intend to seriously consider using it, but for the moment my own HTTP method
reading code is generated with:

    
    
        generate_branchified_method(
            writer,
            branchify!(case sensitive,
                "CONNECT" => Connect,
                "DELETE"  => Delete,
                "GET"     => Get,
                "HEAD"    => Head,
                "OPTIONS" => Options,
                "PATCH"   => Patch,
                "POST"    => Post,
                "PUT"     => Put,
                "TRACE"   => Trace
            ),
            1,
            "self.stream.read_byte()",
            "SP",
            "MAX_METHOD_LEN",
            "is_token_item(b)",
            "ExtensionMethod(%s)");
    

This is pleasantly easy to read and meaningful.

This generates the high performance artwork shown at
[http://sprunge.us/HdTH](http://sprunge.us/HdTH), which supports extension
methods correctly. (Rust's algebraic data types are marvellous for many things
in implementing such a spec.)
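
For illustration, here is a hand-written Rust sketch of the shape of code such a generator expands to (the method set and `is_token_item` are modeled on the comment above, not taken from rust-http itself): branch on leading bytes, verify the full token, and fall back to a generic token scan so extension methods survive.

```rust
#[derive(Debug, PartialEq)]
enum Method {
    Get,
    Head,
    Put,
    Post,
    Patch,
    ExtensionMethod(String),
}

// Token characters per RFC 2616, simplified to the common subset.
fn is_token_item(b: u8) -> bool {
    b.is_ascii_alphanumeric() || matches!(b, b'-' | b'_' | b'!')
}

// Branch on leading bytes first, then verify the whole token; anything
// else that is still a valid token becomes an extension method.
fn parse_method(line: &[u8]) -> Option<Method> {
    let sp = line.iter().position(|&b| b == b' ')?;
    let tok = &line[..sp];
    let known = match tok.first().copied()? {
        b'G' if tok == b"GET" => Some(Method::Get),
        b'H' if tok == b"HEAD" => Some(Method::Head),
        b'P' => match tok.get(1).copied() {
            Some(b'U') if tok == b"PUT" => Some(Method::Put),
            Some(b'O') if tok == b"POST" => Some(Method::Post),
            Some(b'A') if tok == b"PATCH" => Some(Method::Patch),
            _ => None,
        },
        _ => None,
    };
    if known.is_some() {
        return known;
    }
    if !tok.is_empty() && tok.iter().all(|&b| is_token_item(b)) {
        return Some(Method::ExtensionMethod(
            String::from_utf8_lossy(tok).into_owned(),
        ));
    }
    None
}

fn main() {
    assert_eq!(parse_method(b"GET / HTTP/1.1"), Some(Method::Get));
    // Unlike the buggy parser in the article, "PUN" is not mistaken for PUT.
    assert_eq!(
        parse_method(b"PUN / HTTP/1.1"),
        Some(Method::ExtensionMethod("PUN".to_string()))
    );
    println!("ok");
}
```

The guards mean a failed prefix match falls through to the extension path rather than being misread as a known method.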

~~~
masklinn
> Ragel has been mentioned and I intend to seriously consider using it

Have you seen that there's a project for a Rust backend for Ragel[0] which
would allow direct reuse of the Mongrel HTTP 1.1 ragel spec?

[0] [https://github.com/erickt/ragel](https://github.com/erickt/ragel)

~~~
steveklabnik
As a Rust, Ruby, and HTTP enthusiast, this gets a hearty +1 from me as well.

It's been a while since I checked out the Ragel backend, does it still
generate decent Rust code?

~~~
masklinn
I have no idea, I've never used it.

------
kiwidrew
The most interesting part (to me) is that the server will handle e.g. "PUN" as
though it actually said "PUT". I wonder if this could be used as an attack
vector?

Sounds a lot like a "confused deputy" situation: imagine that your L7 firewall
has a rule to reject any PUT request, but it sees PUN and thus allows the
request to pass through to node.js, which then treats it as though it were
actually PUT.
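
A deliberately minimal Rust sketch of that failure mode (not the actual http_parser code): a parser that commits to a method from a short prefix and never verifies the rest of the token.

```rust
#[derive(Debug, PartialEq)]
enum Method {
    Get,
    Put,
    Post,
    Unknown,
}

// BUGGY by construction: the method is chosen from at most two leading
// characters and the remaining bytes are never checked, so "PUN" (and
// "PURGE", "PUTTER", ...) all come back as Put.
fn buggy_parse(tok: &[u8]) -> Method {
    match (tok.first().copied(), tok.get(1).copied()) {
        (Some(b'G'), _) => Method::Get,
        (Some(b'P'), Some(b'O')) => Method::Post,
        (Some(b'P'), _) => Method::Put,
        _ => Method::Unknown,
    }
}

fn main() {
    assert_eq!(buggy_parse(b"PUT"), Method::Put);
    // The confused-deputy case: a filter that rejects "PUT" passes "PUN",
    // but this parser treats them identically.
    assert_eq!(buggy_parse(b"PUN"), Method::Put);
    println!("ok");
}
```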

~~~
sktrdie
This is why you don't usually expose Node directly to the internet. You use
NGINX or others in front of it! It's 0.10 for christ's sakes.

~~~
delian66
It is now 0.10.xxx ... and before that it was 0.9; probably when it becomes
0.99, it will go to 0.100 and so on ... I highly recommend that anyone avoid
using Node in production until its developers grow up, become responsible for
a stable API, and stop changing their minds every 2 months.

~~~
jeswin
Do you have any links that point to "the developers changing their minds every
2 months"? How have you been affected? As a node user for 2 years, my view is
that the API has been fairly stable of late.

Please back this up with evidence.

------
hosay123
The level of bikeshedded micro-optimization going on in that file is
hilarious. The whole thing could be swapped out with a Ragel parser and nobody
would notice a thing.

~~~
jacobr
> To make a long story short: very old versions of http_parser were based on a
> Ragel grammar but the handwritten version outperformed it by a significant
> margin, as in over 10% fewer branches taken.

> Ragel is great but it can't take shortcuts like a human programmer can and
> once you start jumping around in your grammar, things become complicated
> fast (and impossible to debug.)

[https://github.com/joyent/http-parser/pull/156#issuecomment-...](https://github.com/joyent/http-parser/pull/156#issuecomment-22933119)

~~~
Filligree
So yeah, at least they get the wrong answer quickly.

------
chrismorgan
LINK was added _after_ HTTP/1.0, and _removed_ before HTTP/1.1.
[http://tools.ietf.org/html/draft-ietf-httpbis-method-registr...](http://tools.ietf.org/html/draft-ietf-httpbis-method-registrations-12)
is being cited, but that is still a draft and the registry it refers to does
not yet exist. I believe it is thus fair to say that LINK is not a standard
method?

~~~
jasnell
To be certain, whether or not LINK is supported is not really the issue. The
file is mishandling HTTP methods in general for the sake of "optimization".
Had the parser been written correctly in the first place, it ought to have
been trivial for someone to add support for new extension methods like LINK,
but since the parser is broken, it becomes significantly more difficult.

~~~
chrismorgan
Oh, to be certain. The HTTP spec does have extension methods; any token is
valid as a method name. (Implementing the spec in Rust has taught me a lot
about HTTP.) I guess I should avoid quibbling over the particular example and
focus on the bigger picture :-)

------
general_failure
To people saying this is all micro-optimization - have you guys actually
measured? If so, please post the numbers.

It's presumptuous to simply say 'all this is unnecessary' unless you have
measured it, and we have no reason to believe the author hasn't measured it.

BTW, the file is copyright nginx.

------
andreineculau
A few points to make:

- I value software that keeps to the spec, because it's the spec that I (as a
dev or non-dev) refer to. You never hear "NodeJS has an HTTPish module", nor
do you read documentation of that module's concepts and behaviour. Those are
defined in the spec, and the __fill_in_with_any_language__ HTTP module just
implements those definitions.

- Optimizations, simplifications, and corrections should be done in the spec,
whenever you find them at implementation-time.

But to date there has not been _ONE_ HTTP server that grasps and handles the
HTTP specs in their entirety. So I find it hilarious to read about
optimizations when none of us have the whole picture.

That said, I don't think Node.js is to blame here (albeit they do have weird
views of standards:
[https://github.com/joyent/node/issues/4850](https://github.com/joyent/node/issues/4850))
but HTTP itself, because the spec's abstraction levels have been far away from
the implementations' reach. HTTP's concepts are gorgeous, but they are worth
nil if implementation is "hard" and never done properly.

Longer story at: [http://andreineculau.github.io/hyperrest/2013-06-10-http-hel...](http://andreineculau.github.io/hyperrest/2013-06-10-http-hell-no/)

------
spullara
There are plenty of good ways to optimize this code. They didn't pick any of
them. What I find more surprising than anything is that they didn't just
optimize GET and handle the rest generically.
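
Sketched in Rust (hypothetical, not code from any real server), that suggestion might look like this: fast-path the common "GET " prefix with a single 32-bit compare, and fall back to a generic, spec-correct token scan for every other method.

```rust
// Fast path: compare the first four bytes ("GET " including the space)
// as one u32, then fall back to a generic token scan for everything else.
fn parse_method(line: &[u8]) -> Option<String> {
    if line.len() >= 4 {
        let head = u32::from_ne_bytes([line[0], line[1], line[2], line[3]]);
        if head == u32::from_ne_bytes(*b"GET ") {
            return Some("GET".to_string());
        }
    }
    // Generic path: any token up to the first space is a legal method name.
    // (Token characters simplified here to ASCII letters and '-'.)
    let sp = line.iter().position(|&b| b == b' ')?;
    let tok = &line[..sp];
    if !tok.is_empty() && tok.iter().all(|&b| b.is_ascii_alphabetic() || b == b'-') {
        Some(String::from_utf8_lossy(tok).into_owned())
    } else {
        None
    }
}

fn main() {
    assert_eq!(parse_method(b"GET / HTTP/1.1"), Some("GET".to_string()));
    assert_eq!(parse_method(b"LINK / HTTP/1.1"), Some("LINK".to_string()));
    println!("ok");
}
```

The common case costs one integer comparison, and every uncommon method still parses correctly instead of being guessed from a prefix.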

------
sandfox
It should probably be noted that: a) you don't have to use node's built-in
HTTP server (yeah, I know, nearly everyone will); you are more than free to
write your own or use one from its module repository (npm). b) The entire
HTTP module is currently being overhauled in the 0.11.x development branch
and the changes should appear in the stable 0.12.x releases.

Out of interest, has anyone seen what other web servers' support for these
more 'esoteric' verbs is like?

~~~
sgustard
Nginx doesn't handle LINK:

[http://lxr.evanmiller.org/http/source/http/ngx_http_parse.c](http://lxr.evanmiller.org/http/source/http/ngx_http_parse.c)

~~~
rwg
For most of 2004, nginx didn't handle POST, either:

[http://trac.nginx.org/nginx/browser/nginx/src/http/ngx_http_...](http://trac.nginx.org/nginx/browser/nginx/src/http/ngx_http_parse.c?rev=da8c5707af39219683be654f6534bf14a97ef21c#L77)

~~~
ceol
Ah, the POTT method. Perhaps it's part of the HTCPCP?[0]

[0]:
[http://en.wikipedia.org/wiki/Hyper_Text_Coffee_Pot_Control_P...](http://en.wikipedia.org/wiki/Hyper_Text_Coffee_Pot_Control_Protocol)

------
nkuttler
Nice catch; reading the nodejs code is really discouraging. But the post
points to the relevant specs, so I guess this should be fixed sooner rather
than later. Initial response looks good:
[https://github.com/joyent/node/issues/6078](https://github.com/joyent/node/issues/6078)

------
lnanek2
Not a big deal. Even the REST people have long been just sending normal GET
and POST and including some indicator that they really want it treated as some
other verb.

~~~
steveklabnik
That has to do with HTML, not pure HTTP. Every server I'm aware of that does
that also supports the actual verb as well.

------
tszming
No one remembered the conversations between Ryan Dahl and Zed Shaw on the http
parser two years ago?

[1]
[https://news.ycombinator.com/item?id=2549403](https://news.ycombinator.com/item?id=2549403)

[2]
[https://twitter.com/zedshaw/status/15714602817](https://twitter.com/zedshaw/status/15714602817)

~~~
j_s
Interesting how temporary Twitter winds up being... most of the links on the
Hacker News discussion are now dead!

------
bslatkin
Yet another reason to remove verbs from the HTTP spec:
[http://www.onebigfluke.com/2013/08/lets-remove-verbs-from-ht...](http://www.onebigfluke.com/2013/08/lets-remove-verbs-from-http-20.html)

~~~
iso-8859-1
Let's remove software, there are bugs in some of it.

~~~
PommeDeTerre
It's disappointing to see your kind of response here.

It starts when somebody makes a legitimate observation that a very specific
set of functionality has been shown to have serious flaws in practice, and
then suggests that perhaps the best course of action is to remove this very
specific, and broken, functionality.

Then somebody else comes along, and responds like you did with a smart-ass
comment taking it to an overly-broad, unreasonable and stupid extreme. These
kinds of comments are useless.

Sometimes the best way to fix broken functionality is to remove it. That in no
way means it's the only possible solution, however. Nor does it mean that it
needs to be applied without bound.

~~~
pessimizer
> It starts when somebody makes a legitimate observation that a very specific
set of functionality has been shown to have serious flaws in practice, and
then suggests that perhaps the best course of action is to remove this very
specific, and broken, functionality.

I missed that. I saw a link that says that we shouldn't have verbs because no
one implemented them, LOL SPACEJUMP.

Do you really think that it's a "legitimate observation" that because node.js
has a shitty http parser, the parts of http that it parsed badly should be
thrown out?

I read two smart-ass comments here, and a civility troll.

------
lamby
Hm, didn't nginx do something similarly "ugly" for a while? Like inspecting
the (completely making it up here) second character to see if it's 'E'?

A quick look now suggests they are using a parser generator now.

~~~
Wilya
The nodejs source credits [0] as inspiration. It seems that nginx actually
does something like that: first, it switches on the length of the verb, then
it tries to compare the string (using the ngx_str6cmp() and similar macros) to
the ones it knows how to handle.

When the length of the verb is 4, it checks whether the second character is
'O' before trying the full string comparisons, probably because there are
four possibilities with it (POST, COPY, MOVE, LOCK) vs only one without
(HEAD).

It starts at line 180.

That being said, in nginx's case it's a _preliminary_ check, just to avoid
doing some full string comparisons in a very specific case. The nodejs
parser, on the other hand, directly does a 'parser->method = HTTP_CONNECT'
after checking one character. It seems like it actually checks the other
characters afterwards in some cases, but this assignment is quite misleading.

[0]
[http://hg.nginx.org/nginx/file/abf7813b927e/src/http/ngx_htt...](http://hg.nginx.org/nginx/file/abf7813b927e/src/http/ngx_http_parse.c)

~~~
shortcj
Why not do a bit check optimization instead of looking at whole characters?

~~~
Wilya
I'm not an expert, but I don't think you can do pure bit checks on a modern
CPU. It likely compares a whole char in a single hardware instruction anyway.

Actually, if you look at the nginx source I linked, the comparisons of the
full strings are hidden behind macros because they do the opposite: they
group characters four at a time and do 32-bit int comparisons as much as
possible.
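
A Rust sketch of that nginx-style trick (helper names hypothetical; nginx's real versions are the ngx_str*cmp macros mentioned above): pack four characters into a u32 so each candidate costs one integer compare, with the second-character pre-check splitting HEAD from the four 'O' methods.

```rust
// Word-compare sketch: four bytes packed into a u32, one integer
// comparison per candidate instead of a bytewise strcmp.
fn word(b: &[u8; 4]) -> u32 {
    u32::from_ne_bytes(*b)
}

// Length-4 methods only: the second-character check routes HEAD away
// from the four candidates whose second character is 'O'.
fn match_len4_method(m: &[u8]) -> Option<&'static str> {
    if m.len() != 4 {
        return None;
    }
    let w = word(&[m[0], m[1], m[2], m[3]]);
    if m[1] == b'O' {
        for (cand, name) in [
            (b"POST", "POST"),
            (b"COPY", "COPY"),
            (b"MOVE", "MOVE"),
            (b"LOCK", "LOCK"),
        ] {
            if w == word(cand) {
                return Some(name);
            }
        }
    } else if w == word(b"HEAD") {
        return Some("HEAD");
    }
    None
}

fn main() {
    assert_eq!(match_len4_method(b"POST"), Some("POST"));
    assert_eq!(match_len4_method(b"HEAD"), Some("HEAD"));
    // Crucially, a near-miss is rejected, not silently accepted.
    assert_eq!(match_len4_method(b"POSX"), None);
    println!("ok");
}
```

Note that the full word is always compared, so unlike the node bug, "POSX" cannot masquerade as "POST".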

~~~
shortcj
An immediate value stored in an instruction would double performance over an
L1 data cache reference... but this is stated just to mock people on this
thread who want 'almost' machine-level logic to match robust user-land
permutations.

------
smackfu
I've seen this kind of thing before. Someone falls in love with their
optimization ("we can parse GET with a single int switch!") and when it is
pointed out that it doesn't handle the spec correctly, and that handling the
spec correctly would make this as slow as the naive way, it still ends up in
the code because "it's good enough."

------
verroq
Why not use a trie?

~~~
kevingadd
Kind of beside the point, really - this kind of micro-optimization doesn't
actually improve performance to the extent you want, especially when you
consider the downsides.

You can improve HTTP throughput far more dramatically with higher-level
optimizations, like moving request parsing into the kernel or otherwise doing
things that address larger bottlenecks involved in getting HTTP requests from
the network card. The Windows version of Node performs a lot better if you
have it use Microsoft's kernel mode HTTP interfaces than if you use the crazy
micro-optimized HTTP support in Node that is responsible for this bug.

Ultimately, the prologue in an HTTP request - 'HTTP/1.1 GET' or whatever - is
a tiny string of bytes, and parsing it, even in the slowest language on the
planet, really isn't that expensive. Optimizing that tiny bit of work isn't
going to produce enormous gains. Mostly it's just an ill-advised attempt at
being clever.

~~~
MichaelGG
Having written a high-performance (1Gbps+ on a 2nd-gen i7, loopback) SIP
packet capture/parser, I'll just disagree: such optimization of a text-based
protocol is not always a waste of time or an attempt at being clever. Doing
things like comparing strings against 4- or 8-byte integers can show
remarkable improvement, especially for the common methods.
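
For instance (a hypothetical Rust sketch, not MichaelGG's code): "OPTIONS" plus its trailing space is exactly eight bytes, so a single u64 comparison can stand in for an eight-byte memcmp.

```rust
// One u64 compare instead of an eight-byte memcmp; "OPTIONS " (with the
// trailing SP) is exactly eight bytes, so the length check is all the
// setup required.
fn starts_with_options(buf: &[u8]) -> bool {
    buf.len() >= 8
        && u64::from_ne_bytes(buf[..8].try_into().unwrap())
            == u64::from_ne_bytes(*b"OPTIONS ")
}

fn main() {
    assert!(starts_with_options(b"OPTIONS * HTTP/1.1"));
    assert!(!starts_with_options(b"GET / HTTP/1.1"));
    println!("ok");
}
```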

HTTP (and SIP) have idiotically complex grammars for no reason other than the
authors getting all clever on us because it's a "text based" format. I'd be
shocked to find that HTTP is properly implemented in most cases (fully
handling comments and line folding, for example).

And then, even if you do implement it properly, you have to worry about other
software interpreting it differently. So Proxy A determines a request is
allowed and appends a header, but Server B reads the request in a different
way and violates policy.

------
muloka
I imagine this could be used to kill a running instance of node.js in
production?

~~~
ssafejava
No, it won't kill the process; it will simply terminate the connection. From
the C code, it looks like it was intended to throw a 405, but somewhere in
the process it screws up and terminates early.

~~~
muloka
Ah thanks for clearing that up.

------
Xorlev
At least they're looking at fixing it.

~~~
myhf
Or at least they were looking at fixing it 8 months ago.
[https://github.com/joyent/http-parser/pull/145](https://github.com/joyent/http-parser/pull/145)

------
escaped_hn
Clearly a case of premature optimization. What is the overhead of testing
equality on one char vs the entire string?

~~~
michaelmior
How is it clear? It's quite possible the author measured a naive
implementation and found bottlenecks there to optimize. If method parsing was
a bottleneck, I could see the overhead of string comparison being quite
significant. Single character equality is one instruction vs. a dozen or two
(including a loop) for string equality.

~~~
DannyBee
For string equality on most of these methods, it's not a dozen.

If your goal is really optimization, you'd do it as a series of vector
operations (GCC will let you do this in a platform independent way for almost
all of these methods), using mm_cmpeq_*, which will do 4 chars at a time, or 8
chars at a time with AVX/etc.

In reality, strncmp is likely to be a lot more optimized than you think, and
in newer glibc, it will do 16 chars at a time using SSE4.2/AVX/whatever (see
[http://repo.or.cz/w/glibc.git/blob/ba63ba088227987c8215e93f7...](http://repo.or.cz/w/glibc.git/blob/ba63ba088227987c8215e93f7fd3e9a4e08abce3:/sysdeps/x86_64/multiarch/strcmp-sse42.S)).

The short answer is: I'd love to see benchmarks that show this as a real
bottleneck.

~~~
pcwalton
FYI, nginx compares 32 bits at a time for HTTP methods on CPUs where
unaligned reads are supported. I imagine if anyone has benchmarked it, it's
nginx.

~~~
DannyBee
Again, maybe? It's very easy to do more than 32 bits at a time, even on CPUs
not supporting vector instructions.

