
Lwan: Experimental, scalable, high performance HTTP server - tombenner
https://github.com/lpereira/lwan
======
mmastrac
I had to Google "rebimboca da parafuseta" out of curiosity. Apparently it's a
Brazilian term roughly analogous to "reticulation of the splines".

~~~
acidx
Author here. This gave me a chuckle. :)

And, yes, that's pretty much it, although "rebimboca da parafuseta" is less
obscure than "reticulation of the splines" (if you're Brazilian, anyway).

It is used to denote a fictitious part of a car engine or any other machine,
whose name or function is unknown. It was coined in a 70s TV show, and later
used in some ads aired during the same period; since then, it's been an
expression used for humorous effect. It's not very common these days, but it's
unlikely you'll meet someone down here who has never heard it.

~~~
mutagen
Aha, not unlike the retro-encabulator:

[https://www.youtube.com/watch?v=RXJKdh1KZ0w](https://www.youtube.com/watch?v=RXJKdh1KZ0w)

~~~
emmelaich
Which is of course a riff on the original turbo-encabulator:
[https://www.youtube.com/watch?v=Ac7G7xOG2Ag](https://www.youtube.com/watch?v=Ac7G7xOG2Ag)

------
codingbeer
For more information on some of the C magic behind this well-written piece of
software, check out the author's blog[1]. It should be pretty interesting for
any systems programmer.

[1]:
[http://tia.mat.br/blog/html/index.html](http://tia.mat.br/blog/html/index.html)

------
nwmcsween
A few issues:

* rawmemchr - it _might_ be faster as it doesn't have to decrement the size_t, but this is only mildly relevant for lwan, as many instances of rawmemchr use are simply rawmemchr(ptr, '\0'), which is exactly the same as ptr + strlen(ptr) and even less optimized.

* pthread_tryjoin_np - __linux__ is defined by gcc, not glibc; you should check for __GLIBC__ if you want to use glibc-specific functions.

* underscore-prefixed functions - pedantic, I know, but identifiers with a leading underscore are reserved for the implementation.

~~~
acidx
These are easy things to fix: feel free to issue a few pull requests. :)

Regarding rawmemchr(): both are pretty well optimized. Both are implemented in
glibc using the same technique (reading a byte at a time until it is aligned,
then moving to multibyte reads). strlen() might be faster, yes, considering
that the implementation can hardcode some magic numbers. In other words: some
micro benchmarks might help decide here.

Regarding __linux__ vs. __GLIBC__: Lwan works with some alternative libcs
(such as uClibc), so relying on __GLIBC__ being defined for things like this
doesn't seem like a good idea. In any case, since Lwan isn't portable anyway,
one can just assume it is always running on Linux and get rid of these
#ifdefs.

------
e12e
Any relationship to G-WAN[g]? I can't see any mention of it in the readme or
on the web page (perhaps I didn't look hard enough), but there appears to be
some resemblance (use as a C web framework, fast web server, etc.)?

[g][http://gwan.com/](http://gwan.com/)

~~~
acidx
Apart from sharing the "wan" suffix and choice of language, there's no
relationship whatsoever.

------
hendzen
"Hand crafted HTTP request parser" \- hard to see how this is really faster
(and less bug-prone) than generating one with Ragel.

~~~
nly
Much of the advantage of using a DFA generator like Ragel is lost because the
HTTP header grammar is actually ambiguous in several places and can't be
streamed. You can use it as a component, but it isn't entirely sufficient on
its own.

The HTTP 1.1 RFC requires whitespace be stripped at the end of header values,
yet also permits (although deprecates) header folding, giving rise to the
following ambiguity (using _'s in place of leading spaces):

        Foo: Hello\r\n
        ______\r\n
        ________\r\n
        ________ world!\r\n
        Bar: smeg

This requires that the parser buffer all the whitespace between 'Hello' and
'world!' (and the RFC doesn't put a standard limit on header value length)
just in case 'world!' never comes and the value of the Foo header has to be
stripped back to just "Hello".

Here's a related observation from a commit[0] by the Joyent guys, who wrote
the streaming parser used by NodeJS: "For http-parser itself to confirm[sic]
exactly would involve significant changes in order to synthesize replacement
SP octets. Such changes are unlikely to be worth it to support what is an
obscure and deprecated feature"

Another example is parsing the Request and Status lines:

        GET <uri> HTTP/1.1

Technically _<uri>_ can't contain spaces, but the RFC says you MAY accept
them [RFC7230: 3.5. Message Parsing Robustness] ... which then gives rise to
the possibility of _<uri>_ containing the literal string " HTTP/1.1", and
ultimately opens the door to header injection by bad user agents that send
spaces.

Resolving these ambiguities requires implementing your own buffering, and
dropping down to Ragel's 'state charts' feature to avoid your semantic actions
being munged... which leaves you to design the top-level state machine
yourself.

[0] [https://github.com/joyent/http-parser/commit/5d9c3821729b194eef60f41fcc5f8b4657c3d8ff](https://github.com/joyent/http-parser/commit/5d9c3821729b194eef60f41fcc5f8b4657c3d8ff)

~~~
oever
[https://www.ietf.org/rfc/rfc2616.txt](https://www.ietf.org/rfc/rfc2616.txt) :
"All linear white space, including folding, has the same semantics as SP." So
only one space needs to be counted, and any subsequent space or horizontal tab
can be ignored. Your example simplifies to:

        Foo: Hello world!\r\n
        Bar: smeg

~~~
nly
It doesn't matter that the obs-fold can be treated as a single space because
you don't know that until you've actually reached the fold. Consider:

        Foo: Hello_______world\r\n

Here the whitespace has to be preserved, as per the _field-content_
production.

So RFC2616 allowed:

        Foo: Hello_______\r\n
        ____world\r\n

to be reduced to:

        Foo: Hello world\r\n

Incidentally, RFC7230 (its successor) says something subtly different:

"A server that receives an obs-fold in a request message that is not within a
message/http container MUST either reject the message by sending a 400 (Bad
Request), preferably with a representation explaining that obsolete line
folding is unacceptable, or replace each received obs-fold with one or more SP
octets prior to interpreting the field value or forwarding the message
downstream."

So now it's been loosened to one _or more_ SP octets... useful, except there's
no mention of _TAB_ octets, which can follow a CRLF as part of an obs-fold...
so you still can't just remove the CRLF and preserve _all_ the whitespace...
preserving tabs would be illegal. So the new rules don't help streaming
either. Joyent's parser did this regardless (not sure if it still does).

You'll also notice it says the _obs-fold_ can be replaced with a sequence of
SP characters... according to the grammar that's the CRLF and the _following_
whitespace, _not_ any whitespace _preceding_ the CRLF. You'd think that would
be helpful because it means any whitespace before the CRLF can always be
streamed as part of the field value, right? Except...

        Foo: Hello_______\r\n
        ____\r\n

would still simplify to:

        Foo: Hello_______[one or more SP octets]\r\n

and then what? Well, presumably you then have to trim off the trailing
whitespace to produce a value of just "Hello"... so all the whitespace you
buffered before you reached the CRLF (in case you reached another _field-
vchar_, like the 'w' in 'world') has to be discarded.

~~~
Retric
Assuming you're reading this from a buffer, and most of the time you don't
need to count: store an index to the 'o' in 'Hello', then ignore the
whitespace, and go back and count it only if you need to. Or, better yet, if
the whitespace isn't needed, just copy a single space; if it is needed, move
those bytes to the next stage and don't bother counting them that go-around.

~~~
nly
... and what happens if your header happens to run off the end of your buffer
amongst the spaces that follow 'Hello'? Cookies can be huge; I can construct a
header like 'Cookie: x[8000 spaces]y'.

~~~
Retric
This gets into the specifics, so you almost need to know the code you're
dealing with to make a relevant suggestion.

However, the easy option is to default to the slow code path on edge cases;
considering they're probably rare enough, it's not important to make them fast
as long as they're bug-free. IMO, optimizations are always a balancing act
between minimizing the computer's effort and minimizing the coder's effort
while trying to maintain long-term readability. But you can always keep track
of more than one buffer, so the option is there.

------
djcapelis
> "Hand-crafted HTTP request parser"

Uh oh.

------
zongitsrinzler
How does this compare to Nginx?

~~~
sauere
Pretty good, I'd say about 50% faster on raw request speed. Anyway, it isn't a
fair comparison given that nginx's feature set is much larger.

------
nwmcsween
Nice. Benchmarks vs. the competition would be interesting as well. Also, with
10K+ idle connections I would be more worried about kernel-space memory
requirements (maybe recommend sysctl.conf changes).

~~~
donavanm
Nah, it's a couple KB per connection. The biggest consumer would be the TCP
socket control structs and associated data buffers. Ballpark 1.5KB for the
structs and another 4-16KB for TCP buffers on a typical internet TCP
connection.

~~~
nwmcsween
But the variables controlling how long a TCP socket is held, or whether
sockets are reused, are controlled by the kernel.

~~~
donavanm
I think you're talking about time-wait states et al. On Linux that's dominated
by the MSL, which is a compile-time constant of 60 seconds. You mentioned
sysctls; those are primarily tcp_tw_reuse and tcp_tw_recycle (which is the
world's worst sysctl). Regardless, it's a couple KB per connection. How many
hundred thousand do you want to support?

------
abionic
Can't spot an _OSS license_.

What license is intended for it?

~~~
acidx
GPLv2 (or later) at the moment. Might change to LGPLv2 (or later) soon,
though.

~~~
CCs
Any chance for MIT or BSD?

We can't use GPL or LGPL at the company, everything is statically linked.

~~~
acidx
No chance for MIT or BSD, although LGPL with a static linking exception might
be doable.

------
jagger27
Any roadmap for HTTP2 support?

~~~
acidx
Not planned ATM. Still need to read more about it before giving it a go.

------
imaginenore
Nginx can do 500K to 1 million req/s

[http://lowlatencyweb.wordpress.com/2012/03/26/500000-requestssec-piffle-1000000-is-better/](http://lowlatencyweb.wordpress.com/2012/03/26/500000-requestssec-piffle-1000000-is-better/)

So can Google's compute engine on a $10 instance:

[https://news.ycombinator.com/item?id=6804897](https://news.ycombinator.com/item?id=6804897)

[http://googlecloudplatform.blogspot.ca/2013/11/compute-engine-load-balancing-hits-1-million-requests-per-second.html](http://googlecloudplatform.blogspot.ca/2013/11/compute-engine-load-balancing-hits-1-million-requests-per-second.html)

~~~
bhauer
The tests at lowlatencyweb.wordpress.com were conducted without network
connectivity—the load generator (wrk) was running on the same host as the web
server. The results are 500K RPS for localhost connections with standard
keep-alive and 1M RPS for localhost connections with pipelining. This is using
a server with 24 HT cores, and it's not clear to me what the response body was.

Google's Compute Engine test was using 200 virtual servers, but it does
include network connectivity. The response body is a single byte. Their blog
entry is a celebration of the performance of their load balancer more than a
statement about the performance of each VM.

In March, we were able to exceed 1M requests with network connectivity and
without pipelining to a single server [1]. Our project is not testing static
web servers, so we don't test with plain nginx; but I expect nginx would also
exceed 1M RPS in this hardware environment. This was using a server with 40 HT
cores and a single-byte response body.

Similarly, a highly tuned web server such as OP's Lwan should be expected to
exceed 1M RPS (network-connected) on a 40 HT core server. 1M RPS with small
response payloads is fairly easy on modern hardware.

Incidentally, we see 6M+ RPS with pipelining in our Round 9 plaintext results
[2].

[1] [http://www.techempower.com/blog/2014/03/04/one-million-http-rps-without-load-balancing-is-easy/](http://www.techempower.com/blog/2014/03/04/one-million-http-rps-without-load-balancing-is-easy/)

[2]
[http://www.techempower.com/benchmarks/#section=data-r9&hw=pe...](http://www.techempower.com/benchmarks/#section=data-r9&hw=peak&test=plaintext)

~~~
imaginenore
If you want to go further, you need to get rid of the OS:

40 million req/sec with Lua

[http://highscalability.com/blog/2014/2/13/snabb-switch-skip-the-os-and-get-40-million-requests-per-sec.html](http://highscalability.com/blog/2014/2/13/snabb-switch-skip-the-os-and-get-40-million-requests-per-sec.html)

~~~
justincormack
That's not a web server; it's doing packet processing, which is a different
problem. You could connect a web server to it, but this isn't a benchmark of
that.

