

Show HN: Hellepoll is a blazingly-fast async HTTP server written in C++ - willvarfar
https://github.com/williame/hellepoll
This is based on an approach I developed for a very high-performance message and video multiplexing server.

The project died, but I'm glad I get to share a basic version with some HTTP test code now.

I admit this is the second time I've tried to show this off here today; I posted earlier but it got lost when no-one was watching :(
======
k33n
Cool stuff. I assume you wrote this for your own personal gratification and
not to compete with Nginx or something. What was the coolest thing you learned
in putting this together?

------
apaprocki
You should not be #include-ing headers inside of an extern "C" block. It is up
to those headers to:

    
    
      #ifdef __cplusplus
      extern "C" {
      #endif

~~~
humbledrone
And, if those headers do not already wrap themselves in their own extern "C"
block, wrapping around the #include is sometimes a quick fix (e.g. until the
upstream authors merge your pull request).
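
A rough sketch of that workaround, assuming a hypothetical C header "legacy_api.h" that doesn't provide its own guards:

    // Wrap the #include itself so the C declarations get C linkage when
    // compiled as C++; harmless when compiled as plain C.
    #ifdef __cplusplus
    extern "C" {
    #endif
    #include "legacy_api.h"
    #ifdef __cplusplus
    }
    #endif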

~~~
DiabloD3
This.

Windows porters frequently have to do this to deal with the fact that Microsoft
still refuses to produce a C compiler and instead makes their existing C++
compiler do double duty.

------
tewks
I'm curious as to why the author hasn't referenced nginx, which is also
event-based (epoll/kqueue) and written in C.

~~~
givan
Probably because it is much slower than nginx; C++ adds extra "fat" that nginx
(written in C) doesn't have.

~~~
sigil
Not sure about C++ overhead there. C++ often edges out C in the Alioth
benchmarks. More to the point, I highly doubt the author's parser ("parser")
in http.cpp is faster than the one in nginx, which really is a thing of
beauty.

[http://trac.nginx.org/nginx/browser/nginx/trunk/src/http/ngx...](http://trac.nginx.org/nginx/browser/nginx/trunk/src/http/ngx_http_parse.c)

~~~
kingkilr
I really love the trick of efficiently reading 4 chars out of a string and
checking them. I like it so much I've been working towards making it happen
automatically in PyPy, so that if you write something like:

    
    
        if buf[i:i+4] == "POST":
    

the JIT automatically turns that into a MOVL + CMP + JMP. The magic of high
level languages :)
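
For reference, the underlying trick in C/C++ is to load four bytes and compare them as a single 32-bit integer; a rough sketch of the idea (not nginx's actual macro) is:

    #include <cstdint>
    #include <cstring>

    // Compare four bytes of the buffer against a four-character literal as a
    // single 32-bit comparison; memcpy sidesteps alignment issues and compiles
    // down to a plain load on common targets.
    static inline bool str4cmp(const char *p, char a, char b, char c, char d) {
        std::uint32_t got, want;
        std::memcpy(&got, p, sizeof got);
        const char lit[4] = {a, b, c, d};
        std::memcpy(&want, lit, sizeof want);
        return got == want;  // one MOV + CMP once inlined
    }

    // usage: if (str4cmp(buf + i, 'P', 'O', 'S', 'T')) { ... }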

------
kqueue
I believe you are testing requests / sec and not replies / sec. While
requests / sec matters a bit, you'll most probably be bottlenecking on your
context switches. What matters is replies / sec; that is a more accurate
measurement of your server's performance rather than of OS bottlenecks. I'd use
httperf for this.

~~~
willvarfar
I am confused. What context switches?

~~~
kqueue
Measuring requests per second is close to measuring the number of accept
system calls you can do. Since it's a system call, you have a context switch.

~~~
jerf
I would love to see "someone" do that benchmark: using epoll, accept as many
sockets as possible, stuff one byte down them (for the sake of argument), and
then properly shut them down.
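
A bare-bones sketch of such a benchmark server (Linux-only, error handling mostly omitted; port 8080 is an arbitrary choice):

    // Accept connections as fast as possible, write one byte, shut them down.
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (sockaddr *)&addr, sizeof addr);
        listen(lfd, SOMAXCONN);

        int ep = epoll_create1(0);
        epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = lfd;
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        epoll_event events[64];
        for (;;) {
            epoll_wait(ep, events, 64, -1);
            // Drain the accept queue: one byte down each socket, then a
            // proper shutdown and close.
            for (;;) {
                int cfd = accept(lfd, nullptr, nullptr);
                if (cfd < 0) break;  // EAGAIN: queue is empty
                char byte = 'x';
                write(cfd, &byte, 1);
                shutdown(cfd, SHUT_RDWR);
                close(cfd);
            }
        }
    }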

I'm curious how close to the limit of the underlying poll technology we're
getting, because it seems like everyone is converging in roughly the same
area, in an order-of-magnitude sense.

I also find myself wondering about the implications of these artificial
benchmarks and the stuff Zed discussed here:
<http://sheddingbikes.com/posts/1280829388.html>

~~~
kqueue
Request rate is not influenced much by the polling mechanism
(epoll/kqueue/poll/select) when all you're doing is listening on one fd and
processing new connections then closing them immediately. These multiplexers
matter when you are working with lots of file descriptors.

It matters because select has to iterate over all the file descriptors you
passed to it, while kqueue, for example, has knotes registered with it, so when
it wakes up it doesn't need to iterate over everything, just the knotes it got.
Not to mention the data copying from userland to the kernel in the case of
select/poll.

Back to the request rate: the limitation mainly comes from the number of system
calls you can execute per second (accept() in this case), which is heavily
influenced by context switches.

Looking at the application's CPU usage gives you a very good idea of what your
bottleneck is. If your CPU is at 100% then it's clear your application is
hitting the limit, but if the application's CPU is at 30% and you cannot
process more requests / sec, then you've hit a system limit.

~~~
willvarfar
Right, there is indeed a mode switch to read/write from/to the tcp buffer or
accept a connection. Luckily mode switches are not context switches - which
are typically massively more expensive - as only one thread is ever involved.

Hellepoll uses 'epoll', which is the Linux equivalent of kqueue. Kqueue is said
to be marginally faster still, and I look forward to Hellepoll using kqueue
when running on FreeBSD. Epoll/kqueue absolutely affect the accept rate,
incidentally, so it's an exercise for the reader to work out why ;)

Hellepoll does use the fancier features of epoll, like 'edge triggering', which
is perhaps one of the things making it nudge ahead of Java's NIO-based
webservers (as NIO is lowest-common-denominator and lacks ET).
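
For reference, edge triggering only reports each readiness transition once, so the read handler has to drain the socket until EAGAIN; a minimal sketch of that pattern (not Hellepoll's actual code):

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <cerrno>

    // Register a connection for edge-triggered readability.
    void watch_edge_triggered(int epfd, int fd) {
        epoll_event ev = {};
        ev.events = EPOLLIN | EPOLLET;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    // With EPOLLET the kernel won't report the fd again until new data
    // arrives, so keep reading until EAGAIN; otherwise leftover bytes would
    // never wake us up again.
    void on_readable(int fd) {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) continue;                // hand buf off to the parser here
            if (n == 0) { close(fd); return; }  // peer closed the connection
            if (errno == EAGAIN || errno == EWOULDBLOCK) return;  // drained
            close(fd);                          // real error
            return;
        }
    }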

Finally, Hellepoll is really writing meaningful bytes, and it even flushes
them on keep-alive connections (obviously; how would it ever work otherwise?).

So I think, on balance, Hellepoll is the real thing and not just measuring how
quickly a big backlog on a listening socket can fill up ;)

On Linux, I've found the new 'perf timechart' a lot of fun.

~~~
kqueue
Network IO, disk IO, scheduling, locks, etc. all trigger context switches and
not only a mode switch, because only the kernel is allowed to manipulate data
structures related to mbufs, the vfs, and whatnot.

When I said epoll/kqueue doesn't affect the accept rate, I was replying to a
specific request to write a program that just does accept/reply/shutdown. In
that case you are passing one fd to the poller, so it won't matter much which
poller you are using.

My original point is that you are measuring the wrong thing; it's not a
criticism of hellepoll. Requests / sec can be much higher than replies / sec
because requests get buffered waiting for you to accept them, and once
accepted, a request is counted. What matters, though, especially to the HTTP
client, is how long it takes to serve the connection from start to finish,
hence the reply rate.

fyi, you can use libev for poller portability.

You can experiment with this command and look at the reply rate:

    httperf --num-conns=10000000 -vv --num-calls=1 --port=<your_port>

~~~
willvarfar
1) yes I've used libevent and libev and others in the past

2) ab does wait for each request to complete before sending the next one

3) you really are flat wrong when you don't make a distinction between mode
switching and context switching

4) conclusion: with a name like yours, you must be trolling

~~~
kqueue
I didn't say there is no distinction. I said that system calls that require
disk IO or network IO, or that trigger locking/sleeping, require a context
switch. Calls like read/write/accept trigger context switches (which start with
a mode switch). The switch is required for the kernel to execute the system
call and operate on its own data structures. Only the kernel can alter mbufs
in this case.

Go read your OS book.

I am not trolling. You just lack experience and this sounds new to you.

~~~
willvarfar
Thank you for making me challenge my assumptions and memory.

I've asked around a bit and am fairly sure of my facts again, and my
understanding is:

* kernel mode is cheaper than ever to reach; with SYSCALL/SYSENTER etc. it's not even an interrupt, and there are no hardware threads or anything involved

* in kernel mode, the thread can get straight at the buffer and the locks that protect it; there is nothing that we'd call a 'context switch' in there

* seeing as this seems to be what is meant by a monolithic kernel, surely it's the same on FreeBSD too?

~~~
kqueue
I totally agree that SYSCALL/SYSENTER/SYSRET are very cheap to execute. But
these instructions only take care of the ring switch and are not executed
alone.

When you make a system call, a trap is issued that causes the hardware switch
to kernel mode.

The hardware pushes the PC and status word onto the per-process kernel stack,
and the kernel code takes care of saving the registers, ESP, etc. This is
called a task context switch. Context switching between processes is much more
expensive, but a task context switch is still considered a context switch.

When you make a system call, it's still much more expensive than most of the
work you are doing in your program, hellepoll, and hence it's your bottleneck.
This is why you don't see your process's CPU at 100%.

On a related note, whenever you have a program doing a lot of network IO, you
are essentially causing a lot of process context switches, because each time
data arrives on the wire the kernel needs to handle the hardware interrupt.

~~~
willvarfar
So we only disagree on edge-case terminology. I think you're trying to worm out
of your misclassification, but no worries.

Yes, to get these numbers I have had to minimise syscalls. That's the advantage
of hellepoll. I wrote the HTTP server just to have something to release;
before, it was an RTMP server, but that was commercial and couldn't be
released. It had write buffers on the user side too, which helped even more.
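
As an illustration of what a user-side write buffer buys you, here is a rough sketch (a plain vector-backed buffer, not the actual code from either server) of coalescing small writes so that one write() syscall can cover many queued bytes:

    #include <unistd.h>
    #include <cstddef>
    #include <vector>

    // Small appends accumulate in user space and are pushed to the kernel
    // with a single write() call, so one syscall covers many queued responses.
    class WriteBuffer {
    public:
        explicit WriteBuffer(int fd) : fd_(fd) {}

        void append(const void *data, std::size_t len) {
            const char *p = static_cast<const char *>(data);
            buf_.insert(buf_.end(), p, p + len);
        }

        // Returns true once everything has reached the kernel; leftover bytes
        // stay buffered if the socket would block.
        bool flush() {
            std::size_t off = 0;
            while (off < buf_.size()) {
                ssize_t n = write(fd_, buf_.data() + off, buf_.size() - off);
                if (n <= 0) break;
                off += static_cast<std::size_t>(n);
            }
            buf_.erase(buf_.begin(), buf_.begin() + off);
            return buf_.empty();
        }

    private:
        int fd_;
        std::vector<char> buf_;
    };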

And I understand what ab and httperf test, and yes, I am counting served pages.

Finally, I spend a lot of time staring at Linux perf reports and timecharts.

------
enduser
ULib is a mature and well-considered C++ framework for developing high-
performance applications that includes a blazingly-fast async HTTP server, with
more batteries included.

<https://github.com/stefanocasazza/ULib>

------
pilooch
How does it compare to libevent evhttp server? Besides being written in C++...

------
WALoeIII
I immediately wanted to fork this and have it just return "Allan!" instead of
"Hello World".

<http://www.youtube.com/watch?v=xaPepCVepCg>

