
Why events are a bad idea for high-concurrency servers (2003) [pdf] - gpderetta
https://people.eecs.berkeley.edu/~brewer/papers/threads-hotos-2003.pdf
======
pavlov
Worth noting this is from 2003. The performance concerns of event-based
servers have been greatly alleviated by both hardware and software
advancements.

The test setup used for this paper was a "2x2000 MHz Xeon SMP with 1 GB of RAM
running Linux 2.4.20". Solid PC server iron for 2003, but basically equivalent
to a $5/month server from Digital Ocean today.

If you're looking to squeeze 100,000 concurrent tasks from that $5 server,
this paper is relevant to you.

~~~
cryptonector
C10K is another name for evented I/O, and it's from the _90s_. By 2003 thread-
per-client was already obsolete and known to be very bad.

It's really quite simple: threads encourage the use of implicit program state
via function calls, with attendant state expansion (stack allocation), whereas
evented I/O encourages explicit program state, which means the programmer can
make it as small as possible.

Smaller server program state == more concurrent clients for any given amount
of memory. Evented I/O wins on this score.

But it gets better too! Smaller program state == less memory, L1/2/3 cache,
and TLB pressure, which means the server can take care of each client in less
time than an equivalent thread-per-client server.

So evented I/O also wins in terms of performance.
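To make the state-size point concrete, here is a minimal Python sketch (the `Connection` record and its fields are hypothetical, just to illustrate explicit per-connection state):

```python
import sys
from dataclasses import dataclass

# Event-driven style: per-connection state is an explicit record holding
# only the fields the protocol actually needs.
@dataclass
class Connection:
    fd: int
    bytes_read: int = 0
    state: str = "READING_HEADERS"

conn = Connection(fd=3)

# The record itself is tiny. A thread-per-client server keeps the same
# information implicitly in stack frames, and a default pthread stack on
# Linux reserves 8 MB of virtual address space per thread.
print(sys.getsizeof(conn) < 1024)  # → True
```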

Can you write high-performance thread-per-client code? Probably, but mostly by
allocating small stacks and making program state explicit just as in evented
I/O, so then you might as well have done that. Indeed, async/await is a
mechanism for getting thread-per-client-like sequential programming with less
overhead: "context switching" becomes as cheap as a function call, while
thread-per-client's context switches can never be that cheap.

The only real questions are:

  - async/await, or hand-coded CPS callback hell?
  - for non-essential services, do you start with thread-per-client
    because it's simpler?

The answer to the first question is utterly dependent on the language
ecosystem you choose. The answer to the second should be context-dependent: if
you can use async/await, then always use async/await, and if not, it depends
on how good you are at hand-coded CPS callback hell, and how well you can
predict the future demand for the service in question.
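The difference between the two options in the first question shows up even in a toy Python sketch (`fetch_cb` and `format_cb` are hypothetical asynchronous primitives, completed synchronously here just to show the shape of each style):

```python
import asyncio

# Hand-coded CPS: every step takes a callback, and control flow nests.
def fetch_cb(key, on_done):            # hypothetical async primitive
    on_done(f"value-for-{key}")

def format_cb(value, on_done):         # hypothetical async primitive
    on_done(value.upper())

def handle_request_cps(key, respond):
    def after_fetch(value):
        def after_format(formatted):
            respond(formatted)
        format_cb(value, after_format)
    fetch_cb(key, after_fetch)

# async/await: the same logic reads sequentially; the compiler builds
# the state machine that CPS forces you to write by hand.
async def fetch(key):
    return f"value-for-{key}"

async def handle_request(key):
    value = await fetch(key)
    return value.upper()

out = []
handle_request_cps("k", out.append)
print(out[0])                            # → VALUE-FOR-K
print(asyncio.run(handle_request("k")))  # → VALUE-FOR-K
```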

~~~
afcapel
A context switch in a modern CPU takes only a few microseconds. A GB of RAM
costs less than $10. So those concerns, although valid in theory, are usually
irrelevant for most web applications.

On the other hand, simplicity in a code base usually matters. Code written
against an evented API, littered with callbacks, is usually harder to read
and maintain than code written sequentially against a blocking I/O API.

You can recreate a sync API on top of an evented architecture using
async/await, but then you have the same performance characteristics as a
blocking API, with all the evented complexity lurking underneath and leaking
here and there. That seems a very convoluted way to arrive back at the point
where we started.

~~~
nine_k
A GB of RAM only costs less than $10 if you are buying for your unpretentious
gaming rig.

A GB of ECC server RAM costs more. An extra GB of RAM _in the cloud_ can even
cost you $10/mo if you have to switch to a beefier instance type.

~~~
gowld
$10/mo is far less than the cost of thinking about the issue at all.

~~~
cryptonector
Yes, but. Suppose you build a thread-per-client service before you realize
how much you'll have to scale it. Now you can throw more money at hardware,
or... much more money at a rewrite. Writing a CPS version to begin with would
have been prohibitive (unless you, or your programmers, are very good at
that), but writing an async/await version to begin with would not have been
much more expensive than a thread-per-client one, if at all -- that's because
async/await is intended to look and feel like thread-per-client while not
being that.

One lesson I've learned is: a) make a library from the get-go, b) make it
async/evented from the get-go. This will save you a lot of trouble down the
line.

------
dang
A thread from 2017:
[https://news.ycombinator.com/item?id=14548487](https://news.ycombinator.com/item?id=14548487)

2014:
[https://news.ycombinator.com/item?id=7684163](https://news.ycombinator.com/item?id=7684163)

2011:
[https://news.ycombinator.com/item?id=2907415](https://news.ycombinator.com/item?id=2907415)

Smaller but interesting threads from 2010-12:

[https://news.ycombinator.com/item?id=3482002](https://news.ycombinator.com/item?id=3482002)

[https://news.ycombinator.com/item?id=3101451](https://news.ycombinator.com/item?id=3101451)

[https://news.ycombinator.com/item?id=2910849](https://news.ycombinator.com/item?id=2910849)

[https://news.ycombinator.com/item?id=1547353](https://news.ycombinator.com/item?id=1547353)

[https://news.ycombinator.com/item?id=1355309](https://news.ycombinator.com/item?id=1355309)

[https://news.ycombinator.com/item?id=1269090](https://news.ycombinator.com/item?id=1269090)

------
saman_b
The results of our soon-to-be-published paper clearly confirm the claims of
this paper and show that, if implemented well, threads can perform and scale
as well as events, without much memory overhead.

I worked on this subject during my PhD, and the result is a paper that will
be published at SIGMETRICS 2020.

We developed an M:N user-level threading library and exhaustively tested it
against event-based alternatives and pthread-based solutions.

We used both memcached and web servers to test it on 32-core and 64-core
servers.

Even connection-per-pthread looks promising in terms of performance.

You can find the paper and source files here:
[https://cs.uwaterloo.ca/~mkarsten/papers/sigmetrics2020.html](https://cs.uwaterloo.ca/~mkarsten/papers/sigmetrics2020.html)
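For readers unfamiliar with M:N scheduling, the core idea (many logical threads multiplexed in user space) can be sketched with a toy cooperative scheduler built on generators; this is an illustration only, not the library from the paper:

```python
from collections import deque

def run(tasks):
    """Toy round-robin user-level scheduler: each generator is a logical
    'thread' that yields at its switch points. This is an M:1 model; an
    M:N library runs several such loops on a pool of OS threads."""
    ready = deque(tasks)
    finished = []
    while ready:
        task = ready.popleft()
        try:
            next(task)           # run until the task yields (cheap switch)
            ready.append(task)   # still alive: requeue at the back
        except StopIteration as done:
            finished.append(done.value)
    return finished

def worker(name, steps):
    for _ in range(steps):
        yield                    # cooperative "context switch"
    return f"{name}:done"

print(run([worker("a", 2), worker("b", 3)]))  # → ['a:done', 'b:done']
```

The "context switch" here is just resuming a generator, which is why user-level threading can be so much cheaper than kernel scheduling.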

~~~
zzzcpan
Well, no, you are just hacking it to support your claims. A split-stack,
work-stealing implementation is fine, nothing special, basically goroutines.
But you are not addressing the most important difference between events and
shared-memory multithreading: synchronizing access to shared memory.
Idiomatic event-driven applications don't do synchronization and have to be
sharded to scale to multiple cores. Choosing memcached and running it
multithreaded is particularly bad, as memcached is not a proper event-driven
application; it mixes threads and events and suffers from all that
synchronization overhead. At the very least you should run it in a
one-process-per-core configuration [1]. But it's much worse than that: there
is serious research in this area that addresses those problems, in particular
the Anna paper [2], which kills any possibility of shared-memory
multithreading competing as a concurrency model; it's just too broken at a
fundamental level.

[1] [https://github.com/scylladb/seastar/wiki/Memcached-Benchmark](https://github.com/scylladb/seastar/wiki/Memcached-Benchmark)

[2]
[https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf](https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf)

~~~
saman_b
I believe one should not confine event-driven only to applications that don't
do synchronisation; that's part of the misconception that leads to thinking
event-driven has higher performance. Part of the problem with threads (as you
mentioned) is synchronisation, but event-driven applications face the same
problem. We have a web server experiment that shows performance comparable to
an event-driven web server (ulib), with its various hacks to make it faster,
which is always near the top of the TechEmpower list.

Regarding memcached, the first reference you posted is from 2015. Yes, in
2015 memcached was in very bad shape in terms of synchronisation, and things
have changed significantly since that version, with locks per hash bucket
rather than a global lock, avoiding try_lock, and so on. So those results are
too old to rely on. Seastar moves the network stack to user level, and if I
remember correctly the scheduler consisted of multiple ever-looping threads
even when there was no work to do. Considering that in our experiments 40-60%
of the memcached overhead was coming from network I/O, there is no surprise
about their results. I would call this a hack for sure.

I have not read the Anna paper, so I can't comment in depth, but it seems
they are creating a key-value store using the actor model. I briefly skimmed
the paper and did not find anything that points to "killing any possibility
for shared memory multi-threading as a concurrency model"; this is a very
bold claim, and if they do make it, they should have really strong results.

But my guess is that Anna's whole actor model was implemented on top of
threads and shared-memory multi-threading, which would contradict that bold
claim. I have worked with actor models and implemented one as well; it is a
perfect fit for many use cases, but threads, at least for now, are the bread
and butter of multicore programming.

Having said that, all these models have their respective places in the
software world. What we are trying to show in our paper, through thorough
experiments, is that the misconception that event-driven has higher
performance than threaded programming is not fundamentally true. Therefore,
falling into asynchronous programming and creating hard-to-maintain
applications [1] purely for the sake of performance has no merit.

[1] [https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-k...](https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext)

------
the_unproven
Events can handle many more connections than a thread-based approach. For
example, nginx is implemented with an event-driven architecture:
[https://www.nginx.com/blog/inside-nginx-how-we-designed-for-...](https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/)

> NGINX scales very well to support hundreds of thousands of connections per
> worker process. Each new connection creates another file descriptor and
> consumes a small amount of additional memory in the worker process. There is
> very little additional overhead per connection. NGINX processes can remain
> pinned to CPUs. Context switches are relatively infrequent and occur when
> there is no work to be done.

> In the blocking, connection‑per‑process approach, each connection requires a
> large amount of additional resources and overhead, and context switches
> (swapping from one process to another) are very frequent.

This is their explanation of the events-vs-threads tradeoff. Still, a lot of
web servers today use thread-per-connection, which is acceptable since
database (e.g. Postgres) performance degrades slowly as more active
connections are introduced. [1]

[1] [https://brandur.org/postgres-connections](https://brandur.org/postgres-connections)
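The readiness-driven loop nginx describes looks roughly like the following with Python's standard `selectors` module (a minimal sketch of the pattern, not nginx's code):

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)         # never block the single loop thread
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)          # per-connection state is just an fd
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))       # ephemeral port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

# One loop serves every connection; there is no per-connection thread
# or stack, so hundreds of thousands of idle connections stay cheap:
# while True:
#     for key, _ in sel.select():
#         key.data(key.fileobj)
```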

------
Roboprog
I guess it depends.

I guess I’m somewhat used to the asynchronous I/O model that things like Win16
and JavaScript use. You have to think a bit more about the unpredictable order
that things happen in, but not about preemption race conditions and data
corruption.

The only threading model I have seen that I care for is Erlang’s. The C ...
Java model is a data corruption death trap.

At that, Erlang kind of straddles the gap between independent execution and
communicating sequential processes.

~~~
Roboprog
Oh wait, the article is about performance, not safety. Real programmers don’t
care about reliability, only if it’s performant at interweb scale.

Meh.

------
axismundi
and then there was that just two days ago:
[https://news.ycombinator.com/item?id=22165193](https://news.ycombinator.com/item?id=22165193)

~~~
TheFiend7
Haha I think that's the point of OP's post IMO, and people were even
discussing this dichotomy in that thread as well.

Total gold, nothing in this world is clean or definitive.

~~~
gpderetta
Yes :). I do not actually have a strong opinion on the
Event/Threads/Coroutines debate. I think they should all be considered on a
case by case basis.

~~~
bullen
I'll restate the same thing I commented on that post:

[https://news.ycombinator.com/item?id=22174201](https://news.ycombinator.com/item?id=22174201)

People are only now starting to understand that the problem with multi-core
is memory speed, and the only way to solve that is with a virtual machine
that has a concurrency-capable memory model = Java.

Inlined memory does not play well with multiple threads. Cache misses are the
way to parallelize code!

------
7532yahoogmail
Threads and events are duals, so it is possible to do both well if, in the
threads case, the implementation is done right. The author makes the case
that he's done the implementation well. The salient question then is: can the
average programmer do thread-based concurrent programming well? Given the
duality, I think we can all agree that more effort and intelligence will lead
to better solutions, whether for threads or events.

------
mcguire
This paper is a followup to:

* J. K. Ousterhout. "Why Threads Are A Bad Idea (for most purposes)". ([http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf](http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf))

* V. S. Pai, P. Druschel, and W. Zwaenepoel. "Flash: An Efficient and Portable Web Server". ([https://www.usenix.org/legacy/events/usenix99/full_papers/pa...](https://www.usenix.org/legacy/events/usenix99/full_papers/pai/pai.pdf))

* M. Welsh, D. E. Culler, and E. A. Brewer. "SEDA: An architecture for well-conditioned, scalable Internet services". ([http://www.sosp.org/2001/papers/welsh.pdf](http://www.sosp.org/2001/papers/welsh.pdf))

Those are some of the more influential papers in the development of event-
driven designs.

------
PaulHoule
Around the time this paper was written there was a lot of talk about
"single-process" web servers. I was working on a web site from which people
downloaded files, and the founder was concerned that Apache used too much
memory for that kind of request (we had many dial-up users). We tried a few
of the open-source event-based web servers, and at the time it seemed that
many of them would corrupt data -- we got all sorts of reports about it from
users.

------
dirtydroog
Didn't get past the opening paragraph. Sure, this is from 2003, but they're
wrong. Look at Redis.

Thread-per-connection is just plain dumb. Threads aren't free. Context
switching is a thing. Modern advice is to try not to have more threads than
cores. Thread-local storage is very useful, and having thousands of threads
may make TLS unfeasible.

------
gpderetta
The paper is arguing in favour of userspace threads, or M:N threading, which
was fairly popular in the late '90s before falling out of favour, then became
popular again with Go.

------
alexfromapex
Node.js seems to be doing alright with its event loop.

------
patcoll
I might add the caveat from the title to this HN submission: “(for high-
concurrency servers)”

And mention 2003 :)

~~~
dang
Added. Thanks!

------
adamnemecek
No, events are actually a spectacular idea.

