Why events are a bad idea for high-concurrency servers (2003) [pdf] (eecs.berkeley.edu)
115 points by gpderetta on Jan 30, 2020 | 45 comments



Worth noting this is from 2003. The performance concerns of event-based servers have been greatly alleviated by both hardware and software advancements.

The test setup used for this paper was a "2x2000 MHz Xeon SMP with 1 GB of RAM running Linux 2.4.20". Solid PC server iron for 2003, but basically equivalent to a $5/month server from Digital Ocean today.

If you're looking to squeeze 100,000 concurrent tasks from that $5 server, this paper is relevant to you.


C10K is the problem that evented I/O became the standard answer to, and it dates from the '90s. By 2003 thread-per-client was already considered obsolete and known to be very bad.

It's really quite simple: threads encourage the use of implicit program state via function calls, with attendant state expansion (stack allocation), whereas evented I/O encourages explicit program state, which means the programmer can make it as small as possible.

Smaller server program state == more concurrent clients for any given amount of memory. Evented I/O wins on this score.

But it gets better too! Smaller program state == less memory, L1/2/3 cache, and TLB pressure, which means the server can take care of each client in less time than an equivalent thread-per-client server.

So evented I/O also wins in terms of performance.

Can you write high-performance thread-per-client code? Probably, but mostly by allocating small stacks and making program state explicit just as in evented I/O, so then you might as well have done that. Indeed, async/await is a mechanism for getting thread-per-client-like sequential programming with less overhead: "context switching" becomes as cheap as a function call, while thread-per-client's context switches can never be that cheap.

The only real questions are:

  - async/await, or hand-coded CPS callback hell?
  - for non-essential services, do you start with
    thread-per-client because it's simpler?
The answer to the first question is utterly dependent on the language ecosystem you choose. The answer to the second should be context-dependent: if you can use async/await, then always use async/await, and if not, it depends on how good you are at hand-coded CPS callback hell, and how well you can predict the future demand for the service in question.
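
To make the async/await option concrete, here is a minimal sketch using Python's asyncio purely as an illustration (the port and handler names are made up): the handler reads like blocking thread-per-client code, but every await is a cheap suspension point on the event loop rather than an OS context switch.

  import asyncio

  async def handle_client(reader, writer):
      # Reads like sequential blocking code, but each await parks this
      # coroutine on the event loop instead of parking an OS thread.
      while True:
          data = await reader.read(4096)
          if not data:
              break
          writer.write(data)
          await writer.drain()
      writer.close()
      await writer.wait_closed()

  async def main():
      server = await asyncio.start_server(handle_client, "0.0.0.0", 8080)
      async with server:
          await server.serve_forever()

  asyncio.run(main())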


A context switch in a modern CPU takes only a few microseconds. A GB of RAM costs less than $10. So those concerns, although valid in theory, are usually irrelevant for most web applications.

On the other hand, simplicity in a code base usually matters. Code written with an evented API, littered with callbacks, is usually harder to read and maintain than code written sequentially against a blocking I/O API.

You can recreate a sync API on top of an evented architecture using async/await, but then you have the same performance characteristics as a blocking API, with all the evented complexity lurking underneath and leaking here and there. It seems to me a very convoluted way to arrive at the point where we started.
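
For what it's worth, here is roughly what a trivial echo server looks like when hand-coded against an event loop (Python's selectors module, just as a sketch with made-up names): the "what happens next" is scattered across readiness callbacks instead of reading top to bottom.

  import selectors, socket

  sel = selectors.DefaultSelector()

  def on_accept(server_sock):
      conn, _ = server_sock.accept()
      conn.setblocking(False)
      sel.register(conn, selectors.EVENT_READ, on_readable)

  def on_readable(conn):
      data = conn.recv(4096)
      if data:
          conn.sendall(data)   # simplification: assumes the send buffer accepts it all
      else:
          sel.unregister(conn)
          conn.close()

  server = socket.socket()
  server.setblocking(False)
  server.bind(("0.0.0.0", 8080))
  server.listen(128)
  sel.register(server, selectors.EVENT_READ, on_accept)

  while True:                  # the event loop: dispatch whichever callback is ready
      for key, _ in sel.select():
          key.data(key.fileobj)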


A function call takes less. And using more RAM == thrashing your caches more == slowing down. The price of RAM isn't relevant to that -- this isn't about saving money on RAM but saving cycles. Yes, yes, that's saving money per-client (just not on RAM), but you know, in a commoditized services world, that counts, and it counts for a lot.


A GB of RAM only costs less than $10 if you are buying for your unpretentious gaming rig.

A GB of ECC server RAM costs more. An extra GB of RAM in the cloud can even cost you $10/mo if you have to switch to a beefier instance type.


How much does a MB of L-n cache cost?

I don’t have the answer, but you would want to measure dollars to buy it, and nanoseconds to refill it.


That's true if you're buying OEM RAM for Dell or HP servers; there it's more like $10-20/GB. However, you can buy Crucial ECC DDR4 RAM for $6/GB, so there's a hefty OEM markup.


$10/mo is far less than the cost of thinking about the issue at all.


Yes, but. Suppose you build a thread-per-client service before you realized how much you'd have to scale it. Now you can throw more money at hardware, or... much more money at a rewrite. Writing a CPS version to begin with would have been prohibitive (unless you -or your programmers- are very good at that), but writing an async/await version to begin with would not have been much more expensive than a thread-per-client one, if at all -- that's because async/await is intended to look and feel like thread-per-client while not being that.

One lesson I've learned is: a) make a library from the get-go, b) make it async/evented from the get-go. This will save you a lot of trouble down the line.


It's actually a big problem for web servers. Consider Apache, for example, which has to use one thread per connection (yes, Apache still doesn't support events for WebSockets in 2020).

Let's say you configure it for 2000 max connections (really not much). That's 2000 threads, which means 20 GB of memory right away because the thread stack is 10 MB on Linux. It's a lot of memory, and it obliterates all caches.

You can reduce the thread stack to 1 MB (it might crash if you go lower), but any caching is still thrashed to death.
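
If you are stuck with that design, shrinking per-thread stacks is at least a cheap knob to turn; a rough back-of-envelope sketch (Python's threading.stack_size here, standing in for pthread_attr_setstacksize or, if I remember right, Apache's ThreadStackSize directive), using the 2000-connection figure from above:

  import threading

  connections = 2000
  stack_bytes = 10 * 1024 * 1024     # the figure cited above; defaults vary (often 8 MB via ulimit -s)
  print(connections * stack_bytes / 2**30, "GiB of address space just for stacks")   # ~19.5 GiB

  # Applies to threads created after this call; go too low and deep call
  # chains or big on-stack buffers will overflow and crash.
  threading.stack_size(256 * 1024)   # 256 KiB per thread
  print(connections * threading.stack_size() / 2**20, "MiB of address space for stacks")  # 500 MiB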

Next challenge: how do you think concurrency works on the OS with 2000 threads? Short answer: not great.

The software makes heavy use of shared memory segments, semaphores, atomics and other synchronization features. That code is genuinely complex, worse than callbacks. Then you run into issues because these primitives are not actually efficient when contended by thousands of threads; they might even be buggy.


What's wrong with the Apache event worker?

https://httpd.apache.org/docs/2.4/mod/event.html


It's not quite event based really. It still requires one thread per connection (websocket).


Ah I see you've dipped your toes into the Sea of Apache too. Horrible software. Should have died in 2000.


Evented has some difficulties in practice:

* On Unix, only socket IO is really evented; this can be solved by using a thread pool (the Flash paper describing this problem and solution for httpds dates to 1999[1]) but is inelegant. This is the approach Golang takes behind the scenes.

* How to scale event loop work across cores; this can be solved, but adds complexity. In contrast, threads are quite simple to scale.

* How to share accept() workload across cores; this can be solved, too, but not portably (see the sketch after this list). I.e. your 100-core server may very well be accept-limited if you have a single-core accept loop, or contend heavily on a single socket object.

* Threadpools are definitely inappropriate if you have a ton of relatively idle clients (long-poll server, or even typical webserver), but are less wasteful if you have relatively few, always-busy clients.
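
On the accept() point, a sketch of the usual non-portable answer, SO_REUSEPORT (Linux 3.9+): each worker process opens its own listening socket on the same port and the kernel spreads incoming connections across them. Python here purely for illustration; the port and worker count are made up.

  import os, socket

  def make_listener(port=8080):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      # Each worker binds its own socket to the same port; the kernel
      # load-balances new connections across all such listeners.
      s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
      s.bind(("0.0.0.0", port))
      s.listen(1024)
      return s

  # One worker process per core (the parent counts as one).
  for _ in range(os.cpu_count() - 1):
      if os.fork() == 0:
          break                       # child stops forking and runs its own loop

  listener = make_listener()
  while True:
      conn, addr = listener.accept()  # no shared accept queue to contend on
      conn.close()                    # placeholder: hand conn to this worker's event loop or threads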

I agree that the synchronous threadworker design isn't the best choice for high-performance services that can afford a lot of engineering time for design and bug hunting. But thread-per-client is often a completely acceptable place to start.

[1]: https://www.usenix.org/events/usenix99/full_papers/pai/pai.p...


These are minor things, and not really accurate, and at any rate, not relevant to the point that evented I/O == explicit, thus easy-to-minimize program state, while thread-per-client == program state expansion and costlier context switches.


I don't normally care about downvotes, but if I give out a nice explanation and you don't like it, it'd be nice to get a reply. Was my response wrong? How? I might learn something from your response.


> The performance concerns of event-based servers have been greatly alleviated by both hardware and software advancements.

Threads haven't exactly stood still in that time either, especially if you include green threads, fibers, coroutines, etc.

> If you're looking to squeeze 100,000 concurrent tasks from that $5 server, this paper is relevant to you.

It's relevant regardless, as part of a long-running back and forth between threads and events. For example, Eric Brewer was one of the co-authors of this paper, but also of the SEDA paper, which was seminal in promoting event-based programming. I highly doubt that we've seen the last round of this, as technology on all sides continues to evolve, and context is a good thing to know. Those who do not learn the lessons of history...


I don't think there is back and forth anymore. Actual high performance research (e.g. when the cost of a single mutex is more than the whole CPU budget for processing something, like say a packet) has been devoid of threads since they got into the mainstream, so like for almost two decades already. They are still used, because this is what hardware and OS provide to do something on each core, but not for concurrency or performance.


When your per-request time is so short, using events is easy. You don't have to worry about tying up a poller thread. (And yes, even event-based servers use threads if they want internal concurrency.) But that's a specialized domain. If requests take a substantial amount of processing or disk I/O time, naive approaches don't work. You can still use events, in a slightly more sophisticated way, if literally everything below you does async reasonably well, but any advantage over threads is much less and sometimes the situation still reverses. I work on storage servers handling multi-megabyte requests, for example, and in that milieu there is still very much back and forth.


Sure, if you are devoting the whole computer to a single microbenchmark, threads are a terrible idea. This is not necessarily the case when you have many heterogeneous applications running on a machine, though.


I'm a little confused by this- if you have multiple independent apps on a machine, would they not already be in separate OS processes?


Sure, but they still use the same kernel scheduling that threads do, and careful optimizations relying on core count = thread count are going to be basically worthless as well.


I mean the domain you're in changes your requirements drastically. Heck certain workloads (high frequency, low latency, nearly no concurrency) might run better on a single core of an overclocked i7 vs any Xeon processor.

If it's a web server then obviously you want more cores, and an event driven server would make a lot of sense.

Basically, if you need concurrency then you want events; if you're compute-bound then you don't want that overhead.

EDIT: Instead of i7 just imagine any high end (high frequency and IPC) consumer chip



The results of our to-be-published paper clearly confirm the claims of this paper and show that, if implemented well, threads can perform and scale as well as events without much memory overhead.

I worked on this subject during my PhD, and the result is a paper that will be published at SIGMETRICS 2020.

We developed an M:N user-level threading library and exhaustively tested it against event-based alternatives and pthread based solutions.

We used both memcached and web servers to test it on 32-core and 64-core machines.

Even connection-per-pthread looks promising in terms of performance.

You can find the paper and source files here: https://cs.uwaterloo.ca/~mkarsten/papers/sigmetrics2020.html


Well, no, you are just hacking it to support your claims. A split-stack work-stealing implementation is OK, nothing special, basically goroutines. But you are not addressing the most important difference between events and shared-memory multithreading: synchronizing access to shared memory. Idiomatic event-driven applications don't do synchronization and have to be sharded to scale to multiple cores. Choosing memcached and running it multithreaded is particularly bad, as memcached is not a decent event-driven application; it mixes threads and events and suffers from all that synchronization overhead. At the very least you should run it in a one-process-per-core configuration [1]. But it's much worse than that: there is some serious research in this area that addresses those problems, in particular the Anna paper [2]. It kills any possibility for shared-memory multithreading as a concurrency model to compete with anything; it's just too broken on a fundamental level.

[1] https://github.com/scylladb/seastar/wiki/Memcached-Benchmark

[2] https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf


I believe one should not confine event-driven only to applications that don't do synchronisation; that's part of the misconception that leads to thinking event-driven has higher performance. Part of the problem with threads (as you mentioned) is synchronisation, but the same problem exists in event-driven applications. We have a web server experiment that shows performance comparable to an event-driven web server (ulib) which, with its various hacks to make it faster, is always near the top of the TechEmpower list.

Regarding memcached, the first reference you posted is from 2015. Yes, in 2015 memcached was in very bad shape in terms of synchronisation, and things have changed significantly since that version, with locks per hash bucket rather than a global lock, avoiding try_lock and .... So those results are too old to rely on. Seastar moves the network stack to user level, and if I remember correctly the scheduler consisted of multiple ever-looping threads even if there was no work to do. Considering that in our experiments 40-60% of the memcached overhead was coming from network I/O, there is no surprise about their results. I would call this a hack for sure.
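
For anyone who hasn't followed that change: "locks per hash bucket" is plain lock striping. A toy sketch of the idea (not memcached's actual code; Python, names made up), where contention is spread over N locks instead of serialized on one global lock:

  import threading

  class StripedDict:
      def __init__(self, stripes=64):
          self._locks = [threading.Lock() for _ in range(stripes)]
          self._buckets = [dict() for _ in range(stripes)]

      def _stripe(self, key):
          # Map each key to one lock/bucket pair.
          return hash(key) % len(self._locks)

      def set(self, key, value):
          i = self._stripe(key)
          with self._locks[i]:
              self._buckets[i][key] = value

      def get(self, key, default=None):
          i = self._stripe(key)
          with self._locks[i]:
              return self._buckets[i].get(key, default)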

I have not read the Anna paper closely enough to comment, but it seems that they are creating a key-value store using the actor model. I briefly skimmed through the paper, and I did not find anything that points to "killing any possibility for shared memory multi-threading as a concurrency model"; this is a very bold claim, and if they do claim that, they should have really strong results.

But my guess is Anna's whole actor model was implemented on top of threads and shared-memory multi-threading, which would be in contrast to that bold claim. I have worked with actor models and implemented one as well; it is a perfect fit for many use cases, but threads, at least for now, are the bread and butter of multicore programming.

Having said that, all these models have their respective places in the software world. What we are trying to show in our paper, through thorough experiments, is that the belief that event-driven has inherently higher performance than thread programming is a misconception. Therefore, falling into asynchronous programming and creating hard-to-maintain applications [1] only for the sake of better performance has no merit.

[1] https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-k...


[flagged]


You know what's really silly? Conflating "event driven vs. threads" with "shared memory vs. non-shared". They're two different things. In fact, "silly" is too charitable a word to describe your abuse of terminology to support a clearly-refuted point.


We've asked you many times to stop breaking the site guidelines in comments here. If you keep doing it, we are going to have to ban you. Please review the rules and stick to them from now on: https://news.ycombinator.com/newsguidelines.html.


Events can handle many more connections than a thread-based approach. For example, nginx is implemented with an event-driven architecture: https://www.nginx.com/blog/inside-nginx-how-we-designed-for-...

> NGINX scales very well to support hundreds of thousands of connections per worker process. Each new connection creates another file descriptor and consumes a small amount of additional memory in the worker process. There is very little additional overhead per connection. NGINX processes can remain pinned to CPUs. Context switches are relatively infrequent and occur when there is no work to be done.

> In the blocking, connection‑per‑process approach, each connection requires a large amount of additional resources and overhead, and context switches (swapping from one process to another) are very frequent.

This is their explanation of the events vs. threading approaches. Still, a lot of web servers today use thread-per-connection, which is acceptable since database (e.g. Postgres) performance degrades only slowly as more active connections are introduced. [1]

[1] https://brandur.org/postgres-connections
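
For reference, the thread-per-connection shape being defended here is about as simple as a server gets; a minimal sketch (Python, details made up), with the trade-off that every idle client still pins down a stack and a kernel task:

  import socket, threading

  def handle(conn):
      # One OS thread blocks in recv() per client: trivially easy to read,
      # but each idle client still costs a stack and scheduler bookkeeping.
      with conn:
          while True:
              data = conn.recv(4096)
              if not data:
                  return
              conn.sendall(data)

  server = socket.socket()
  server.bind(("0.0.0.0", 8080))
  server.listen(128)
  while True:
      conn, _ = server.accept()
      threading.Thread(target=handle, args=(conn,), daemon=True).start()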


I guess it depends.

I guess I’m somewhat used to the asynchronous I/O model that things like Win16 and JavaScript use. You have to think a bit more about the unpredictable order that things happen in, but not about preemption race conditions and data corruption.

The only threading model I have seen that I care for is Erlang’s. The C ... Java model is a data corruption death trap.

At that, Erlang kind of straddles the gap between independent execution and communicating sequential processes.


Oh wait, the article is about performance, not safety. Real programmers don’t care about reliability, only if it’s performant at interweb scale.

Meh.


And then there was this just two days ago: https://news.ycombinator.com/item?id=22165193


Haha I think that's the point of OP's post IMO, and people were even discussing this dichotomy in that thread as well.

Total gold, nothing in this world is clean or definitive.


Yes :). I do not actually have a strong opinion on the Event/Threads/Coroutines debate. I think they should all be considered on a case by case basis.


I'll restate the same thing I commented on that post:

https://news.ycombinator.com/item?id=22174201

People are only now starting to understand that the problem with multi-core is memory speed, and the only way to solve that is by using a virtual machine with a concurrency-capable memory model = Java.

In-lined memory does not play well with multiple threads. Cache misses are the way to parallelize code!


Threads and events are duals, so it is possible to do both well if, in the threads case, the implementation is done right. The authors make the case that they've done the implementation well. The salient question then is: can the average programmer do thread-based concurrent programming well? Since they are duals, I think we can all agree that more effort/intelligence will lead to better solutions for either threads or events.


This paper is a followup to:

* J. K. Ousterhout. "Why Threads Are A Bad Idea (for most purposes)". (http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf)

* V. S. Pai, P. Druschel, and W. Zwaenepoel. "Flash: An Efficient and Portable Web Server". (https://www.usenix.org/legacy/events/usenix99/full_papers/pa...)

* M. Welsh, D. E. Culler, and E. A. Brewer. "SEDA: An architecture for well-conditioned, scalable Internet services". (http://www.sosp.org/2001/papers/welsh.pdf)

Those are some of the more influential papers in the development of event-driven designs.


Around the time that paper was written there was a lot of talk about "single-process" web servers. I was working on a web site from which people downloaded files, and the founder was concerned that Apache used too much memory for that kind of request (we had many dial-up users). We tried a few of the open source event-based web servers, and at the time it seemed that many of them would corrupt data -- we would get all sorts of reports about it from users.


Didn't get past the opening paragraph. Sure, this is from 2003, but they're wrong. Look at Redis.

Thread-per-connection is just plain dumb. Threads aren't free. Context switching is a thing. Modern advice is to try not to have more threads than cores. Thread-local storage is very useful; having thousands of threads may make TLS unfeasible.
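
A sketch of the "no more threads than cores" advice, assuming CPU-bound work (Python here just to show the shape; in CPython the GIL means you'd actually reach for a process pool, but the sizing idea is the same):

  import os
  from concurrent.futures import ThreadPoolExecutor

  def handle(n):
      # Stand-in for CPU-bound per-request work.
      return sum(i * i for i in range(n))

  # Cap the pool at the core count so compute-bound work doesn't oversubscribe
  # the machine; extra tasks queue up instead of adding threads and context switches.
  with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
      results = list(pool.map(handle, [100_000] * 64))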


The paper is arguing in favour of userspace threads or M:N threading, which were fairly popular in the late '90s before falling out of favour. Then they got popular again with Go.


NodeJS seems to be doing alright with its event loop.


I might add the caveat from the title to this HN submission: “(for high-concurrency servers)”

And mention 2003 :)


Added. Thanks!


No, events are actually a spectacular idea.



