- The overhead of a thread context switch is expensive due to cache thrashing
- If the context switch is due to IO readiness, the overhead is equivalent for both
- The main benefit of async is the coding model and the power it provides (cancellation, selection)
- Threads won't scale past tens of thousands of connections
- Async is about better utilizing memory, not performance
- A normal-sized Linux server can handle a million threads without much trouble, async isn't worth it (harder coding model, colored functions)
Is the context switch overhead the same for both? If so, why can't threads scale to 100k+ connections?
Wouldn't the same constraints be imposed on a million async tasks? What allows async tasks to be more active than threads? Is it the scheduling model, the overhead of context switching, or something else?
EDIT: I reread and saw that you mentioned context switching. I guess my question is then the same as the one here: https://news.ycombinator.com/item?id=29452759. Is the claim true that the context-switching overhead is the same when it's due to I/O readiness? I'm mostly thinking about network servers, where I think most context switches would be due to I/O readiness.
Yes, switching from one activity to another also has costs, but you only pay what you need to pay and much of the overhead is handled automatically.
That said, blocking-for-IO threads do have one huge advantage - the OS knows when to wake them up. With async, something has to look for changes in IO state. Check too often, and you're wasting time. Don't check often enough and you're wasting bandwidth.
Regardless of how the OS handles IO, user code can either poll the OS or have the OS wake up a thread when there's a change, such as data arriving for a blocking call.
Perhaps “user-land” is that missing aspect.
Think about how zero-buffer embedded network software architects deal with having no interrupts and no context switching.
Data comes in. Via some combination of device configuration and device handling code, it gets put somewhere. Meanwhile, there's some other code that will process said data, but the system is doing lots of other things.
There are two ways to know when to run said "other code". The code that handled the hardware can start said other code or something else in the system can ask "is there new data" periodically. One implementation of "can start" is waking up a thread.
Note that the "interrupt vs polling" decision for the "handle hardware" code does not dictate which way is used to know when to run said other code.
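The two options above ("something can start said other code" vs. "ask periodically") can be sketched in Python. This is just an illustrative stand-in for the hardware scenario, with `threading.Event` playing the role of the wakeup mechanism; all names here are hypothetical:

```python
# Sketch: two ways the "other code" can learn there is new data,
# independent of how the hardware itself was serviced.
import queue
import threading
import time

data_q = queue.Queue()
ready = threading.Event()

def hardware_handler(payload):
    """Stands in for the device-handling code: put the data somewhere."""
    data_q.put(payload)
    ready.set()               # "can start": wake whoever is waiting

def consumer_wakeup():
    ready.wait()              # blocks until woken; the system knows when to run us
    return data_q.get()

def consumer_polling(interval=0.01):
    while data_q.empty():     # "is there new data?" asked periodically
        time.sleep(interval)  # check too often: wasted time; too rarely: latency
    return data_q.get()
```

The polling variant makes the trade-off from the earlier comment concrete: the `interval` parameter is exactly the "check too often vs. not often enough" knob.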
Switching async tasks should have a smaller overhead than switching threads. A context switch involves giving control to the OS and then the OS giving it back at some point. Both directions involve lots of cache thrashing, as you said, plus bookkeeping that the OS has to do, which probably also causes instruction-cache evictions.
Switching async tasks involves loading the new task from memory onto the stack. That's it. The program can immediately start doing useful work again.
So, the short answer is that switching threads involves:
- more cache evictions
- bigger, slower cache evictions
- OS-level bookkeeping for threads
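To make "switching async tasks is just user-space work" concrete, here is a minimal cooperative scheduler sketched with Python generators. It's an illustration of the idea, not how any particular runtime is implemented; resuming a task is an ordinary function call, with no syscall and no kernel bookkeeping:

```python
# Sketch: task switching as plain user-space control transfer.
# Each task is a generator; yielding hands control back to the
# scheduler, and resuming it is just next() -- no kernel involved.
from collections import deque

def scheduler(tasks):
    ready = deque(tasks)
    order = []                        # record the interleaving for illustration
    while ready:
        task = ready.popleft()
        try:
            order.append(next(task))  # the "context switch": resume the generator
            ready.append(task)        # still has work: back of the run queue
        except StopIteration:
            pass                      # task finished; drop it
    return order

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"           # cooperative yield point

# scheduler([worker("a", 2), worker("b", 2)]) interleaves the two
# workers round-robin: ["a:0", "b:0", "a:1", "b:1"]
```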
One of the points on the list was that if the context switch is due to I/O readiness, then there's more work for the async task to do:
> Think about it this way—if you have a user-space thread which wakes up due to I/O readiness, then this means that the relevant kernel thread woke up from epoll_wait() or something similar. With blocking I/O, you call read(), and the kernel wakes up your thread when the read() completes. With non-blocking I/O, you call read(), get EAGAIN, call epoll_wait(), the kernel wakes up your thread when data is ready, and then you call read() a second time.
> In both scenarios, you’re calling a blocking system call and waking up the thread later.
This is true for readiness-based IO (which, admittedly, most current async IO loops use), but completion-based IO (such as IOCP or io_uring) doesn't suffer from this problem: you just add your IO operation to the queue and make a single syscall that returns once one of the operations in your queue has completed, AFAIK.
Right, so if most async I/O frameworks use readiness-based IO (epoll, kqueue), then the context-switch overhead is similar. So then most of the performance arguments for async I/O don't hold. That's where I'm confused :)
In theory, a compiler can (should) rearrange order of execution for non-async programs also.
- Imagine you have a monolith that mostly talks to the database. You have a primary with multiple read replicas. How are you going to take advantage of millions of async tasks if your IO concurrency ends up being capped at a couple of thousand database connections?
- The second question is: even if you can have a million requests in flight, do you really want to have such a large blast radius on a single server?
You use a connection pool and multiplex queries over a few persistent connections. This problem exists regardless of whether you use async or thread-per-client. Process per client (i.e. PHP CGI) requires an external connection pool like pgpool.
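A connection pool of the kind described above can be sketched in a few lines. This is a simplified illustration, not any particular driver's API; `FakeConn` is a hypothetical stand-in for a real database connection:

```python
# Sketch: many concurrent requests multiplexed over a small, fixed
# set of database connections.
import queue

class ConnectionPool:
    def __init__(self, make_conn, size):
        self._conns = queue.Queue()
        for _ in range(size):        # pool size caps DB-side concurrency
            self._conns.put(make_conn())

    def acquire(self):
        return self._conns.get()     # blocks while all connections are busy

    def release(self, conn):
        self._conns.put(conn)        # hand the connection to the next waiter

class FakeConn:
    """Hypothetical stand-in for a real driver connection."""
    def query(self, sql):
        return f"rows for {sql!r}"

pool = ConnectionPool(FakeConn, size=4)  # thousands of tasks can share 4 conns
```

The key point from the comment above survives in the sketch: whether the callers are async tasks or threads, `acquire()` is where in-flight work queues up behind the handful of real connections.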
> - The second question is: even if you can have a million requests in flight, do you really want to have such a large blast radius on a single server?
What's the alternative? When there's a billion connections going through round robin DNS to a thousand load balancers and another hundred million going to backups on failover, everyone's going to have to handle that level of traffic. The scale of the internet is huge and it only just barely works because many services (like DNS, LBs, and databases + their read replicas) can operate at that kind of scale.
The people solving these kinds of hard problems have a lot of money that ends up sloshing around via open source github projects, thus influencing the wider engineering community. Many naturally gravitate towards these approaches because they are well tested and usually guaranteed overkill, right in that sweet spot between cover-my-ass and proven-but-shiny. In reality, async is one of those "if you have to ask, you [probably] don't need it" kind of things.
One thing I have been using as a way of understanding high-perf backends is analyzing why the vert.x framework (and its underlying server, netty) does so well on benchmarks, but as a newb, I do not think I would get a lot from that exercise without a little hand-holding.
As of today, #28 on the TechEmpower benchmarks - https://www.techempower.com/benchmarks/
But first and foremost, concurrency is not parallelism. In my practical experience dealing with both, async is a language semantic for concurrency, threads are an OS feature for parallelism. It's a little nonsensical to ask the pros/cons since they are different things entirely; you can use both at the same time (and in practice most async implementations use threads).
 https://www.youtube.com/watch?v=oV9rvDllKEg "Concurrency is not Parallelism" talk by Rob Pike.
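The "you can use both at the same time" point can be sketched with Python's asyncio, which dispatches blocking work to a thread pool while the event loop keeps other tasks moving. A minimal sketch, with `blocking_io` as a hypothetical stand-in for any blocking call (file IO, DNS, a C library):

```python
# Sketch: async for concurrency semantics, threads for the actual
# blocking/parallel work, combined in one program.
import asyncio
import time

def blocking_io(n):
    time.sleep(0.05)   # stand-in for a blocking call
    return n * 2

async def main():
    loop = asyncio.get_running_loop()
    # Concurrency: three tasks in flight at once.
    # Parallelism: each blocking call runs on an OS thread
    # from the loop's default executor.
    return await asyncio.gather(
        *(loop.run_in_executor(None, blocking_io, n) for n in (1, 2, 3))
    )

# asyncio.run(main()) -> [2, 4, 6]
```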
Threads:
+ Simpler programming model
- More resource intensive, so you usually need more user code to use them effectively

Async tasks:
+ Less resource intensive, so you can have more of them and need less user code for management
- More complex programming model

Green threads (e.g., goroutines):
+ Simplest programming model
+ Resource efficient
- Requires a runtime
Isn't Go preemptive multi-tasking instead of cooperative multi-tasking?