Hacker News
Async I/O for Dummies (2018) (alwaysrightinstitute.com)
61 points by mpweiher 53 days ago | hide | past | favorite | 27 comments



I'm quite confused about the benefits of async I/O over a blocking thread pool. I can't reconcile the many claims I've read. Everyone does seem to agree that most people don't need async, but past that...

- The overhead of a thread context switch is expensive due to cache thrashing

- If the context switch is due to IO readiness, the overhead is equivalent for both

- The main benefit of async is the coding model and the power it provides (cancellation, selection)

- Threads won't scale past tens of thousands of connections

- Async is about better utilizing memory, not performance

- A normal-sized Linux server can handle a million threads without much trouble, async isn't worth it (harder coding model, colored functions)

Is the context switch overhead the same for both? If so, why can't threads scale to 100k+ connections?


Whether a single server can handle a million threads depends on how often those threads need to wake up, and how much work they need to do before going back to sleep. Thread pools don't even come close to working well for storage IO: with dozens or hundreds of threads per core, each blocking on just a few kB of IO that the SSDs can complete in milliseconds or less, you spend far too much time on context switches. Network IO is usually lower in aggregate throughput and much higher in latency, so each thread spends relatively more time sleeping, and you can therefore fit more of them into a single machine's available CPU time. But a million threads on one box imposes significant constraints on how active those threads can be.


> But a million threads on one box imposes significant constraints on how active those threads can be.

Wouldn't the same constraints be imposed on a million async tasks? What allows async tasks to be more active than threads? Is it the scheduling model, the overhead of context switching, or something else?

EDIT: I reread and saw that you mentioned context switching. I guess my question is then the same as the one here: https://news.ycombinator.com/item?id=29452759. Is the claim true that the context-switching overhead is the same when the switch is due to I/O readiness? I'm mostly thinking about network servers, where I'd guess most context switches are due to I/O readiness.


If nothing else, a green-thread context switch is much cheaper than a real context switch because you don't have the security overhead of flushing caches/TLBs/etc.


The switch overhead need not be the same. When the OS wakes up a thread and schedules a processor to run it, it has to make sure that said processor has whatever state it needs. If said processor was previously doing something else, the OS has to save the relevant state. Note that the OS doesn't know what's important, so both save and restore are conservative.

Yes, switching from one activity to another also has costs, but you only pay what you need to pay and much of the overhead is handled automatically.

That said, blocking-for-IO threads do have one huge advantage - the OS knows when to wake them up. With async, something has to look for changes in IO state. Check too often, and you're wasting time. Don't check often enough and you're wasting bandwidth.


interrupts?


Polling driven async is faster than callback based async up and down the stack almost without exception, including interrupts. Interrupts are incredibly slow on modern hardware.


Interrupts vs polling hardware is in the OS.

Regardless of how the OS handles IO, user code can either poll the OS or have the OS wakeup a thread when there's a change, such as data for a blocking call.


As a long-time embedded engineer, I violently recoiled at the over-generalization: “in the OS”.

Perhaps “user-land” is the missing qualifier.

Think about how zero-buffer embedded network software architectures deal with having no interrupts and no context switching.


Even in embedded land, there is code that manages IO devices. And in many/most embedded systems there is some separation of concerns, if only because such systems do multiple things.

Data comes in. Via some combination of device configuration and device handling code, it gets put somewhere. Meanwhile, there's some other code that will process said data, but the system is doing lots of other things.

There are two ways to know when to run said "other code": the code that handled the hardware can start it, or something else in the system can periodically ask "is there new data?". One implementation of "can start" is waking up a thread.

Note that the "interrupt vs polling" decision for the "handle hardware" code does not dictate which way is used to know when to run said other code.


Interrupts vs hardware polling is in the OS. I'm writing about how user code gets "data ready" from the OS, which is independent of how the OS handles hardware.


Your last sentence makes more sense. But it is in no way related to your first.


I'm only familiar with Rust's async story, though I think the following probably applies to other languages as well.

Switching async tasks should have a smaller overhead than switching threads. A context switch involves giving control to the OS and then the OS giving it back at some point. Both directions involve lots of cache thrashing, as you said, and bookkeeping that the OS has to do. This probably also involves instruction-cache evictions.

Switching async tasks involves loading the new task from memory onto the stack. That's it. The program can immediately start doing useful work again.

So, the short answer is that switching threads involves:

    - more cache evictions
    - bigger, slower cache evictions
    - OS-level bookkeeping for threads
Also, creating a thread involves setting up a new, expandable stack in the process's memory space, while creating a new task involves a single fixed-size heap allocation.


> Switching async tasks involves loading the new task from memory onto the stack. That's it.

One of the points on the list was that if the context switch is due to I/O readiness, then there's more work for the async task to do [0]:

> Think about it this way—if you have a user-space thread which wakes up due to I/O readiness, then this means that the relevant kernel thread woke up from epoll_wait() or something similar. With blocking I/O, you call read(), and the kernel wakes up your thread when the read() completes. With non-blocking I/O, you call read(), get EAGAIN, call epoll_wait(), the kernel wakes up your thread when data is ready, and then you call read() a second time.

> In both scenarios, you’re calling a blocking system call and waking up the thread later.

0: https://news.ycombinator.com/item?id=26110699


> this means that the relevant kernel thread woke up from epoll_wait() or something similar. With blocking I/O, you call read(), and the kernel wakes up your thread when the read() completes. With non-blocking I/O, you call read(), get EAGAIN, call epoll_wait(), the kernel wakes up your thread when data is ready, and then you call read() a second time.

This is true for readiness-based IO (which, admittedly, most current async IO loops use), but completion-based IO (such as IOCP or io_uring) doesn't suffer from this problem: you just add your IO operations to the queue and make a single syscall that returns once one of them has completed, AFAIK.


> This is true for readiness-based IO (which, admittedly, most current async IO loops are using)

Right, so if most async I/O frameworks use readiness-based IO (epoll, kqueue), then the context switch overhead is similar. So then most of the performance arguments for async I/O don't stand. That's where I'm confused :)


Async programs are doing lazy evaluation, so they are better at discovering the critical path of execution. It's similar to out-of-order execution of instructions in a CPU, but at a higher level.

In theory, a compiler can (should) rearrange the order of execution for non-async programs too.


I also have some doubts for some specific scenarios:

- Imagine you have a monolith that mostly talks to the database. You have a primary with multiple read replicas. How are you going to take advantage of millions of async tasks if your IO concurrency ends up being capped at a couple of thousand database connections?

- The second question is: even if you can have a million requests in flight, do you really want to have such a large blast radius on a single server?


> - Imagine you have a monolith that mostly talks to the database. You have a primary with multiple read replicas. How are you going to take advantage of millions of async tasks if your IO concurrency ends up being capped at a couple of thousand database connections?

You use a connection pool and multiplex queries over a few persistent connections. This problem exists regardless of whether you use async or thread-per-client. Process-per-client (e.g. PHP CGI) requires an external connection pool like pgpool.

> - The second question is: even if you can have a million requests in flight, do you really want to have such a large blast radius on a single server?

What's the alternative? When there's a billion connections going through round robin DNS to a thousand load balancers and another hundred million going to backups on failover, everyone's going to have to handle that level of traffic. The scale of the internet is huge and it only just barely works because many services (like DNS, LBs, and databases + their read replicas) can operate at that kind of scale.

The people solving these kinds of hard problems have a lot of money that ends up sloshing around via open source github projects, thus influencing the wider engineering community. Many naturally gravitate towards these approaches because they are well tested and usually guaranteed overkill, right in that sweet spot between cover-my-ass and proven-but-shiny. In reality, async is one of those "if you have to ask, you [probably] don't need it" kind of things.


I/O perf is usually a measure of throughput, and while concurrent connections are certainly part of that calculus other factors tend to dominate.


Java's new preview of virtual threads is relevant. It promises the ease-of-use and ease-of-debugging of synchronous blocking threads, but the scalability and low memory usage of asynchrony. https://openjdk.java.net/jeps/8277131 ; https://news.ycombinator.com/item?id=29236375


Backend newb here. I have a dumb question: can someone give me a summary, or a blog post with a summary, of the pros/cons of async versus threads?

One thing I have been using as a way of understanding hi-perf backends is analyzing why the vert.x framework [1] (and its underlying server, netty) does so well on benchmarks [2], but as a newb, I don't think I would get a lot from that exercise without a little hand-holding.

[1] https://vertx.io/

[2] As of today, #28 on tech empower benchmarks - https://www.techempower.com/benchmarks/


As with all benchmarks, benchmark your workload. The tech empower stuff is gamed, like all rankings.

But first and foremost, concurrency is not parallelism [1]. In my practical experience dealing with both, async is a language semantic for concurrency, while threads are an OS feature for parallelism. It's a little nonsensical to ask for the pros/cons, since they are entirely different things; you can use both at the same time (and in practice most async implementations use threads).

[1] https://www.youtube.com/watch?v=oV9rvDllKEg "Concurrency is not Parallelism" talk by Rob Pike.


OS Threads:

+ Simpler programming model

- More resource intensive, so you usually need more user code to use them effectively

Async:

+ Less resource intensive, so you can have more of them and need less user code for management

- More complex programming model

Green threads:

+ Simplest programming model

+ Resource efficient

- Requires a runtime


> “Manual” async programming with callback functions/closures is not the only way to do cooperative multi-tasking. Google’s Go language takes another approach known as “green threads”, or “user level” threads (vs operating system / kernel managed threads).

Isn't Go preemptive multi-tasking instead of cooperative multi-tasking?


IIRC it used to be cooperative (inserting yield points in specific places), but that caused trouble with tight loops and they eventually added preemption.


Not sure if it's a Swift NIO introduction or supposed to be a general asynchronous I/O introduction (as the title suggests). If it's the latter, it seems quite narrow, only covering a few related/similar approaches.




