Efficient IO with io_uring [pdf] (kernel.dk)
255 points by diegocg 14 days ago | 64 comments



I made a prototype of a replacement for Node.js' fs.read/write using io_uring in hopes of it ultimately replacing libuv's current threadpool implementation. Definitely promising! Many props to Jens for his hard work on this and for his helpful discussions on the API.

Benchmarks at [0] and prototype repo at [1]

[0] https://github.com/libuv/libuv/issues/1947#issuecomment-4852... [1] https://github.com/zbjornson/node-io-uring


io_uring sounds promising.

It'll be interesting to see how long it takes for Tokio to add support for it in the Rust community, and how long it takes for Go to transparently support this instead of a pool of file I/O workers.


It will be interesting to see whether Rust's async ecosystem gets support for it. However, there is a bit of a challenge in the form of an impedance mismatch: Rust Futures, and thereby also async functions, require that an operation is cancellable at any point in time (by dropping the respective Future). IO-completion-based systems (like io_uring and IOCP) typically don't support this. If the operation were cancelled in Rust userspace while the kernel proceeds with it, the kernel would act on invalid memory. There are some workarounds for this, like wrapping the APIs in ones that only use owned buffers that are passed in for the whole duration of the operation, and reference-counting the IO objects (file descriptors). But the behavior is pretty tricky. [1] describes the problem.

[1] https://github.com/rust-lang-nursery/futures-rs/issues/1278


I wonder if it would be possible to pass ownership of a structure that contains the buffer to the kernel, passing a pointer in the user_data. The application could then `std::mem::forget` the buffer and re-construct it from the raw pointer when the completion event arrives. The Future that originated the IO job could then be dropped independently from the buffer.

That's really interesting. I think some kind of reference counting and a shadow event loop is going to be necessary, if that's the case, so that I/O futures can always be driven to completion even if the developer-facing one is dropped.

Yes, reference-counting of outstanding operations and the used buffers is a suitable workaround.

It comes with some downsides too: the IO operations then only accept certain specialized buffer types, and nothing generic like plain byte slices anymore, which might make those things less interoperable. The other issue is that while refcounting at least leaves the memory in a non-corrupted state, the IO source will be in a somewhat undefined state after the cancellation. E.g. if a write operation had been issued on the socket, the API user can not make any assumptions about whether any bytes were written, or what impact a subsequent write might have. The only reasonable thing to do is to close the IO source (where the close might also need to wait until the still-pending background task has completed).


If the future has been dropped, I agree that the source/sink should be closed as soon as safely possible. Usually the resource has to be moved into a future for it to be used, and thus it's logical that no one could continue using it after the future is dropped.

io_uring definitely wants to own the buffers, either for its entire lifetime (with preregistered memory mappings) or at least until completion.

if possible the future should take ownership of the buffer and have it returned on completion.


The bug is here, if you want to follow it: https://github.com/tokio-rs/mio/issues/923


https://github.com/stbuehler/rust-io-uring

Note that io_uring in linux 5.1 has various bugs; especially READ* and WRITE* on sockets/pipe don't work very well (afaik they either block submission or return -EAGAIN with O_NONBLOCK), so the example code in my repo isn't that useful for linux 5.1 (POLL_ADD works fine though).

For proper support in tokio this would have to be done in mio afaict (or tokio would have to replace mio with something else).


do those issues only apply to networking? the big benefit I see for io_uring is that it provides a truly async interface for file I/O for the first time ever in Linux... as far as I can tell.

I'm not sure about the details for file I/O; files with FMODE_NOWAIT should be fine (block devices, ext4, btrfs, ...).

The idea is that during submission an operation is started with IOCB_NOWAIT; an operation shouldn't block with that. If it returns EAGAIN it will be retried in another kernel thread without IOCB_NOWAIT. Afaik IOCB_NOWAIT support should be announced through FMODE_NOWAIT.

Right now operations that don't support IOCB_NOWAIT ("magic check" in io_file_supports_async [1]) are still run during submission but without the IOCB_NOWAIT flag.

Sockets/pipes don't handle IOCB_NOWAIT, they only care about O_NONBLOCK right now (which is a bug imho).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


That's a huge improvement! Really looking forward to Node taking advantage of this.


If I'm reading your chart correctly, it looks like the thread pool of size 1 is still "beating" the uring, right? What would explain this?


Lower numbers are better in that chart, so the size-one threadpool is only winning on the rightmost part of the plot: just barely at parallelism 16 and slightly more at 32.


Makes sense - thanks. Looks like this is a huge win for low-parallelism systems and a very minor loss at high parallelism.


I did something very similar for a logging system, more than ten years ago.

Circular buffers were shared between user space and kernel. (Allocated by user space, mapped into kernel with get_user_pages.)

User space threads just added messages to their circular buffer by copying in data, hitting a barrier and updating a head pointer.

Like io_uring, I avoided too many system calls by having a kernel thread poll for changes. User space would only make a system call when the ring would near a "high water mark", like 85% full or whatever.
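
In rough C, the producer side amounts to something like the sketch below (names and layout are illustrative, not the original code): copy the message into the shared ring, then publish it with a release store on the head counter so the kernel-side poller never sees a half-written message.

    /* Illustrative sketch of the producer side; one ring per thread. */
    #include <stdatomic.h>
    #include <stdint.h>

    struct log_ring {
        _Atomic uint32_t head;   /* advanced by user space, read by the kernel thread */
        _Atomic uint32_t tail;   /* advanced by the kernel thread as it drains messages */
        uint32_t         size;   /* capacity in bytes, power of two */
        unsigned char    data[]; /* message bytes */
    };

    /* Returns 0 on success, -1 when the ring is near the high water mark
     * (the caller would then make the system call to kick the kernel). */
    static int log_ring_push(struct log_ring *r, const void *msg, uint32_t len)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if ((head - tail) + len > (r->size / 100) * 85)   /* ~85% full */
            return -1;

        for (uint32_t i = 0; i < len; i++)                /* copy, handling wrap */
            r->data[(head + i) & (r->size - 1)] = ((const unsigned char *)msg)[i];

        /* The barrier: the copied data must be visible before the new head value. */
        atomic_store_explicit(&r->head, head + len, memory_order_release);
        return 0;
    }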

The circular buffers could outlive the process which created them and attached them. A process could crash and go away; the kernel thread would still process the last messages written right up to the crash.

Not everyone around was keen on the idea. What do you mean just stick it into a buffer and the kernel will pick it up. Polling? Bah.


Interesting implementation. How did the buffers survive process crash? Is that a side-effect of get_user_pages or did you have a userspace daemon?


It was a side effect of get_user_pages, which obtains durable references to pages that have to be explicitly dropped. These were held by the logging module independently of the processes interacting with it.

(The kernel thread was actually provided by a system call from user space. So there was a daemon as a user space program, but that just called into the kernel, where it then sat in a loop. This is nice: you have something in user space that you can kill with SIGTERM or whatever and restart by running some /usr/bin/executable.)

If the daemon terminated, that didn't destroy any of the buffers; they would just sit there getting backlogged. New ones could be created, too (via ioctl calls on a character device in /dev).

One aspect was that get_user_pages doesn't give you a linear block of memory. I wrote the grotty C code to dequeue messages from the non-linearly-mapped circular buffer. The message headers could straddle page boundaries, so I copied those out into temporaries before working with them. I think there was some attempt to avoid making unnecessary copies of the bulk data.
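
The page-straddling copy can look roughly like this (a sketch with made-up names, not the original code): given an array of individually mapped pages backing the ring, copy len bytes starting at a logical offset into a temporary, one page-sized chunk at a time.

    /* Copy `len` bytes at logical ring offset `offset` into `dst`,
     * where the ring is backed by separately mapped pages rather than
     * one linear block of memory. */
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    static void ring_copy_out(void *dst, unsigned char *const *pages,
                              size_t ring_bytes, size_t offset, size_t len)
    {
        unsigned char *out = dst;

        while (len > 0) {
            size_t pos   = offset % ring_bytes;     /* wrap within the ring */
            size_t page  = pos / PAGE_SIZE;
            size_t off   = pos % PAGE_SIZE;
            size_t chunk = PAGE_SIZE - off;         /* bytes left in this page */

            if (chunk > len)
                chunk = len;

            memcpy(out, pages[page] + off, chunk);
            out    += chunk;
            offset += chunk;
            len    -= chunk;
        }
    }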


> One aspect was that get_user_pages doesn't give you a linear block of memory. I wrote the grotty C code to dequeue messages from the non-linearly-mapped circular buffer. The message headers could straddle page boundaries, so I copied those out into temporaries before working with them. I think there was some attempt to avoid making unnecessary copies of the bulk data.

Open question: would it have helped to have a CMA[1] region managed by a kernel module and mapping that memory/regions of that memory to user-space using remap_pfn_range[2] instead?

[1] CMA documentation file: https://lwn.net/Articles/396707/

[2] https://www.kernel.org/doc/htmldocs/kernel-api/API-remap-pfn...


That jogs my memory; in fact, aha, I used get_user_pages in an early version, and switched to remap_pfn_range; then I nuked all that complicated non-linear access code. The ioctl for attaching the buffer then allocated the memory, and its interface changed to passing down a size and receiving a pointer.

Interestingly, Genode actually also uses two rings for async bulk transfer RPC (packet streams)[0], only the rings are called submit queue and acknowledgement queue rather than submission queue and completion queue. And since Genode applications do I/O by communicating with a service implementing the block session interface[1], which uses the aforementioned packet streams, Genode essentially uses two ring buffers for async I/O just like Linux does (well, I guess it's more the other way around).

I can definitely recommend reading through the first half of Genode Foundations (the second half is just an API reference). It's really interesting and informative, and it's a surprisingly easy read as well. However, I'd wait till the 19.05 edition gets released, since that should be less than a month's wait if they manage to release it on time.

[0] https://genode.org/documentation/genode-foundations-18-05.pd... (page 92) [1] (pages 94 & 116)


Was about to say the same thing! +1 for reading the Genode manual. It is the best resource I have ever found for understanding operating systems, security, and architecture. The discussion of multiplexers and report-ROM modules is pertinent here as well (pg 133).

For a bit of advertisement and any interested parties: I packaged liburing in the NixOS unstable branch a week or two ago (Jens hasn't done a stable release just yet, however), and I also pushed the 5.1 kernel release into unstable last night. Hurrah for a new, efficient I/O interface finally appearing!


As explained by LWN: https://lwn.net/Articles/776703/


Can anyone comment on how this interface compares to state-of-the-art async i/o in other *unix o/s? Is Linux playing catch up here or innovating?


I'm not that up to date on this either, but as far as I know, the old status quo was this:

For non-blocking sockets and the like you have what grew out of the readiness model from select and poll. Today on Linux that's epoll.

This model never worked for files on disk. There is the POSIX aio API for that. On Linux, glibc implements that, and it's always been based on synchronous I/O in thread pools because the kernel support isn't good. And I guess there have been Linux-specific things like io_submit(2), which as the article states only worked with O_DIRECT and sometimes fell back to synchronous I/O.

It says the new mechanism works with both files and sockets, but I imagine it's still reasonable to use epoll for sockets, and only use this new stuff for files on disk. One good thing about the readiness model for sockets is that you don't need to allocate memory upfront for every potential read the remote host might trigger.


This explanation gave me that feeling of how utterly incredible it is that any software works as well as it does.

The amount of work that goes into supporting, debugging, developing, improving, and designing the sheer number of ways that we do things, at literally hundreds of layers of abstraction, all of which are slightly wrong and each of which does things its own way with its own tradeoffs.

I'm damn glad there are people a lot smarter and more dedicated than I am that keep this stuff running so that I can store a value in localstorage in a browser and have it punch through all of those layers in a fraction of a moment down to writing bits on one of like a dozen storage mediums on any number of OSs via multiple interfaces in multiple browsers.


Can you (or someone) explain _why_ the model doesn't translate to files on disk? Is it that there's a different set of operations that need to be supported? (you can't seek a socket I suppose)


The model for aio is "here is a buffer, let me know when you've done the operation". Your request can sit in a queue somewhere or whatever for as long as it takes.

The model for the various FD polling mechanisms is "wake me up when read(2) won't return EAGAIN when I call it later". That is very natural for sockets, because what read(2) is doing is draining a kernel side receive buffer, so the question is "wake me up when the buffer isn't empty". If the FD were a file on disk, knowing ahead of time that read(2) isn't going to block is kind of a clumsy task.
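
To make the first shape concrete, here's a minimal liburing sketch of the completion model (assuming liburing is installed; error handling omitted, and readv is used since that's the opcode 5.1 supports for files): you hand the kernel a buffer up front and later consume a completion event carrying the result, rather than waiting for readiness and then calling read(2) yourself.

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int fd = open("/etc/hostname", O_RDONLY);
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        /* Submit: give the kernel the buffer and the offset... */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_submit(&ring);

        /* ...and wait for the completion, which carries the result. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }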


>"If the FD were a file on disk, knowing ahead of time that read(2) isn't going to return quickly is kind of a clumsy task."

But isn't this exactly what happens? A process running in kernel mode calls schedule() and sleeps waiting for I/O. The reason for calling schedule() is that it already does know read isn't going to return quickly. But maybe I'm completely misunderstanding your explanation about why current AIO on Linux is clunky?


The key observation is this ahead-of-time thing. If you've reached the point of invoking the scheduler, you have already done a bunch of work to decide that, based on a request that has already started. On the other hand, if epoll_wait returned and said the fd was ready, but the read hasn't yet been issued, who is to say the conditions will be the same later? Keep in mind also that other processes can do things that evict caches, or make other requests to the same disk or controller.

For all of these cases I suppose you could make the I/O fail with EAGAIN, but this seems like a lot of potential for spurious wakeups.


Thanks for the clarification, that makes good sense. Cheers.

Most of the problems with async i/o were spelled out very well in the recent "I/O is Faster Than CPU" [1] posted a few days ago.

[1] - https://news.ycombinator.com/item?id=19818899

The ring buffer is definitely innovation.


Looks like a little bit of both. Linux has a lot of catching up to do because epoll is a mess. It is limited to file descriptors with a readiness semantic (sockets, pipes, ttys), meaning that it will never work with normal files because they are always ready while blocking whenever they feel like it. The second API fuck up is the fork/exec semantic. Oh, and they tried to get away with edge triggered instead of level triggered. All in all epoll is just a nasty case of NIH. The epoll developers just wrote code without understanding the prior art in their field (e.g. kqueue).


epoll (and even kqueue) is not aio and was never intended as such. While epoll had some early snafus, those have long been fixed (offering ET and LT) and it works well for its intended purpose (io readiness). This (io_uring) is more for a thing like Windows overlapped IO and "unix" (incl FreeBSD) aio. AIO on linux is yet a separate set of syscalls (io_*) and has since day one in the 2.5 days been pretty much a useless joke, very limited in scope (O_DIRECT only).

kqueue is neither epoll nor aio but a common queueing interface for events, which includes io readiness and aio completion as separate types. Linux prefers to split these as different types of fds. FWIW, this wasn't due to obliviousness of the existing design - https://yarchive.net/comp/linux/event_queues.html.

Honestly, your comment sounds very much like a regurgitated old and somewhat outdated Bryan Cantrill rant.


The link doesn't explain much. The only argument Linus has against kqueue is that it is overengineered. Then it goes on a long multi-message rant about how having multiple queues is insane.

Luckily, in the end epoll ended up supporting multiple queues at least.

I do believe that ET is good though.


What do you dislike about edge triggered mode?


The problems identified for aio on Linux are:

> (parts snipped for brevity)

> The biggest limitation is undoubtedly that it only supports async IO for O_DIRECT (or un-buffered) accesses. Due to the restrictions of O_DIRECT (cache bypassing and size/alignment restraints), this makes the native aio interface a no-go for most use cases.

> There are a number of ways async-io submission can end up blocking - if meta data is required to perform IO, the submission will block waiting for that. For storage devices, there are a fixed number of request slots available. If those slots are currently all in use, submission will block waiting for one to become available.

> The API isn't great. Each IO submission ends up needing to copy 64 + 8 bytes and each completion copies 32 bytes. That's 104 bytes of memory copy. IO always requires at least two system calls (submit + wait-for-completion), which in these post spectre/meltdown days is a serious slowdown.

The io_uring solution:

> With a shared ring buffer, we could eliminate the need to have shared locking between the application and the kernel, getting away with some clever use of memory ordering and barriers instead. There are two fundamental operations associated with an async interface: the act of submitting a request, and the event that is associated with the completion of said request. For submitting IO, the application is the producer and the kernel is the consumer. The opposite is true for completions - here the kernel produces completion events and the application consumes them. Hence, we need a pair of rings to provide an effective communication channel between an application and the kernel. That pair of rings is at the core of the new interface, io_uring. They are suitably named submission queue (SQ), and completion queue (CQ), and form the foundation of the new interface.

> The cqes are organized into an array, with the memory backing the array being visible and modifiable by both the kernel and the application. However, since the cqe's are produced by the kernel, only the kernel is actually modifying the cqe entries. The communication is managed by a ring buffer. Whenever a new event is posted by the kernel to the CQ ring, it updates the tail associated with it. When the application consumes an entry, it updates the head. Hence, if the tail is different than the head, the application knows that it has one or more events available for consumption. The ring counters themselves are free flowing 32-bit integers, and rely on natural wrapping when the number of completed events exceed the capacity of the ring.

> For the submission side, the roles are reversed. The application is the one updating the tail, and the kernel consumes entries (and updates) the head. One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them. Hence the submission side ring buffer is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there's some reasoning behind it. Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one.
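
As a simplified illustration of the CQ-ring discipline described above (the field names here are stand-ins; the real pointers come from mmap()ing the ring fd at the offsets returned in struct io_uring_params):

    #include <stdatomic.h>
    #include <stdint.h>

    struct io_uring_cqe {            /* mirrors <linux/io_uring.h> */
        uint64_t user_data;
        int32_t  res;
        uint32_t flags;
    };

    struct cq_ring_view {
        _Atomic uint32_t    *head;   /* written by the application */
        _Atomic uint32_t    *tail;   /* written by the kernel */
        uint32_t             mask;   /* ring_entries - 1 */
        struct io_uring_cqe *cqes;   /* shared array of completion entries */
    };

    static void drain_cq(struct cq_ring_view *cq,
                         void (*on_complete)(const struct io_uring_cqe *))
    {
        uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);
        /* Acquire: cqe contents must be read after the kernel's tail update. */
        uint32_t tail = atomic_load_explicit(cq->tail, memory_order_acquire);

        while (head != tail) {       /* free-flowing counters wrap naturally */
            on_complete(&cq->cqes[head & cq->mask]);
            head++;
        }

        /* Release: tell the kernel these slots may be reused. */
        atomic_store_explicit(cq->head, head, memory_order_release);
    }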

I've seen the ring-buffer design pattern employed here by io_uring in log4j via disruptor [0] and the queue structure in libfabric interface for Infiniband [1] for both command/control and data.

Happy to see Jens Axboe [2] committing his time to improving the story around aio.

[0] https://www.baeldung.com/lmax-disruptor-concurrency

[1] https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1...

[2] https://en.wikipedia.org/wiki/Jens_Axboe


I like the design. On paper, it's even more efficient than Windows registered I/O.

However, one thing, in 8.3

> This is done through a kernel thread, specific to that io_uring.

Does it mean there will be 100k native kernel threads for an app that has 100k open sockets? Won't this ruin scalability?


It's one thread per ring, not one thread per socket or FileDescriptor that dispatches events to the ring.


How many rings should I create on a server with 100GB RAM, 64 cores, that handles 100k connections?

Even if there's a definite answer to that, I still don't understand why threads are necessary. While the CPU waits for IO to complete, it doesn't poll anything. The CPU fires a DMA request at the hardware and forgets about it until it gets the response.


It depends on your application, so measure!

Ideally the CPU never waits for completed IO, it's busy building the next set of responses, or writing out the new state to the rest of the application. If you're ever "waiting", then you should probably be putting another/bigger nic in there or rethinking what your application actually needs to do:

10Gb ethernet is only about 10x slower than main memory, so 10 threads could be enough if you have a single memory fetch to service the response, and if you need two, then 20 threads. If you need more than six, then you should definitely rearchitect your application, since you're out of cores and time-slicing again.


> 10Gb ethernet is only about 10x slower than main memory

Server-class CPUs have much higher bandwidth nowadays:

> The total memory bandwidth available to a single CPU clocks in at 170 GB/s.

(i.e. ~136x slower than main memory)

https://www.anandtech.com/show/11183/amd-prepares-32-core-na...
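
For reference, the arithmetic behind that ratio (10 Gb/s = 1.25 GB/s):

    170 GB/s / 1.25 GB/s = 136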


The per-channel number[1] is much better (imo) for napkin spec'ing because it's what you'll get in a typical networked application that's already done and minimised stalls.

But you're raising a good point: if you can arrange your data set across the physical DIMMs, you can get another 10x, and maybe the contortions you need to do to get it can be worth it (if you're actually memory bound).

[1]: better link, since the one you posted is bust: https://developer.amd.com/wp-content/resources/56301_1.0.pdf


Would you put a new ethernet card into a machine where the network is saturated 5% of the time?

There's a huge space between 'never waiting' and 'always waiting'.


That section describes the "polling". You are right that with normal IO, there doesn't need to be polling, since the completion will be signaled by interrupt. However as described in the document polling can be faster for some use-cases. And for those there needs to be one thread which performs the polling for completions. That one can be in the Kernel.


Looked more closely, indeed you’re right.

That section discusses an optional mechanism for how the kernel gets notified about new IO requests submitted by the application. It's not for IO completion polling. For that use case, having a kernel thread per uring makes perfect sense.


How many async io APIs are we up to now?


Give me your mobile number and go do something else, I will send you a notification with the answer. If you run out of something else to do, then just wait.


Just two actually (io_submit and io_uring), unless we are also counting network IO where we also have select/poll/epoll.

Does FreeBSD have anything like io_uring?

As far as I know, it does not have similar functionality.

But I'm not sure it needs one. It has kqueue, and unlike epoll, kqueue is a general-purpose way to get events from the kernel. It's compatible not just with sockets, but also files and pipes, directory changes, processes, signals, timers, devices, and more. Not unlike IOCP on Windows.

On paper the new uring is faster, but kqueue is not particularly slow either. And it's been available for a couple of decades now, i.e. it's stable and tested really well. I'm happy Linux is finally getting fast general-purpose async IO, but I'll only benefit from it after a year or so, after kernel updates are out and after higher-level runtimes like golang and .net core are updated accordingly.
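
For anyone who hasn't used it, a small sketch of what "general purpose" means in practice (FreeBSD/macOS; error handling omitted): a socket read filter and a periodic timer registered in the same queue and waited on with a single kevent() call.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdio.h>

    void wait_loop(int sock_fd)
    {
        int kq = kqueue();

        struct kevent changes[2];
        EV_SET(&changes[0], sock_fd, EVFILT_READ,  EV_ADD, 0, 0,    NULL);
        EV_SET(&changes[1], 1,       EVFILT_TIMER, EV_ADD, 0, 1000, NULL); /* every 1000 ms */
        kevent(kq, changes, 2, NULL, 0, NULL);

        for (;;) {
            struct kevent ev;
            if (kevent(kq, NULL, 0, &ev, 1, NULL) <= 0)
                break;
            if (ev.filter == EVFILT_READ)
                printf("socket readable, %ld bytes pending\n", (long)ev.data);
            else if (ev.filter == EVFILT_TIMER)
                printf("timer fired\n");
        }
    }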


There's also libkqueue for Linux - I would imagine they will also adopt io_uring when it's stable and ubiquitous enough, bringing the two APIs closer together.

I find kqueue to be a great and well designed API, so that can only be a good thing IMHO.


How many times have we heard this story that there is a new async I/O interface for Linux that is better than the others?

How is this one different?


The linked PDF does a very good job answering that and thankfully the first couple paragraphs should give a good start to understanding how and what it solves.

The specific point to make is that while this can work with socket handles, it's primarily aimed to allow more efficient work with files (where AIO has failed to provide much benefit outside of batching operations with fewer syscalls).

io_uring is also one of the first post-Spectre/Meltdown syscall designs to be merged into the Linux kernel. Whether the ring-buffer style used in this API becomes a pattern will be interesting to see.


You tell me, how many?

Because to my knowledge, there's just one asynchronous IO interface in Linux, and it's Linux AIO as used via io_setup()/io_submit()/io_getevents()/io_destroy() with O_DIRECT.

There's glibc support for POSIX AIO, but that's not some kernel feature, it's just POSIX support and not especially great.

I'm inclined to suspect you're confusing asynchronous IO with the myriad file descriptor event multiplexing interfaces: select()/poll()/epoll_wait(), which are orthogonal to this.


Using ring buffers is how networking has worked for a while now. They're nice because they're extremely fast, due to being in mechanical sympathy with hardware.

http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-...

So I can't really speak to the API details, but the general concept seems like a good idea to me.


For values of "a while" probably exceeding forty years.


Ha. I originally wrote "a long time" but I wasn't actually sure.


forty years is probably a very conservative estimate!

Being usable for writing to disk is pretty huge. Solves so many problems for webservers and databases.


Literally the first thing addressed in the article


From the bibliography:

> In terms of benchmarks, I ran some numbers comparing io_uring to libaio and spdk. The tldr is that io_uring is pretty close to spdk, in some cases faster. Latencies over spdk are generally better. The areas where we are still missing a bit of performance all lie in the block layer, and I'll be working on that to close the gap some more.

From a status update in the article it looks like it's less than 2% slower than spdk now. Trading 2% throughput for lower latency sounds like a useful thing to me, as someone who gets judged by response times.



