
Efficient IO with io_uring [pdf] - diegocg
http://kernel.dk/io_uring.pdf
======
zbjornson
I made a prototype of a replacement for Node.js' fs.read/write using io_uring
in hopes of it ultimately replacing libuv's current threadpool implementation.
Definitely promising! Many props to Jens for his hard work on this and for his
helpful discussions on the API.

Benchmarks at [0] and prototype repo at [1]

[0] https://github.com/libuv/libuv/issues/1947#issuecomment-485230126

[1] https://github.com/zbjornson/node-io-uring

~~~
coder543
io_uring sounds promising.

It'll be interesting to see how long it takes for Tokio to add support for it
in the Rust community, and how long it takes for Go to transparently support
this instead of a pool of file I/O workers.

~~~
Matthias247
It will be interesting to see whether Rust's async ecosystem gets support for
it. However, there is a bit of a challenge in the form of an impedance
mismatch: Rust Futures, and thereby also async functions, require that an
operation is cancellable at any point in time (by dropping the respective
Future). IO-completion-based systems (like io_uring and IOCP) typically don't
support this. If the operation were cancelled in Rust userspace while the
kernel proceeds with it, the kernel would act on invalid memory. There are
some workarounds for this, like wrapping the APIs in ones that only use owned
buffers that are passed for the whole duration of the operation, and
reference-counting the IO objects (file descriptors). But the behavior is
pretty tricky. [1] describes the problem.

[1] https://github.com/rust-lang-nursery/futures-rs/issues/1278

~~~
coder543
That's really interesting. I think some kind of reference counting and a
shadow event loop is going to be necessary, if that's the case, so that I/O
futures can always be driven to completion even if the developer-facing one is
dropped.

~~~
Matthias247
Yes, reference-counting of outstanding operations and the used buffers is a
suitable workaround.

It comes with some downsides, though: the IO operations then only accept
certain specialized buffer types and nothing generic like plain byte slices
anymore, which might make those things less interoperable. The other issue is
that while refcounting leaves at least the memory in a non-corrupted state,
the IO source will be in a somewhat undefined state after the cancellation.
E.g. if a write operation had been issued on the socket, the API user cannot
make any assumptions about whether any bytes were written, or what impact a
subsequent write might have. The only reasonable thing to do is to close the
IO source (where the close might also need to wait until the still-pending
background task has completed).

~~~
coder543
If the future has been dropped, I agree that the source/sink should be closed
as soon as safely possible. Usually the resource has to be moved into a future
for it to be used, and thus it's logical that no one could continue using it
after the future is dropped.

------
kazinator
I did something very similar for a logging system, more than ten years ago.

Circular buffers were shared between user space and kernel. (Allocated by user
space, mapped into kernel with get_user_pages.)

User space threads just added messages to their circular buffer by copying in
data, hitting a barrier and updating a head pointer.

Like io_uring, I avoided too many system calls by having a kernel thread poll
for changes. User space would only make a system call when the ring would near
a "high water mark", like 85% full or whatever.
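
For illustration, the producer side described here might look roughly like the
sketch below. All names, sizes, and layout are hypothetical (this is not the
original code); the key idea is the release store that publishes the data
before the new head index becomes visible to the kernel-side consumer:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4096  /* power of two, so masking implements the wrap */

struct log_ring {
    uint8_t          buf[RING_SIZE];
    _Atomic uint32_t head;  /* advanced by the producer (user space) */
    _Atomic uint32_t tail;  /* advanced by the consumer (kernel thread) */
};

/* Append one message; returns -1 if the ring is too full. */
int ring_put(struct log_ring *r, const void *msg, uint32_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    /* counters free-run; (head - tail) is the number of bytes in flight */
    if (RING_SIZE - (head - tail) < len)
        return -1;

    for (uint32_t i = 0; i < len; i++)  /* copy in, handling wrap-around */
        r->buf[(head + i) & (RING_SIZE - 1)] = ((const uint8_t *)msg)[i];

    /* "hitting a barrier and updating a head pointer": the release store
     * guarantees the copied bytes are visible before the new index is */
    atomic_store_explicit(&r->head, head + len, memory_order_release);
    return 0;
}
```

A real implementation would additionally make a system call to kick the kernel
when `head - tail` crosses the high-water mark mentioned above.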

The circular buffers could outlive the process which created them and attached
them. A process could crash and go away; the kernel thread would still process
the last messages written right up to the crash.

Not everyone around was keen on the idea. "What do you mean, just stick it
into a buffer and the kernel will pick it up? Polling? Bah."

~~~
strictfp
Interesting implementation. How did the buffers survive a process crash? Is
that a side effect of get_user_pages, or did you have a userspace daemon?

~~~
kazinator
It was a side effect of get_user_pages, which obtains durable references to
pages that have to be explicitly dropped. These were held by the logging
module independently of the processes interacting with it.

(The kernel thread was actually provided by a system call from user space. So
there was a daemon as a user space program, but that just called into the
kernel, where it then sat in a loop. This is nice: you have something in user
space that you can kill with SIGTERM or whatever and restart by running some
/usr/bin/executable.)

If the daemon terminated, that didn't destroy any of the buffers; they would
just sit there getting backlogged. New ones could be created, too (via ioctl
calls on a character device in /dev).

One aspect was that get_user_pages doesn't give you a linear block of memory.
I wrote the grotty C code to dequeue messages from the non-linearly-mapped
circular buffer. The message headers could straddle page boundaries, so I
copied those out into temporaries before working with them. I think there was
some attempt to avoid making unnecessary copies of the bulk data.

~~~
sebcat
> One aspect was that get_user_pages doesn't give you a linear block of
> memory. I wrote the grotty C code to dequeue messages from the non-linearly-
> mapped circular buffer. The message headers could straddle page boundaries,
> so I copied those out into temporaries before working with them. I think
> there was some attempt to avoid making unnecessary copies of the bulk data.

Open question: would it have helped to have a CMA[1] region managed by a
kernel module and mapping that memory/regions of that memory to user-space
using remap_pfn_range[2] instead?

[1] CMA documentation: https://lwn.net/Articles/396707/

[2] https://www.kernel.org/doc/htmldocs/kernel-api/API-remap-pfn-range.html

~~~
kazinator
That jogs my memory; in fact, aha, I used get_user_pages in an early version,
and switched to remap_pfn_range; then I nuked all that complicated non-linear
access code. The ioctl for attaching the buffer then allocated the memory, and
its interface changed to passing down a size and receiving a pointer.

------
PyroLagus
Interestingly, Genode actually also uses two rings for async bulk transfer RPC
(packet streams)[0], only the rings are called submit queue and
acknowledgement queue rather than submission queue and completion queue. And
since Genode applications do I/O by communicating with a service implementing
the block session interface[1], which uses the aforementioned packet streams,
Genode essentially uses two ring buffers for async I/O just like Linux does
(well, I guess it's more the other way around).

I can definitely recommend reading through the first half of Genode
Foundations (the second half is just an API reference). It's really
interesting and informative, and it's a surprisingly easy read as well.
However, I'd wait till the 19.05 edition gets released, since that should be
less than a month's wait if they manage to release it on time.

[0] https://genode.org/documentation/genode-foundations-18-05.pdf (page 92)

[1] (pages 94 & 116)

~~~
salotz
Was about to say the same thing! +1 for reading the Genode manual. It is the
best resource I have ever found for understanding operating systems, security,
and architecture. The discussion of multiplexers and report-ROM modules is
pertinent here as well (pg 133).

------
aseipp
For a bit of advertisement and any interested parties: I packaged liburing in
the NixOS unstable branch a week or two ago (Jens hasn't done a stable release
just yet, however), and I also pushed the 5.1 kernel release into unstable
last night. Hurrah for a new, efficient I/O interface finally appearing!

------
dmoreno
As explained by LWN:
[https://lwn.net/Articles/776703/](https://lwn.net/Articles/776703/)

------
waynesonfire
Can anyone comment on how this interface compares to state-of-the-art async
i/o in other *unix o/s? Is Linux playing catch up here or innovating?

~~~
asveikau
I am not up to date on this either, but as far as I know, the old status
quo was this:

For non-blocking sockets and the like you have what grew out of the readiness
model from select and poll. Today on Linux that's epoll.

This model never worked for files on disk. There is the POSIX aio API for
that. On Linux, glibc implements that and it's always been based on
synchronous I/O in thread pools because the kernel support isn't good. And I
guess there's been Linux specific things like io_submit(2), which as the
article states only worked with O_DIRECT and sometimes fell back to
synchronous I/O.

It says the new mechanism works with both files and sockets, but I imagine
it's still reasonable to use epoll for sockets, and only use this new stuff
for files on disk. One good thing about the readiness model for sockets is you
don't need to allocate memory upfront for every potential read the remote host
might trigger.

~~~
aarongolliver
Can you (or someone) explain _why_ the readiness model doesn't translate to
files on disk? Is it that there's a different set of operations that needs to
be supported? (You can't seek a socket, I suppose.)

~~~
asveikau
The model for aio is "here is a buffer, let me know when you've done the
operation". Your request can sit in a queue somewhere or whatever for as long
as it takes.

The model for the various FD polling mechanisms is "wake me up when read(2)
won't return EAGAIN when I call it later". That is very natural for sockets,
because what read(2) is doing is draining a kernel side receive buffer, so the
question is "wake me up when the buffer isn't empty". If the FD were a file on
disk, knowing ahead of time that read(2) isn't going to return quickly is kind
of a clumsy task.
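
As a rough sketch of that readiness model (not from the article; error
handling is trimmed, and `readiness_loop` plus the non-blocking `fd` set up
elsewhere are assumptions for illustration):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Drain a readable fd using the readiness model: block until the kernel
 * says read() won't return EAGAIN, then read. Returns total bytes read. */
ssize_t readiness_loop(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    ssize_t total = 0;
    for (;;) {
        struct epoll_event out;
        /* "wake me up when read(2) won't return EAGAIN" */
        if (epoll_wait(epfd, &out, 1, -1) < 1)
            break;
        char buf[4096];           /* buffer exists only once data is ready */
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)
            break;                /* EOF or error; tear down */
        total += n;               /* ... process n bytes ... */
    }
    close(epfd);
    return total;
}
```

Note that the buffer is only needed once data has actually arrived, which is
the upside of the readiness model mentioned upthread: nothing has to be
allocated per pending operation. Completion-style APIs instead hand the buffer
to the kernel at submission time.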

~~~
bogomipz
>"If the FD were a file on disk, knowing ahead of time that read(2) isn't
going to return quickly is kind of a clumsy task."

But isn't this exactly what happens? A process running in kernel mode calls
schedule() and sleeps waiting for I/O. The reason for calling schedule() is
that it already does know read isn't going to return quickly. But maybe I'm
completely misunderstanding your explanation about why current AIO on Linux is
clunky?

~~~
asveikau
The key observation is this _ahead of time_ thing. If you've reached the point
of invoking the scheduler you have already done a bunch of work to decide
that, based on a request that has already started. On the other hand if
epoll_wait returned and said the fd was ready, but the read hasn't yet been
issued, who is to say the conditions will be the same later? Keep in mind also
that other processes can do things that evict caches, or make other requests
to the same disk or controller.

For all of these cases I suppose you could make the I/O fail with EAGAIN, but
this seems like a lot of potential for spurious wakeups.

~~~
bogomipz
Thanks for the clarification, that makes good sense. Cheers.

------
ignoramous
The problems identified for aio on Linux are:

> (parts snipped for brevity)

> _The biggest limitation is undoubtedly that it only supports async IO for
> O_DIRECT (or un-buffered) accesses. Due to the restrictions of O_DIRECT
> (cache bypassing and size /alignment restraints), this makes the native aio
> interface a no-go for most use cases._

> _There are a number of ways async-io submission can end up blocking - if
> meta data is required to perform IO, the submission will block waiting for
> that. For storage devices, there are a fixed number of request slots
> available. If those slots are currently all in use, submission will block
> waiting for one to become available._

> _The API isn't great. Each IO submission ends up needing to copy 64 + 8
> bytes and each completion copies 32 bytes. That's 104 bytes of memory copy.
> IO always requires at least two system calls (submit + wait-for-completion),
> which in these post spectre/meltdown days is a serious slowdown._

The io_uring solution:

> _With a shared ring buffer, we could eliminate the need to have shared
> locking between the application and the kernel, getting away with some
> clever use of memory ordering and barriers instead. There are two
> fundamental operations associated with an async interface: the act of
> submitting a request, and the event that is associated with the completion
> of said request. For submitting IO, the application is the producer and the
> kernel is the consumer. The opposite is true for completions - here the
> kernel produces completion events and the application consumes them. Hence,
> we need a pair of rings to provide an effective communication channel
> between an application and the kernel. That pair of rings is at the core of
> the new interface, io_uring. They are suitably named submission queue (SQ),
> and completion queue (CQ), and form the foundation of the new interface._

> _The cqes are organized into an array, with the memory backing the array
> being visible and modifiable by both the kernel and the application.
> However, since the cqes are produced by the kernel, only the kernel is
> actually modifying the cqe entries. The communication is managed by a ring
> buffer. Whenever a new event is posted by the kernel to the CQ ring, it
> updates the tail associated with it. When the application consumes an entry,
> it updates the head. Hence, if the tail is different than the head, the
> application knows that it has one or more events available for consumption.
> The ring counters themselves are free flowing 32-bit integers, and rely on
> natural wrapping when the number of completed events exceed the capacity of
> the ring._

> _For the submission side, the roles are reversed. The application is the one
> updating the tail, and the kernel consumes entries (and updates) the head.
> One important difference is that while the CQ ring is directly indexing the
> shared array of cqes, the submission side has an indirection array between
> them. Hence the submission side ring buffer is an index into this array,
> which in turn contains the index into the sqes. This might initially seem
> odd and confusing, but there's some reasoning behind it. Some applications
> may embed request units inside internal data structures, and this allows
> them the flexibility to do so while retaining the ability to submit multiple
> sqes in one._
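
For a rough idea of the CQ-ring consumption being described, a sketch in C
(field names here are modeled on the description, not copied from the real
io_uring structures):

```c
#include <stdatomic.h>
#include <stdint.h>

struct cqe { uint64_t user_data; int32_t res; uint32_t flags; };

struct cq_ring {
    _Atomic uint32_t head;       /* advanced by the application */
    _Atomic uint32_t tail;       /* advanced by the kernel */
    uint32_t         ring_mask;  /* entries - 1; entries is a power of two */
    struct cqe      *cqes;       /* shared array, indexed directly */
};

/* Consume every available completion; returns how many were handled. */
unsigned drain_cq(struct cq_ring *cq, void (*handle)(struct cqe *))
{
    uint32_t head = atomic_load_explicit(&cq->head, memory_order_relaxed);
    /* acquire pairs with the kernel's release store of the tail, so the
     * cqe contents are visible before the index that publishes them */
    uint32_t tail = atomic_load_explicit(&cq->tail, memory_order_acquire);
    unsigned seen = 0;

    /* the counters are free-flowing 32-bit integers; (tail - head) is the
     * number of pending events, and masking maps them into the array */
    while (head != tail) {
        handle(&cq->cqes[head & cq->ring_mask]);
        head++;
        seen++;
    }
    atomic_store_explicit(&cq->head, head, memory_order_release);
    return seen;
}
```

The submission side is the mirror image, plus the indirection array the quote
describes: the SQ ring holds indices into the sqe array rather than the sqes
themselves.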

I've seen the ring-buffer design pattern employed here by io_uring in log4j
via the LMAX Disruptor [0], and in the queue structures of the libfabric
interface for InfiniBand [1], for both command/control and data.

Happy to see Jens Axboe [2] committing his time to improving the story around
aio.

[0] https://www.baeldung.com/lmax-disruptor-concurrency

[1] https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%e2%80%93-introduction-to-rdma/

[2] https://en.wikipedia.org/wiki/Jens_Axboe

------
Const-me
I like the design. On paper, it's even more efficient than Windows registered
I/O.

However, one thing, in 8.3

> This is done through a kernel thread, speciﬁc to that io_uring.

Does it mean there will be 100k native kernel threads for an app that has 100k
open sockets? Won't this ruin scalability?

~~~
Matthias247
It's one thread per ring that dispatches events, not one thread per socket or
file descriptor.

~~~
Const-me
How many rings should I create on a server with 100GB RAM, 64 cores, that
handles 100k connections?

Even if there's a definite answer to that, I still don't understand why
threads are necessary. While the CPU waits for completed IO, it doesn't poll
anything: it fires a DMA request at the hardware and forgets about it until
the response arrives.

~~~
geocar
It depends on your application, so measure!

Ideally the CPU never waits for completed IO, it's busy building the next set
of responses, or writing out the new state to the rest of the application. If
you're ever "waiting", then you should probably be putting another/bigger nic
in there or rethinking what your application actually needs to do:

10Gb ethernet is only about 10x slower than main memory, so 10 threads could
be enough if you have a single memory fetch to service the response, and if
you need two, then 20 threads. If you need more than six, then you should
definitely rearchitect your application, since you're out of cores and time-
slicing again.

~~~
cafxx
> 10Gb ethernet is only about 10x slower than main memory

Server-class CPUs have much higher bandwidth nowadays:

> The total memory bandwidth available to a single CPU clocks in at 170 GB/s.

(i.e. ~136x slower than main memory)

https://www.anandtech.com/show/11183/amd-prepares-32-core-naples-cpus-for-1p-and-2p-servers-coming-in-q2

~~~
geocar
The per-channel number [1] is much better (imo) for napkin spec'ing, because
it's what you'll get in a typical networked application that has already
minimised stalls.

But you're raising a good point: if you can arrange your data set across the
physical DIMMs, you can get another 10x, and maybe the contortions you need to
do to get it are worth it (if you're actually memory bound).

[1]: better link, since the one you posted is bust:
https://developer.amd.com/wp-content/resources/56301_1.0.pdf

------
gruez
How many async io APIs are we up to now?

~~~
kazinator
Give me your mobile number and go do something else, I will send you a
notification with the answer. If you run out of something else to do, then
just wait.

------
jorangreef
Does FreeBSD have anything like io_uring?

~~~
Const-me
As far as I know, it does not have similar functionality.

But I’m not sure it needs one. It has kqueue, and unlike epoll, kqueue is a
general-purpose way to get events from the kernel. It’s compatible not just
with sockets, but also with files and pipes, directory changes, processes,
signals, timers, devices, and more. Not unlike IOCP on Windows.

On paper the new uring is faster, but kqueue is not particularly slow either.
And it’s been available for a couple of decades now, i.e. it’s stable and
really well tested. I’m happy Linux is finally getting fast general-purpose
async IO, but I’ll only benefit from it after a year or so, once kernel
updates are out and higher-level runtimes like golang and .net core are
updated accordingly.

~~~
machinecoffee
There's also libkqueue for Linux - I would imagine they will also adopt
io_uring when it's stable and ubiquitous enough, bringing the two APIs closer
together.

I find kqueue to be a great and well designed API, so that can only be a good
thing IMHO.

------
PaulHoule
How many times have we heard this story that there is a new async I/O
interface for Linux that is better than the others?

How is this one different?

~~~
xxpor
Using ring buffers is how networking has worked for a while now. They're nice
because they're extremely fast, due to being in mechanical sympathy with
hardware.

http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html

So I can't really speak to the API details, but the general concept seems like
a good idea to me.

~~~
kazinator
For values of "a while" probably exceeding forty years.

~~~
xxpor
Ha. I originally wrote "a long time" but I wasn't actually sure.

