
Lord of the io_uring: io_uring tutorial, examples and reference - shuss
https://unixism.net/loti/
======
geofft
One thing this writeup made me realize is, if I have a _misbehaving_ I/O
system (NFS or remote block device over a flaky network, dying SSD, etc.), in
the pre-io_uring world I'd probably see that via /proc/$pid/stack pretty
clearly - I'd see a stack with the read syscall, then the particular I/O
subsystem, then the physical implementation of that subsystem. Or if I looked
at /proc/$pid/syscall I'd see a read call on a certain fd, and I could look in
/proc/$pid/fd/ and see which fd it was and where it lived.

However, in the post-io_uring world, I think I won't see that, right? If I
understand right, I'll at most see a call to io_uring_enter, and maybe not
even that.

How do I tell what a stuck io_uring-using program is stuck on? Is there a way
I can see all the pending I/Os and what's going on with them?

How is this implemented internally - does it expand into one kernel thread per
I/O, or something? (I guess, if you had a silly filesystem which spent 5
seconds in TASK_UNINTERRUPTIBLE on each read, and you used io_uring to submit
100 reads from it, what actually happens?)

~~~
Matthias247
I think that's a very reasonable concern. However, it isn't really about
io_uring - it applies to all "async" solutions. Even today, if you are running
async IO in userspace (e.g. using epoll), it's not very obvious where
something went wrong, because no task is visibly blocked. If you attach a
debugger, you'll most likely see something blocked on epoll - but a call
stack into the problematic application code is nowhere in sight.

Even if you pause execution while inside the application code, there might
not be a great stack which contains all relevant data. It will only contain
the information since the last task resumption (e.g. through a callback).
Depending on your solution (C callbacks, C++ closures, C# or Kotlin
async/await, Rust async/await) the information will range from not very
helpful to somewhat understandable, but never on par with a synchronous call.

~~~
WGH_
> Even today if you are running async IO in userspace (e.g. using epoll), it's
> not very obvious where something went wrong, because no task is seemingly
> blocked.

It doesn't apply to file IO, which is never non-blocking and can't be made
async with epoll. Epoll always considers files ready for any IO. And if the
device is slow, the thread is blocked in the dreaded "D" state.

~~~
CodesInChaos
The fundamental problem is that readiness-based async IO and random access do
not mix well. You'd need a way to poll readiness for different positions in
the same file at the same time.

Completion based async (including io_uring on Linux or IO completion ports on
Windows) doesn't suffer from this problem.

------
tyingq
There are some benchmarks that show io_uring as a significant boost over aio:
[https://www.phoronix.com/scan.php?page=news_item&px=Linux-5....](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-IO-uring-Tests)

I see that nginx accepted a pull request to use it, mid last year:
[https://github.com/hakasenyang/openssl-patch/issues/21](https://github.com/hakasenyang/openssl-patch/issues/21)

Curious if it's also been adopted by other popular IO intensive software.

~~~
jandrewrogers
I have not adopted io_uring yet because it isn't clear that it will provide
useful performance improvements over linux aio in cases where the disk I/O
subsystem is already highly optimized. Where io_uring seems to show a benefit
relative to linux aio is with more naive software designs, which adds a lot of
value but is a somewhat different value proposition than the one that has been
expressed.

For software that is already capable of driving storage hardware at its
theoretical limit, the benefit is less immediate and offset by the requirement
of having a very recent Linux kernel.

~~~
shuss
For regular files, aio works asynchronously only if they are opened in
unbuffered mode (O_DIRECT). I think this is a huge limitation. io_uring, on
the other hand, can provide a uniform interface for all file descriptors,
whether they are sockets or regular files. This should be a decent win, IMO.

~~~
jandrewrogers
That was kind of my point. While all of this is true, these are not material
limitations for the implementation of high-performance storage engines. For
example, using unbuffered file descriptors is a standard design element of
databases, for performance reasons that remain true.

Being able to drive networking over io_uring would be a big advantage but my
understanding from people using it is that part is still a bit broken.

~~~
g8oz
The ScyllaDB developers wrote up their take here:
[https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi...](https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/)

~~~
jabl
Those benchmark results are pretty impressive. In particular, io_uring gets
the best performance both when the data is in the page cache and when
bypassing the cache.

------
eMSF
A comment from the cat example:

>/* For each block of the file we need to read, we allocate an iovec struct
which is indexed into the iovecs array. This array is passed in as part of the
submission. If you don't understand this, then you need to look up how the
readv() and writev() system calls work. */

I have to say, I don't really understand why the author chose to individually
allocate (up to millions of) single-kilobyte buffers for each file. Perhaps
there is a reason for it, but I think they should elaborate on the choice.
Anyway, I guess the first example is too simplified, which is why what follows
it is not built on top of it in any way, hence they feel disjointed.

The bigger problem here is that I don't know the author, or how talented they
are. Choices like that, or writing non-async-signal-safe signal handlers don't
help in estimating it, either. Is the rest of the advice sound?

~~~
shuss
The author here: all examples in the guide are aimed at throwing light on the
io_uring and liburing interfaces. They are not meant to be very useful or very
real-world examples. The idea with this example in particular is to show the
difference between how readv/writev work synchronously vs how they would be
"called" via io_uring. Maybe I should call out in the text that these programs
are tuned more towards explaining the io_uring interface. Thanks for the
feedback.

------
matheusmoreira
So awesome... The ring buffer is like a generic asynchronous system call
submission mechanism. The set of supported operations already covers a
sizeable subset of the available Linux system calls:

[https://github.com/torvalds/linux/blob/master/include/uapi/l...](https://github.com/torvalds/linux/blob/master/include/uapi/linux/io_uring.h)

It almost gained support for ioctl:

[https://lwn.net/Articles/810414/](https://lwn.net/Articles/810414/)

Wouldn't it be cool if it gained support for other types of system calls?
Something this awesome shouldn't be restricted to I/O...

~~~
diegocg
The author seems to be planning to expand it to be usable as a generic way of
doing asynchronous syscalls

------
ignoramous
Can anyone familiar with Infiniband's approach to exposing IO via rx/tx queues
[0] comment on whether it seems similar to io_uring's ring buffers [1]? How do
these contrast with each other?

[0]
[https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10...](https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10-0_release/element_manager/user_guide/appA.html#wp1007799)

[1]
[https://news.ycombinator.com/item?id=19846261](https://news.ycombinator.com/item?id=19846261)

~~~
DmitryOlshansky
I have very limited experience with Infiniband, but io_uring seems similar,
and a bit more flexible (especially recently, with more syscalls supported).

Also similar to but more general than RIO Sockets of Win8+:

[https://docs.microsoft.com/en-us/previous-versions/windows/i...](https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/hh997032\(v=ws.11\))

------
jmb001nyc
Question: how does one detect socket push back using io_uring? For example,
with libc write/writev on a non-blocking socket, a call would return fewer
bytes than requested and allow code to poll for write readiness before writing
more. This is quite useful for handling scenarios where there is an impedance
mismatch between processing speed and the ability to send data over a network,
i.e. processing needs to observe push back and handle it appropriately.
Apologies: I posted this question to twitter before I read the redirect here.

------
throw7
The site pushes really hard that you shouldn't use the low-level system calls
in your code and that you should (always?) be using a library (liburing).

What exactly is liburing bringing to the table that I shouldn't be using the
uring syscalls directly?

~~~
matheusmoreira
You absolutely can use system calls in your code. The kernel has an awesome
header that makes this easy and allows you to eliminate all dependencies:

[https://github.com/torvalds/linux/blob/master/tools/include/...](https://github.com/torvalds/linux/blob/master/tools/include/nolibc/nolibc.h)

This system call avoidance dogma exists because libraries generally have more
convenient interfaces and are therefore easier to use. They're not strictly
necessary though.

It should be noted that using certain system calls directly may cause problems
with the libraries you're using. For example, glibc needs to maintain complete
control over the threading model in order to implement thread-local storage.
If you issue a clone system call directly, the glibc threading model is broken
and even something simple like errno is likely to break.

In my opinion, libraries shouldn't contain thread-local or global variables in
the first place. Unfortunately, the C language is old and these problems will
never be fixed. It's possible to create better libraries in freestanding C or
even freestanding Rust but replacing what already exists is a lifetime of
work.

> What exactly is liburing bringing to the table that I shouldn't be using the
> uring syscalls directly?

It's easier to use compared to the kernel interface. For example, it handles
submission queue polling automatically without any extra code.

------
jra_samba
io_uring still has its wrinkles.

We are scrambling right now to fix a problem due to a change in the behavior
that io_uring exposes to user-space in later kernels.

Turns out that in earlier kernels (Ubuntu 19.04, 5.3.0-51-generic #44-Ubuntu
SMP) io_uring will not return short reads/writes (that's where you ask for
e.g. 8k, but there's only 4k in the buffer cache; the call doesn't signal as
complete and blocks until all 8k has been transferred). In later kernels (not
sure when the behavior changed, but the one shipped with Fedora 32 has the new
behavior) io_uring returns partial (short) reads to user space: you ask for 8k
but there's only 4k in the buffer cache, so the call signals complete with a
return of only 4k read, not the 8k you asked for.

Userspace code now has to cope with this where it didn't before. You could
argue (and kernel developers did :-) that this was always possible, so user
code needs to be aware of it. But it didn't used to do that :-). Change for
user space is _bad_, mkay :-).

~~~
jra_samba
It was really interesting how this was found.

A user started describing file corruption when copying to/from Windows with
the io_uring VFS module loaded.

Tests using the Linux kernel cifsfs client and the Samba libsmbclient
libraries/smbclient user-space transfer utility couldn't reproduce the
problem, neither could running Windows against Samba on Ubuntu 19.04.

What turned out to be happening was a combination of things. Firstly, the
kernel changed, so an SMB2_READ request against Samba with io_uring loaded was
_sometimes_ hitting a short read, where some of the file data was already in
the buffer cache, and io_uring now returned a short read to smbd.

We returned this to the client, as in the SMB2 protocol it isn't an error to
return a short read; the client is supposed to check read returns and then
re-issue another read request for any missing bytes. The Linux kernel cifsfs
client and Samba libsmbclient/smbclient did this correctly.

But it turned out that Windows10 clients and MacOSX Catalina (maybe earlier
versions of clients too, I don't have access to those) clients have a
_horrible_ bug, where they're not checking read returns when doing pipeline
reads.

When trying to read a 10GB file for example, they'll issue a series of 1MB
reads at 1MB boundaries, up to their SMB2 credit limit, without waiting for
replies. This is an excellent way to improve network file copy performance as
you fill the read pipe without waiting for reply latency - indeed both Linux
cifsfs and smbclient do exactly the same.

But if one of those reads returns a short value, Windows10 and MacOSX Catalina
_DON'T GO BACK AND RE-READ THE MISSING BYTES FROM THE SHORT READ REPLY_!!!!
This is catastrophic, and will corrupt any file read from the server (the
local client buffer cache fills the file contents, I'm assuming with zeros - I
haven't checked, but the files are corrupt as checked by SHA256 hashing
anyway).

That's how we discovered the behavior and ended up leading back to the
io_uring behavior change. And that's why I hate it when kernel interfaces
expose changes to user-space :-).

~~~
jstarks
> in the SMB2 protocol it isn't an error to return a short read, the client is
> supposed to check read returns and then re-issue another read request for
> any missing bytes

This is interesting and somewhat surprising, since Windows IO is internally
asynchronous and completion-based, and AFAIK file system drivers are not
allowed to return a short read except at EOF.

And actually, even on Linux file systems are not supposed to return short
reads, right? Even on signal? Since user apps don't expect it? (And thus it's
not surprising that io_uring's change broke user apps.)

So it wouldn't be surprising to learn that the Windows SMB server never
returns short reads, and thus it's interesting that the protocol would allow
it. Do you know what the purpose of this is?

~~~
jra_samba
Obviously the Windows SMB server never returns short reads, otherwise this bug
would never have made it out of Redmond or Cupertino.

On Linux, pread also never returns short reads against disk files if the bytes
are available, which is why no one noticed this client bug: our default IO
backend is a pthread pool that does pread/pwrite calls. It only happens when
someone tries our (flagged as experimental, thank god) vfs_io_uring backend.

Yeah, the protocol even has a field in the SMB2_READ request called
MinimumBytes, for which the server should fail the read if fewer than that
many bytes are available on return. Windows 10 clients set it to zero :-).
The MacOSX Catalina client sets it to 1. So yes, the clients are supposed to
be able to handle short reads.

~~~
jstarks
Out of curiosity, I took a look at how the MinimumBytes (actually
MinimumCount) field is used by the Windows SMB server. Interestingly, it fails
with STATUS_END_OF_FILE if the actual bytes read is less than MinimumCount,
which suggests to me that this is supposed to be a minimum on the (remaining)
file length, not on the number of bytes that the server is able to return at
the moment.

I can't find any history of MinimumCount being used in the RTM version of any
Windows SMB client, so without deeper archeology the reason this field was
introduced remains a mystery to me.

Regardless, I agree that the client should validate the returned byte count.
But (only having thought about this briefly), I do not think a client should
retry in this case--it seems to me if the client sees a short read, it can
assume that the read was short because the read reached EOF (which may have
changed since the file's length was queried).

~~~
jra_samba_org
Sorry to keep laboring the point :-), but the other reason I'm pretty sure
this is a client bug is that the client doesn't truncate the returned file at
the end of the short read, which is what you'd expect if it actually were
treating the short read as EOF.

If you copy a 100MB file and the server returns a short read somewhere in the
middle of the read stream, the file size on the client is still reported as
100MB, which means file corruption, as the data in the client copy isn't the
same as what was on the server.

That's how this ended up getting reported to us in the first place.

~~~
jstarks
Yes, that's a good point. I agree that there appears to be a client bug here.
From a quick glance, it appears that nothing is checking that the non-final
blocks in a pipelined read are returned from the server in full.

I don't necessarily agree that retry is the right behavior, though. Wouldn't
that result in an extra round trip in the actual EOF case? Again, not having
thought about this much, it seems a more efficient interpretation of the spec
is that truncated reads indicate EOF. In that case, a truncated read in the
middle of a pipelined operation either indicates that the file's EOF is moving
concurrently with the operation (in which case stopping at the initial
truncation would be valid) or that the lease has been violated.

Regardless, I work on SMB-related things only peripherally, so I do not
represent the SMB team's point of view on this. Please do follow up with them.

~~~
jra_samba
It's only an extra round trip in the case of an unexpected EOF. File size is
returned from SMB2_CREATE, so given the default of an RHW lease, either (a)
the lease can't be violated - and if it is, then all bets are off, as the
server let someone modify your leased file outside the terms of the lease - or
(b) you know the file size, so a short read where you overlap the actual EOF
is expected and you can plan for it.

A short read in the _middle_ of what you expect to be a continuous stream of
bytes should be treated as some sort of server IO exception (which it is), so
an extra round trip to fetch the missing bytes - returning 0 (meaning EOF and
something got truncated) or an error such as EIO (meaning you got a hardware
error) - isn't so onerous.

After all, this is a very exceptional case. Both Steve's Linux cifsfs client
and libsmbclient have been coded around these semantics (re-fetching missing
bytes to detect unexpected EOF or server error), and I'd argue this is correct
client behaviour.

As I said, given the number of clients out there that have this bug we're
going to have to fix it server-side anyway, but I'm surprised that this
expected behavior wasn't specified and tested as part of a regression suite.
It certainly is getting added to smbtorture.

------
beagle3
Is there any intention to optimize work done, rather than just the calling
interface?

E.g., running rsync on a hierarchy of 10M files usually requires 10M
synchronous stat calls. Using io_uring would make them asynchronous, but they
could potentially also be done more efficiently (e.g. convert file names to
inodes in blocks of 20k, and then stat those 20k inodes in a batch).

That would require e.g. the VFS layer to support batch operations, but
io_uring would actually allow that without a user-space interface change.

------
jcoffland
Maybe I just missed this, but can anyone tell me which kernel versions support
io_uring? I ran the following test program on 4.19.0, and it is not supported:

    
    
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/utsname.h>
        #include <liburing.h>
        #include <liburing/io_uring.h>
    
    
        static const char *op_strs[] = {
          "IORING_OP_NOP",
          "IORING_OP_READV",
          "IORING_OP_WRITEV",
          "IORING_OP_FSYNC",
          "IORING_OP_READ_FIXED",
          "IORING_OP_WRITE_FIXED",
          "IORING_OP_POLL_ADD",
          "IORING_OP_POLL_REMOVE",
          "IORING_OP_SYNC_FILE_RANGE",
          "IORING_OP_SENDMSG",
          "IORING_OP_RECVMSG",
          "IORING_OP_TIMEOUT",
          "IORING_OP_TIMEOUT_REMOVE",
          "IORING_OP_ACCEPT",
          "IORING_OP_ASYNC_CANCEL",
          "IORING_OP_LINK_TIMEOUT",
          "IORING_OP_CONNECT",
          "IORING_OP_FALLOCATE",
          "IORING_OP_OPENAT",
          "IORING_OP_CLOSE",
          "IORING_OP_FILES_UPDATE",
          "IORING_OP_STATX",
          "IORING_OP_READ",
          "IORING_OP_WRITE",
          "IORING_OP_FADVISE",
          "IORING_OP_MADVISE",
          "IORING_OP_SEND",
          "IORING_OP_RECV",
          "IORING_OP_OPENAT2",
          "IORING_OP_EPOLL_CTL",
          "IORING_OP_SPLICE",
          "IORING_OP_PROVIDE_BUFFERS",
          "IORING_OP_REMOVE_BUFFERS",
        };
    
    
        int main() {
          struct utsname u;
          uname(&u);
    
          struct io_uring_probe *probe = io_uring_get_probe();
          if (!probe) {
            printf("Kernel %s does not support io_uring.\n", u.release);
            return 0;
          }
    
          printf("List of kernel %s's supported io_uring operations:\n", u.release);
    
          for (int i = 0; i < IORING_OP_LAST; i++ ) {
            const char *answer = io_uring_opcode_supported(probe, i) ? "yes" : "no";
            printf("%s: %s\n", op_strs[i], answer);
          }
    
          free(probe);
          return 0;
        }

~~~
cesarb
If you have a clone of the Linux kernel source tree, you just have to look at
the history of the include/uapi/linux/io_uring.h file. From a quick look here:
everything up to IORING_OP_POLL_REMOVE came with Linux 5.1;
IORING_OP_SYNC_FILE_RANGE was added in Linux 5.2; IORING_OP_SENDMSG and
IORING_OP_RECVMSG came with Linux 5.3; IORING_OP_TIMEOUT with Linux 5.4;
everything up to IORING_OP_CONNECT is in Linux 5.5; everything up to
IORING_OP_EPOLL_CTL is in Linux 5.6; and the last three are going to be in
Linux 5.7.

~~~
jcoffland
This article concurs.
[https://lwn.net/Articles/810414/](https://lwn.net/Articles/810414/) io_uring
was first added to the mainline Linux kernel in 5.1.

------
rwmj
By coincidence I asked a few questions on the mailing list about io_uring this
morning: [https://lore.kernel.org/io-uring/20200510080034.GI3888@redha...](https://lore.kernel.org/io-uring/20200510080034.GI3888@redhat.com/T/#u)

------
qubex
Unfortunately I misread the title as “Lord of the Urine” and... was concerned.

