Getting Hands-on with io_uring using Go (mattermost.com)
140 points by rbanffy 57 days ago | 22 comments

Another interesting post on io_uring: https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi...

disclaimer - I used to work at ScyllaDB

Discussion of supporting this in the Go runtime is here:


TL;DR there's no ETA for it.

The JVM project Loom will seamlessly enable io_uring and restartable sequences for Async IO

Just from taking a look, it seems like there are two use cases where one would really want to use this:

1. Programs that do a lot of IO to and from regular files, and that don't want to bother with a thread pool. For this reason I would expect to see it land in the golang runtime, and in other event loop implementations like libuv.

2. Legitimately IO bound programs, which can use SQ polling. Other users of aio fall in this category, as well as qemu.

For everything else, it looks like epoll is still mostly an equivalent choice to io_uring? Has anyone got any benchmarks for using io_uring in a typical network daemon, i.e. something that would generally be bound by socket I/O?

Lots of things that don't seem like heavy IO operations can still incur high wall-time costs, e.g. stat-ing all files in a directory tree. Since each stat is a syscall, executing all of them sequentially is costly. io_uring can be used for batching these kinds of things even when your code isn't built around asynchronous execution.

And with the 5.7 changes you can do polling + buffer selection + reading for any number of sockets with a single syscall[0].

[0] https://lwn.net/Articles/815491/

>stating all files in a directory tree

That does seem useful. But wouldn't you want to put this into a library that falls back to a thread pool approach on older kernels? And in that case, it seems like the application doesn't particularly care what the underlying implementation is? Since it's not built around asynchronous execution it just calls some function that blocks until the CQ is complete. This appears to be what the golang runtime will have to do.

>And with the 5.7 changes you can do polling + buffer selection + reading for any number of sockets with a single syscall

Thank you, this is very close to what I was looking for. (It hasn't made its way into the manpages yet.)

> That does seem useful. But [...]

That was mostly meant as an example of the more general case where using io_uring can reduce the overhead of any kind of IO operation, even when IO is not the primary focus of your application.

The point is that yes, io_uring is great for event-based, asynchronous libraries. But even traditional synchronous code can make significant gains by switching to it. As the article hints with its package name: io_uring is the one ring to bind them all.

> This appears to be what the golang runtime will have to do.

The Go standard library can tie io_uring into its goroutine scheduler. When doing IO, it suspends the green thread, submits the work to the ring, and polls the ring's completion queue when looking for tasks that need to be woken up.
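To make the suspend / submit / wake cycle concrete, here is a toy model of it. This is emphatically not how the Go runtime is implemented and it does not touch io_uring at all: channels stand in for the shared-memory queues, a goroutine stands in for the kernel, and every name (`toyRing`, `sqe`, `cqe`, `do`) is made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// sqe and cqe are toy stand-ins for submission and completion queue entries.
type sqe struct {
	userData uint64
	work     func() int // stands in for the I/O the kernel would perform
}

type cqe struct {
	userData uint64
	res      int
}

// toyRing models the two queues with channels. A real ring is shared memory.
type toyRing struct {
	sq chan sqe
	cq chan cqe

	mu     sync.Mutex
	nextID uint64
	parked map[uint64]chan int // user data -> wake-up channel of a parked goroutine
}

func newToyRing() *toyRing {
	r := &toyRing{
		sq:     make(chan sqe, 64),
		cq:     make(chan cqe, 64),
		parked: make(map[uint64]chan int),
	}
	go r.kernel()
	go r.reap()
	return r
}

// kernel plays the kernel's role: consume submissions, post completions.
func (r *toyRing) kernel() {
	for e := range r.sq {
		r.cq <- cqe{userData: e.userData, res: e.work()}
	}
	close(r.cq)
}

// reap plays the scheduler's role: drain completions and wake the goroutine
// whose user data matches each one.
func (r *toyRing) reap() {
	for c := range r.cq {
		r.mu.Lock()
		wake := r.parked[c.userData]
		delete(r.parked, c.userData)
		r.mu.Unlock()
		wake <- c.res
	}
}

// do submits work and parks the calling goroutine until its completion arrives.
func (r *toyRing) do(work func() int) int {
	wake := make(chan int, 1)
	r.mu.Lock()
	r.nextID++
	id := r.nextID
	r.parked[id] = wake // register before submitting so reap always finds it
	r.mu.Unlock()

	r.sq <- sqe{userData: id, work: work}
	return <-wake // parked here; other goroutines keep running meanwhile
}

func main() {
	r := newToyRing()
	var wg sync.WaitGroup
	results := make([]int, 4)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.do(func() int { return i * i })
		}(i)
	}
	wg.Wait()
	fmt.Println(results) // prints [0 1 4 9]
}
```

The part that carries over to the real thing is the user-data tag: each submission carries an identifier, and the completion hands the same identifier back so the scheduler knows which task to wake.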

Pretty much everything touches the disk at some point. A lot of those things are asynchronous daemons these days. They currently need a complex thread pool to handle disk I/O - io_uring means they can run disk I/O on the same thread, and all they have to do is some buffer management. It’s a much simpler and cleaner programming model.

Source: recently developed a web server that uses io_uring.

Can you share a link to that web server?

Right now, unfortunately not, but I'll be releasing it open-source within the month. Look out for a Show HN post on the subject of live video streaming.

The way I see it, io_uring has the potential to basically become the only way that high-level programming languages and runtimes talk to the kernel for I/O. If you're working with I/O through an abstraction, I don't see any reason why your implementation wouldn't use io_uring, and lots of reasons why it should.

> Has anyone got any benchmarks for using io_uring in a typical network daemon, i.e. something that would generally be bound by socket I/O?

Author here. Some folks have been writing echo server implementations in io_uring and doing some benchmarks: https://github.com/CarterLi/io_uring-echo-server.

May not be your typical network daemon, but you can still look at the relative gains.

Generally speaking, io_uring is just a general-purpose programming model to interface with the kernel. Socket I/O is just one part of the story. But when combined with other things, it becomes much more powerful.

Any asynchronous event-loop program that touches files wants something like this. Unix file IO is classically synchronous; the hack around this was to run a userspace threadpool. It's unpleasant. The Unix "AIO" interfaces all kinda suck.

Unfortunately you're still stuck with having to run a thread pool if you want to do something that has no async version, such as reading a directory. I wish it went further!

It continues to expand; it'll get there.

If it avoids context switches and memory copying, or even reduces them, it could be a big win for high throughput networking stuff.

Great article! I now finally understand what io_uring is about. IO with no syscalls - that is clever.

I've been confused by the name for ages. I can now parse it as IO Userspace RING, rather than some sort of typo!

Looks like it only supports files right now. Could io_uring work for network FDs? If so, can it reach performance close to kernel bypass (e.g. DPDK)?

Yes, io_uring supports network I/O. Kernel 5.5 added support for accept, but you have to wait till 5.7 before you can string together an accept followed by a read(v). I still haven't seen much in the way of hard numbers yet, but theoretically it should be darn quick. If you're using preregistered buffers there's no copying needed, and if you ask the kernel nicely it will spin up a thread to monitor the submission queue so you don't even need syscalls after setup.

Author here:

To add to what couchand said, there are already comparisons of io_uring with SPDK, and they are pretty close. So I would expect roughly the same result for DPDK too.


> Fortunately, there’s an age-old solution to this problem - ring buffers. A ring buffer allows efficient synchronization between producers and consumers with no locking at all.

I don't think this is correct: ring buffers still require a mutex to prevent the producer from updating the write pointer while the reader checks the write pointer.



As long as you have a reasonably modern CPU, you can use one of the various Compare-And-Swap (CAS) instructions to implement properly lock-free FIFO queues. There are a few variations, but a circular array is one of the possibilities. We can probably discuss for quite some time whether it's entirely lock-free when the lock is implemented in hardware, but I think we can agree it's not a mutex :)

The people over at enqueuezero have a nice article about lock free queues at https://enqueuezero.com/lock-free-queues.html.
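For the simplest case, one producer and one consumer, you don't even need CAS: each index has exactly one writer, so atomic loads and stores are enough. As far as I understand, that is also the arrangement io_uring's queues rely on, with the application on one side and the kernel on the other. The sketch below is a generic single-producer/single-consumer ring in Go, not the kernel's code; `spscRing` and its methods are names made up for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// spscRing is a fixed-size single-producer/single-consumer queue.
// Only the producer writes tail; only the consumer writes head.
type spscRing struct {
	buf  []int
	mask uint32
	head atomic.Uint32 // next slot to read, advanced only by the consumer
	tail atomic.Uint32 // next slot to write, advanced only by the producer
}

// newSPSCRing requires a power-of-two size so indices can wrap with a mask.
func newSPSCRing(size uint32) *spscRing {
	if size == 0 || size&(size-1) != 0 {
		panic("size must be a power of two")
	}
	return &spscRing{buf: make([]int, size), mask: size - 1}
}

// push is called by the producer only. It reports false when the ring is full.
func (r *spscRing) push(v int) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == uint32(len(r.buf)) {
		return false
	}
	r.buf[tail&r.mask] = v
	r.tail.Store(tail + 1) // publish the slot only after it has been written
	return true
}

// pop is called by the consumer only. It reports false when the ring is empty.
func (r *spscRing) pop() (int, bool) {
	head := r.head.Load()
	if head == r.tail.Load() {
		return 0, false
	}
	v := r.buf[head&r.mask]
	r.head.Store(head + 1) // release the slot only after it has been read
	return v, true
}

func main() {
	r := newSPSCRing(8)
	const n = 1000
	done := make(chan int)

	// Consumer: sum up n values, yielding whenever the ring is empty.
	go func() {
		sum := 0
		for got := 0; got < n; {
			if v, ok := r.pop(); ok {
				sum += v
				got++
			} else {
				runtime.Gosched()
			}
		}
		done <- sum
	}()

	// Producer: push 1..n, yielding whenever the ring is full.
	for i := 1; i <= n; {
		if r.push(i) {
			i++
		} else {
			runtime.Gosched()
		}
	}
	fmt.Println(<-done) // prints 500500
}
```

With multiple producers or multiple consumers this scheme is no longer safe, and that is where the CAS-based designs come in.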
