
Io_submit and Linux AIO – An epoll alternative - majke
https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/#
======
quotemstr
The lack of clear vision in Linux system call design is a source of continual
frustration. I feel like any time I or anyone else proposes generic ways to
solve certain longstanding problems, the response is a hostile sort of "no"
based on, ostensibly, lack of demonstrated use cases for genericity, but in
actuality, I think, simple change aversion. The result is an accumulation of a
pile of ad-hoc kludges and crud that do one thing or not the other (e.g., the
batch-but-not-async and async-but-not-batch distinction the article mentions).

There needs to be more openness to providing _general_ interfaces that solve
whole classes of interface problems all at once, and without the technical
conservatism that confronts people trying to contribute new interfaces.

~~~
cyphar
I think this is based on a fear of having an API that wasn't needed and needs
to be maintained forever. I bet that Linux's syscall design would be much
nicer if iterating on APIs was possible (as we see in other Unix-likes).

My personal least favourite syscall design misstep is how VFS flags work.
AT_ flags were supposed to be generic, but we now have AT_STATX_; O_ flags are
overused; and some at(2) syscalls, like renameat2(2), have their own brand of
flags (RENAME_). Internally, the VFS has uniform flags for lookups and various
other things, so it's a bit silly that these flags are all basically incompatible.

(Also, the fact that O_RDWR != O_RDONLY|O_WRONLY is just silly. It takes up
just as many bits but makes life needlessly difficult when designing new
extensions to openat(2).)

~~~
quotemstr
If I had my way, the stable application ABI would be libc, not SYSENTER.
Windows gets this right and is able to iterate faster on kernel interfaces as
a result.

A while ago, I proposed a kernel supplied libc add-on that would provide the
syscall wrappers that glibc won't. Maybe we could declare that all
applications must make new system calls through this interface and maintain
syscall compatibility only for the current legacy set of system calls.

~~~
loeg
Pros and cons to the libc ABI approach. It works very well for things that are
C or mostly C-compatible (C++, maybe Rust, maybe other more esoteric
languages), or interpreters with a strong grounding in C (Python; Ruby;
historically, Java). It doesn't mesh well with Golang's choice to avoid
linking libc, which in practice probably only works out because Linux
maintains syscall ABI stability forever.

~~~
quotemstr
If Golang decides to avoid the platform interface library, any resulting
problems are on Golang. I totally get Go's "not C" ethos, but there's no
incompatibility between being not C and making system calls the way that
programs are expected to make them on a platform. Go is _flagrantly_ and
_needlessly_ incompatible with a lot of things, and there's no functionality
or performance rationale for it.

There is absolutely nothing about the Go language or runtime environment that
prevents it from calling into libc to do platform things. Nothing. Every other
managed language environment does it, and Go can too.

~~~
johncolanduoni
The big reason on Linux is libc symbol versioning. You can inadvertently link
to newer versions of glibc symbols if you don’t build on a platform using the
oldest libc you support. Go binaries have the advantage of being able to run
on any Linux kernel they support, regardless of what distro and version they
were built on.

With Rust I usually end up having to compile a bunch of build-time
dependencies from source on my build images because of this, instead of just
using a version of Debian etc. new enough to have the right dependencies in
the package repo.

However, I don’t think Go’s approach makes sense on some other platforms (e.g.
macOS) that don’t suffer from this issue. In those cases you sacrifice forward
compatibility for no real benefit.

~~~
GordonS
Yep, we have an ancient Ubuntu 7.04 build server for exactly this reason -
linking against an old libc.

This is how we've done it for something like 10 years now, and I've never
thought about it since, because it just works - but is there another approach we
could use to link against an older libc version from a newer Ubuntu version?

~~~
viraptor
Does the old libc not compile on the new system anymore? You could install it
on the side, link to it, and fix up paths if needed.

------
majke
I’m trying to convince Antirez to use this technique in redis:
[https://twitter.com/antirez/status/1081197002573139968](https://twitter.com/antirez/status/1081197002573139968)

> There was a problem with that… I don’t remember what exactly, but it was
> like, the structures you fill did not match how Redis stored the data or
> something. I need to try again soon or later because the speedup in Redis
> would be huge.

~~~
drewg123
I wonder if Linux has considered the technique that FreeBSD uses for async
sendfile?

The FreeBSD sendfile() syscall is async (mostly; there is a potential to block
reading metadata in VOP_BMAP). It launches a request to read disk blocks into
pages, and attaches the pages to the socket buffer via mbufs, but marks them
M_NOTREADY so that the protocol layer cannot send them. When the disk I/O
completes, the mbufs are marked ready in the context of the disk interrupt
thread, and then the proto output routine (eg, tcp_output()) is called.

This is more-or-less fire and forget for nginx. (yes, it does check the status
via kqueue, but it does not block)

This was a huge performance win for us at Netflix, and the not-ready concept
is the basis for our kernel-tls layer (mbufs are left not-ready, and encrypted
after the disk read, and then the crypto thread calls the proto output).

~~~
loeg
async sendfile is highly specialized for ... sendfile. It'd be nice to have a
more general approach in FreeBSD as well as Linux. Honestly, NT has a great
model here we could adopt.

------
polskibus
Can anyone compare this to IO Completion Ports on Windows?

------
adontz
After reading the article, which kind of praises AIO, I have the impression
that AIO is misdesigned, has a misleading name, and is overall quite terrible.
I'd better stay away from it.

Also, epoll is good enough for me and my tasks, so there's no actual need to
jump on the code and replace my epoll event loops.

~~~
ahartmetz
Apart from the ugly API (including no glibc support), the most ridiculous part
is that it only works asynchronously with XFS and without caching! Without
caching! In an API that is supposed to improve performance!

I think we just need I/O readiness like epoll for disk I/O. AIO for disk seems
to use the completion model ("do it and tell me when you're done" like
Windows), which has higher maximum performance, but is more difficult to use /
integrate. So you can now do both disk and network I/O with AIO... but it's
going to be readiness based for network and completion based for disk. What
the actual fuck.

Edits:

Thinking about it, readiness based reading from disk only works if you tell
the system up front where to read and how much (not necessary for network
sockets), which is new API that you have to use. Reading must be done when the
disk is ready and can't wait for you, so it needs another buffer and memcpy.
Readiness based disk writing also kind of works and it's easier - the system
tells you when the write buffer has room again (threshold selectable or
fixed), but submitted data is not going to be written right away. This might
be the semi-reasonable explanation for the strange O_DIRECT limitation of AIO
- if you need async buffered writes, just let the write cache do its work
(and, you know, randomly block sometimes - never mind reading).

So readiness-based disk I/O could work with reasonably user-friendly new API
for reading and not much worse performance loss than readiness-based network
I/O. No idea about how much work it would take in the kernel.

It would also be nice to have completion-based network I/O with AIO for
symmetry reasons and in case you really need the best performance. Though I
have never worked on something where completion-based network I/O would have
made a meaningful performance difference.

~~~
phs2501
The problem with readiness-based IO for disk is that, without starting the
request, the kernel doesn't know where on the disk it would need to read from,
so it can't possibly tell you whether the data is ready. If you were to limit file IO to
streaming rather than random-access it could potentially work, but that's not
the POSIX interface. Since pipes and sockets are streams they don't have this
issue.

I'm not defending the suckiness of the AIO situation on Linux, but I
understand why it is to some extent.

~~~
loeg
At a minimum it could tell you if the first byte at the fd offset is in cache,
which it does not today. That'd at least get you a filesystem block (or more
likely, 4k page). I agree this is less useful than, say, the full generality
of NT IO completion.

~~~
the8472
Linux has fadvise to request an async read and preadv2(..., RWF_NOWAIT) to try
to get things from the page cache. All that's missing now is a notification.

~~~
loeg
That's the key distinction between polling and select-like waiting.

