
Two new ways to read a file quickly - chmaynard
https://lwn.net/SubscriberLink/813827/5a28c7fcd5ffadbb/
======
adrianmonk
I'm not a kernel hacker, but the Linux kernel now has an in-kernel VM (BPF),
doesn't it? As I understand it, it started with letting you load programs into
the kernel to filter network packets, but now it allows several other things.

So, readfile() combines 3 syscalls into 1 by adding a dedicated new system
call. Would another approach be to make these same calls (openat(),
read(), and close()) and some others available to BPF programs?

The idea is to provide a more general mechanism, rather than solving this one
case. Obviously, the kernel change would need to be feasible, and the result
would need to be safe and fast.

Then a user mode program could effectively just write its own readfile(), or
some other variation in case that exact sequence of calls isn't what it needs.

(An even more out-there idea is to have the kernel auto-detect situations like
this, examine the native code, then after verifying that it's simple and
harmless, just move a whole chunk of code or tight loop into the kernel. In
effect, it would dynamically generate a new syscall.)

~~~
hinkley
I remember a paper decades ago now about a way to reduce RPC call round-
tripping by sending something akin to promises back and forth, so you could
pass the result of one remote call to another one without transferring it
across the network. Kinda think something similar might be useful here.

What typically ends up happening in “services” (or any distributed system) is
that you start creating convenience APIs to speed things up that are mostly
just combinations of other calls, and you stay with the ones that seem like
they might be the most useful, so you don’t slam into combinatorics face
first.

Given the amount of talk I’ve heard about system call overhead lately, it
sounds like the magnitude of the delays is creeping up again (especially post-
Meltdown).

~~~
exrook
I don't know if this is what you were thinking of, but capnproto[0] does
exactly that with a promise-like interface.

Actually this may be the paper you were referencing, from the bottom of the
capnproto page:

> Cap’n Proto’s RPC protocol is based heavily on CapTP[1], the distributed
> capability protocol used by the E programming language[2]. Lots of useful
> material for understanding capabilities can be found at those links.

[0] [https://capnproto.org/rpc.html](https://capnproto.org/rpc.html)

[1]
[http://www.erights.org/elib/distrib/captp/index.html](http://www.erights.org/elib/distrib/captp/index.html)

[2] [http://www.erights.org/index.html](http://www.erights.org/index.html)

~~~
hinkley
Very interesting, but if memory serves, Microsoft had this idea maybe in ‘96.
Though patent expiry-wise, the timing of cap’n proto might make a bit of
sense.

~~~
barrkel
E is from '97 and built on top of ideas implemented before that with
extensions to Java; not sure of the date, but could easily predate '96.

~~~
hinkley
Java was still pretty brand spanking new in ‘96. And RMI wasn’t inflicted on
us until ‘97. But half the stuff we use was invented in the 70’s and explored
in the 80’s anyway.

The classes in college that I have used the most by far are the series that
covered logic and set theory, and the one on distributed computing. The latter
does take some of the fun out of people rediscovering things though.

IIRC, Berkeley had an OS with swarm computing and process migration around
‘87. It wasn’t until 20 years later that the speed of disk vs networking had a
similar imbalance again. And of course the mainframe guys must constantly
think the rest of us are idiots.

~~~
barrkel
Oh I know, I started programming in '92 or so.

The idea of creating a program and sending it across an abstraction boundary
for execution to avoid the cost of continually navigating the abstraction
boundary for each step in the program is reasonably trivial. It was an
approach I used in my first job for optimizing access to data in serialized
object heaps - rather than deserialize the whole session object heap for each
web request (costly), I'd navigate the object graph in the binary blob, but
this required following offsets, arrays etc. The idea of a 'data path' program
for fetching data occurred almost immediately.

In the end a different approach using very short-lived objects which were
little more than wrappers around offsets into the heap combined with type info
turned out to be more usable, and with generational GC, was plenty fast too;
the abstraction stack was thin, the main cost avoided was materializing a
whole heap.

------
correct_horse
If the problem is slow switching between user space and kernel space, why not
have a syscall that allows you to batch process multiple syscalls?

It might work like this: user space allocates a buffer, fills it with syscall
numbers and arguments, then passes it to the kernel. The kernel executes each
syscall in series, placing the return values in the array originally
containing arguments. This syscall would not return to user space until all
syscalls in the buffer are processed.

Does this idea have any potential?

~~~
koolba
The problem is that the calls are related. The first gives you the file
handle, the second reads it, the third closes it. You need the handle in
advance to queue the second and third.

This idea reminds me of pipelined Redis commands. It does work to boost total
throughput, but it’s significantly more complicated to use in practice.

~~~
jzwinck
Imagine a method like this:

    
    
        1. Enqueue A = openat(...)
        2. Enqueue B = read(A, buf, sz)
        3. Enqueue close(A)
        4. Commit, return B to userspace
    

A and B are placeholders; A is never known to userspace, it is simply
chained within the enqueued commands.

~~~
danobi
What if openat() or read() fails? More generally, what if there's a chain of
syscalls and you _want_ one of them to fail?

~~~
JoshTriplett
With the "specific fd" approach, if openat fails, then read and close will
harmlessly fail with EBADF ("Bad file descriptor"). Once you process the
failure of openat, you'll just ignore the failure of read and close.

~~~
yxhuvud
What will happen, though, if the fd is already used by some other area of
userspace? Like what if some lib you link to also happens to hardcode the use
of the same fd? Sounds like unfunny problems to debug.

In the case of io_uring I suppose it is possible to link the calls to fail at the
first failure.

~~~
JoshTriplett
> What will happen though, if the fd is already used by some other area of
> userspace?

That's what the min_fd patch is for, and potentially other systems for
reserving blocks of fds. Much like memory, you'll ask the kernel for a block
of it and then do your own allocations out of that.

~~~
yxhuvud
Ah, it will be possible to allocate multiple blocks of file descriptors? Ok,
then things start to make sense.

------
mayoff
This article says `readfile` would return the number of bytes read. It seems
like it would be better for it to return the number of bytes in the file, but
only write at most `bufsize` bytes into the buffer. That way, if the buffer
isn't big enough, you can allocate a correctly-sized buffer without needing to
`stat` the file. This is how `snprintf` works.

~~~
im3w1l
I think for some "files" that are actually devices or pipes and stuff, the
number of bytes in the file doesn't make sense, while bytes read does make
sense. That may be why.

------
yxhuvud
O_SPECIFIED seems like a strange solution to me.

Sure, I see why the kernel people wouldn't mind having people specify their
own fd numbers, but doing the accounting necessary for that (to avoid
conflicts) in userspace doesn't seem nice. Especially when there will be
libraries that also may make use of the functionality.

Is there even any way to see which file descriptors are open without doing
a syscall?

~~~
StrangeDoctor
It looks like it’s intended to be used with a batched creation of file
descriptors that are known to be allocated but not yet used (to cut down on total
syscalls?) [https://lore.kernel.org/io-
uring/20200211223235.GA25104@loca...](https://lore.kernel.org/io-
uring/20200211223235.GA25104@localhost/)

At least I think I’m reading this correctly.

------
willvarfar
Are there any mainstream databases etc adopting io_uring yet?

I first heard about io_uring here on HN and a rust DB lib called Sled was used
as an example.

But when do we get fast MySQL etc?

I recall a paper from 2012 or so that implemented a PoC syscall buffer so a
bunch of syscalls could be scheduled and executed sequentially with just one
syscall. They claimed 40%+ performance improvements in various programs,
including MySQL. This was of course before Meltdown and other cache
side-channel mitigations were a thing.

So programs, including MySQL and Postgres (hey, I know something about various
DB storage engines), would benefit massively from scheduled sequential
syscalls (easy to adopt, big win) and true async io (io_uring, harder to
adopt, even bigger win).

~~~
antpls
At that point, wouldn't it be simpler to run mysql or postgresql on their own
bare-metal VM with a specialized unikernel that runs only the db process, very
efficiently (with no notion of kernel space vs. user space and no context
switching)?

------
tyingq
_" On its face, readfile() adds nothing new; an application can obtain the
same result with calls to openat(), read(), and close(). But it reduces the
number of system calls required from three to one"_

A little surprised they didn't also roll in an optional lseek() offset in case
you wanted a specific area of the file. Then it would roll 4 syscalls into 1.
The somewhat similar sendfile() accepts an offset.

~~~
masklinn
If you look at the thread, readfile() would be pretty specifically for
extremely short files e.g. procfs and sysfs.

It would _not_ be intended as a convenience API for slurping an entire file
but as a performance API for very short reads (straight open(); read();
close() where the syscall overhead is enormous).

The implementation(s) by Greg Kroah-Hartman don't even bother looping until
read returns empty or the buffer is full.

~~~
tyingq
Sure, but it would be simple to add, and perhaps then useful for lots of small
reads from bigger files. Or things like /proc/pid/mem.

------
trgfs85
A lot of comments in this topic seem to forget that syscalls don't always
succeed. Interrupts are the biggest issue. If an interrupt occurs mid-read,
you get a partial read. If you open something, you tend to get EINTR (very
system dependent). You can't really queue those. And those shouldn't be the
OS's job to deal with.

------
cryptonector
> the cost of reading a lot of little files would be too high

The `readfile()` system call will reduce that cost by some factor, though
probably not all the way to a third (in practice it replaces three system
calls with two, assuming it turns out to be useful). But getting all the data
in one call would be even faster.

Incidentally, a typical read-all-of-this-file function in user-land allocates
a buffer of the right size after doing an `open()` and then `fstat()` to find
the file's size. Whereas a system call to read a file in one go really can't
do that unless it establishes a memory mapping (which is... expensive). So I
rather question the utility of this thing. Also, since one might not know what
size buffer to give it, it might as well have an option to return an open FD.
Or the app would have to `stat()` then `readfile()`, thus replacing three
syscalls (`open()`, `fstat()`, `close()`) with two -- bummer. Or the caller can
`readfile()` repeatedly if the buffer given was not large enough (oof).

No, I'm pretty sure this `readfile()` can't possibly make reading lots of tiny
files fast enough to not want a better solution for getting mounted filesystem
metadata out of the kernel.

~~~
ploxiln
One of the cases posited is `ps` or `top`, which (on Linux) read
/proc/[0-9]+/stat and /proc/[0-9]+/cmdline, and in these cases the caller
knows the max it will want to read (which is pretty small, maybe 512 bytes at
most). So this can make `ps` or `top` 3x faster / more efficient.

~~~
bicolao
There must be other use cases besides ps and top to justify new syscalls,
right? ps and top are not exactly so slow that these optimizations would even
be visible.

------
ncmncm
This sort of thing leads me to think Linux is well down the path to
irrelevance. io_uring, for all its baroque elaboration, saves millions or
billions of system calls. Bloating the kernel to save two of three system
calls on filesystem operations seems nothing short of foolish.

I would welcome being proved wrong.

------
userbinator
The idea of making file reads stateless is interesting. I wonder if it's
atomic too? I think that might be very useful to have.

~~~
loeg
It does sort of raise the question of how interrupts are handled. Short reads
/ EINTR and suddenly this mechanism is less efficient than open+read+close.
Depending on how SA_RESTART is implemented, it may not help. Do consumers have
to mask signals around readfile(), and if so, are we saving any system calls?

------
longtermd
TLDR: the speed-up comes from combining a common sequence of syscalls (open,
read, close) into one single syscall (read file contents).

It violates the unix philosophy of "one call for one specific thing", of
course, but massively speeds up performance in practical applications.

In fact, you could get a LOT of speedup at Linux/Microsoft/... (the OS level)
by figuring out highly common combinations of syscalls typically all called in
a row with mostly the same arguments, and combining them into one single
syscall.

~~~
wahern
Once upon a time Linux solved the "syscalls are slow" problem by making
syscalls fast, rather than introducing convoluted APIs as a workaround, like
other operating systems did.

Perhaps one day we can get back to that model, or maybe now that Linux has
gone corporate those days are gone. Even when you can make something faster
with a simple, straightforward solution, there's always that one corporate
sponsor who doesn't benefit. Before, Linux would just tell them to sod off.

~~~
KarlKemp
That’s (at best) a gross oversimplification of the process, idealizing the
past while ignoring how expectations for the kernel have increased (e.g.
security and privacy), all colored by a layer of bad faith that, to be frank,
seems to originate with some non-specific grievances.

~~~
wahern
I'm simply channeling Linus Torvalds. Here's a sample of a rant/boast from
2000 where Linus defends the "heavy-weight" threading approach against claims
that a proper user-space threading architecture could provide better
performance by requiring fewer syscalls and less memory:

> Yes, you could do it differently. In fact, other OS's _do_ do it
> differently. But take a look at our process timings (ignoring wild claims by
> disgruntled SCO employees that can probably be filed under "L" for "Lies,
> Outright"). Think about WHY our system call latency beats everybody else on
> the planet. Think about WHY Linux is fast. It's because it's designed right.

Source:
[https://lkml.org/lkml/2000/8/25/80](https://lkml.org/lkml/2000/8/25/80)

The same architectural approach was taken with fork: Linux optimized the heck
out of fork to the point where fork on Linux was faster than thread creation
on Windows. On multiple occasions Linus has argued (and proven), that
optimizing the simple but "heavy-weight" approach can reap dividends at least
as great as more complex, more flexible architectures.

Of course, that was then and this is now. And many things have changed,
including the success of Linux. I'm just suggesting, or perhaps implying, that
there's more eagerness to entertain the replacement of traditional syscall
semantics and other subsystems with more complex frameworks than there once
used to be. In the context of Spectre, arguably it would have been easier for
a younger Linus (and younger Linux) to refuse to rearchitect syscalls (with
the concomitant additional kernel and semantic complexity), to tell everyone
to sit tight and wait for AMD and Intel to come up with mitigating hardware
fixes, and in the intervening years just take the performance hit.

------
nibbula
I'm not a kernel hacker either, and even though a readfile system call might
speed some things up, one of the reasons it's desired is that it papers over a
deeper, much harder issue: writing the 'ps' command is a terrible hack, and,
more generally, there is no good way of returning structured data from a
unix-like kernel. It also seems like a bad design not to know how big to make
the buffer, which means that if you want to reliably get the whole file you
either have to put it in a loop reallocating the buffer, or do a stat system
call first and hope you win the race.

To be fair, this has been an issue since the first unix kernels, and even
though /proc and sysctl are an improvement over grepping kernel memory from
user space (the incredibly hackish way old ps worked), in my opinion it's
still a big mess in various ways.

Just in case you aren't aware, Linux ps/top/etc. has to open possibly hundreds
of fake files, parsing from text a bunch of stuff which may or may not be
there, or valid, or the same data type, depending on your kernel configuration
or version. Just because you could write 'ps' with awk doesn't make it good. I
wouldn't object at all to the /proc interface, as a way of enabling simple
tools, if there were also a decent function call interface.

Modern BSDs generally use sysctl, which is better in theory, since you don't
have the overhead of uselessly translating numbers back and forth from text
and hunting for space characters, but it still has the problem that it's very
dependent on subtle changes in the C data structures, which can easily happen
between versions and architectures. It also has the terrible drawback of not
knowing how big the buffer should be, and therefore having to sit in an
allocation-failure loop, probably exactly when you don't want that to happen:
when your system is overloaded with way too many processes. I really don't see
why one couldn't pass a memory allocator function, which could be called in
the same manner as a signal handler.

I don't know how Windows works inside, but to get process information you ask
for a snapshot of the system, from which you can get a set of handles to
processes, which you can then query for information. In other words, it
allocates the appropriate things for you and has a well-defined set of
information you can query, all with a relatively simple function call
interface from C. I wrote a 'ps' that works on linux/bsd/macos/windows, and
even though I'm not at all a fan of Windows, the Windows kernel does this
better than all the unix kernels.

It's actually not that hard to make an acceptable C interface to, say,
process information, if you ignore the problems of data structure variance.
But ignoring that is really the root of the issue, so it seems fairly
pointless. sysctl sort of tries to have a metadata system, but it doesn't
really address the data type problem well, and can end up using C structs from
the kernel with the same variance problems. For example, task_struct on linux
or struct proc on macos is kind of a big mess, but very important. You
wouldn't want to restrict it from changing in any way, but you do want to get
some of that data to users.

Making a good C metadata interface is complex, and although it's been done
many times, it's quite tricky to do well, and you would really want a kernel
interface to be done very well. I like to imagine that there could be
something less bloated than gtk gobject, but more featureful than sysctl.
Unfortunately, when I look at the code involved, I start feeling like C isn't
really a good language for writing an operating system. But I still think that
a well designed structured metadata system would have less overall overhead
than /proc and could even achieve better speed and reliability.

------
shiblukhan
Hmm, how is one actually supposed to use O_SPECIFICFD in real life, though? I
mean, picking some specific fd number can work in trivial, single-threaded
programs maybe, but how is that supposed to work in typical generic userspace
code that has many threads and many libraries/subsystems all running in the
same address space (and fd space), all wanting to take possession of some fd?
Are apps supposed to block off fd ranges ahead of time, by dup()ing /dev/null
a couple of times, or how is that supposed to work? not getting this...

------
ape4
Maybe the new readfile() could be used to implement php's readfile().

~~~
131hn
PHP's readfile is all about "streaming" GB-size files to stdout, not so much
returning a whole file in a single buffer (but file_get_contents is, and
might indeed benefit from the new readfile syscall).

~~~
ape4
Thanks, that was what I meant.

