
Linux: IO without entering the kernel - pplonski86
http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=5aeaa1ad235c708e31ad930d1ff6ba6fd39bee91
======
termie
This is promising, but for dedicated high-performance I/O I am much more
interested in how SPDK [1] is progressing. Avoiding syscalls is great, but
without any benchmarks I wonder about the benefits over mmap and vectorizing.
You still have to contend with locking, interrupts, copying, and ring-0
abstraction impedance.

[1] [https://spdk.io](https://spdk.io)

~~~
jandrewrogers
SPDK follows the same pattern as DPDK (for networking) but it has much less of
a practical use case in my experience. You can easily outperform mmap() etc.
using the io_submit() interface that already exists, saturating the capabilities
of fast storage hardware, and the overhead of io_submit() is nominal if you
are using it well. SPDK won't meaningfully increase your I/O throughput and
the API is significantly more difficult to use. I've played around with SPDK
in database storage engines and I find it to be inferior for the purpose
compared to io_submit() in practical implementations. The number of IOPS
required to make SPDK worth considering would be indicative of a more
fundamental design flaw in your storage architecture, at least for the
foreseeable future.

DPDK, by contrast, can dramatically improve networking throughput and
performance. While also not the most friendly API to use, it undeniably
improves CPU efficiency for packet processing and without an obviously better
alternative.

~~~
devwastaken
Do you have a high performing io_submit() example?

~~~
jandrewrogers
A high-performance io_submit()/O_DIRECT implementation looks similar to a more
explicit reimplementation of mmap() and friends, so the performance does not
come from the API per se but what it enables.

The primary difference is that manipulating/measuring the fine-grained runtime
internal state of mmap() is difficult/expensive, whereas with io_submit() you
have almost perfect visibility into and control of the entire internal state
of your ersatz mmap(). With io_submit() you never block on page faults or
write-backs, since your scheduler explicitly controls when they happen and
knows when they complete. Admittedly there is a large implementation gap
between the io_submit() APIs and a high-performing mmap() replacement.

Throughput in database-y software is largely driven by the effectiveness of
the scheduling -- doing exactly the right thing at exactly the right time. The
fine-grained awareness and control of your instantaneous disk I/O and cache
state makes it possible to build efficient schedules with io_submit().

With mmap() you are never entirely sure what is going on under the hood, and
Linux will often decide to do suboptimal things at suboptimal times, or choose
to ignore you -- most control of mmap() behavior is advisory only.

------
baybal2
A bit of a misnomer. It should be mentioned that I/O and driver interaction
are still done in the kernel; only the intermediary buffers remain mmapped
into both kernel and userspace.

That's a step forward from syscalls and excessively frequent context
switching, but it is not a real case of userspace driving hardware directly.
The latter still holds a lot of promise, but it will require hardware to be
smarter on its own and able to support concurrent access.

Something like that is already being done by hardware supporting virtualisable
I/O.

~~~
FullyFunctional
This is what DSSD did since ~2014: drive NVMe directly to the D5 from the user
app, never incurring a context switch (in fact the D5 was mapped into memory
on all the PCIe-connected clients). Latency was astonishingly low, dominated
by PCIe switches for writes and NAND reads for, eh, reads. EDIT: typo

------
kbwt
So you need to dedicate two cores, making them unavailable for general-purpose
tasks in order to reduce the syscall overhead for a single application?

~~~
rwmj
It's not at all unusual for our HFT and other realtime customers to dedicate
cores to particular tasks. They often boot with isolcpus to reduce the number
of CPUs visible to the scheduler, then dedicate those CPUs to servicing
interrupts or running critical programs. They also turn off all power
management so the CPUs are burning at 100% all the time.

~~~
amluto
> They also turn off all power management so the CPUs are burning at 100% all
> the time.

That seems quite silly to me. That wastes a bunch of TDP and likely loses some
Turbo.

~~~
bdavis__
you do this to minimize latency. to ensure things happen in usecs instead of
msecs, this is what you have to do. deliberate tradeoff. it is better to waste
electricity and ensure your deadlines are met than for some to be late and
some to be early.

changing the speed of a cpu is very expensive in time, and can take a couple
of ms to complete (of course, most of this is caused by how it is mechanized
in the OS).

~~~
amluto
Transitions from C0 to C1 and back are very fast [0], and basically any
application has at least a core and probably several doing kernel things like
servicing network IRQs.

I agree that HFT applications should not be using deep C states.

[0] Unless you have a misguided feature like C1E auto-promotion on.
Fortunately, recent Linux kernels will override the firmware and turn it off.
On Sandy Bridge and possibly newer CPUs with C1E auto-promotion on, resuming
from C1 can take many, many milliseconds.

------
d33
I wonder if this kind of approach could be generalized to all syscalls --
pushing them asynchronously to shared memory, then receiving output in
arbitrary order once a kernel core takes care of it. Is this feasible? Does it
make sense to expand this beyond IO? My reasoning is that we usually have more
cores than we need already, so why not dedicate some specifically for kernel
stuff?

~~~
blattimwind
That's almost literally IOCPs.

~~~
d33
Interesting. Where could I read up on that?

~~~
rrdharan
[https://en.wikipedia.org/wiki/Input/output_completion_port](https://en.wikipedia.org/wiki/Input/output_completion_port)

------
rorhug
Would someone care to summarise the possible benefits of being able to do
polled IO, without entering the kernel?

~~~
snaky
Performance, performance and performance.

~~~
sargun
Performance, latency variability and sanity. It’s somewhat easier to write
applications without having to make a syscall, which may have unknown latency
(even if it’s a non blocking poll).

------
lkurusa
The idea of avoiding context switches by using memory regions to communicate
with the kernel reminds me of FlexSC[0], which was a way to have "async"
system calls and "flexible scheduling" of them.

[0]:
[https://www.usenix.org/legacy/events/osdi10/tech/full_papers...](https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf)

~~~
geofft
I've never totally followed why AIO is its own set of interfaces instead of a
generic async system call mechanism based on something like this. Write a
bunch of requests to a page in the form of the registers that would make up
the syscall, then wait for one to be completed using a futex or something
(where futex(2) remains an actual system call).

~~~
naasking
You just reinvented async-await/continuations. It's a great idea.

------
jbverschoor
What ever happened to singularity, where everything ran in the same space?

~~~
geofft
Morphed into Midori, which never made it past being a research project.
[http://joeduffyblog.com/2015/11/03/blogging-about-
midori/](http://joeduffyblog.com/2015/11/03/blogging-about-midori/) has some
posts about other things they explored in Midori.

~~~
gpderetta
Does spectre completely kill the hope of single address space OSes ever seeing
the light of day?

------
gbrown_
The following may also be of interest to readers.

[https://lore.kernel.org/linux-block/20181117235317.7366-1-ax...](https://lore.kernel.org/linux-block/20181117235317.7366-1-axboe@kernel.dk/)

------
franblas
Is it dangerous to perform some IO bypassing the kernel? I suppose not
regarding this commit, but isn't the kernel there to check the calls of the
application?

~~~
int0x80
The kernel still handles the IO. You just don't need to enter the kernel (via
a syscall) to issue it; it is done via a kernel<->user memory-mapped ring
buffer. Roughly speaking -- I don't really know the details well. See
[https://lwn.net/Articles/743714/](https://lwn.net/Articles/743714/)

~~~
k__
So you don't "ask" the kernel, but pump your data inside a special memory
space where the kernel will perform IO with it?

~~~
tinus_hn
I think the kernel provides results in a special memory space where you can
read it without using system calls. Performing operations in the kernel on a
memory area writeable by user space sounds really dangerous.

~~~
int0x80
It is possible. As you say, it is not completely problem-free and has indeed
raised some discussion regarding the security implications. AFAICT
documentation is scarce (non-existent). Here is a previous patch prepping for
this one, with
sample code to both submit/reap events:
[https://patchwork.kernel.org/patch/10712819/](https://patchwork.kernel.org/patch/10712819/)

------
en4bz
So this is basically packet MMAP but for disks?

------
ksec
Is there something similar available on FreeBSD?

------
glenrivard
I believe this is similar to what we will see with Zircon as a default.

It's why I believe multi-core performance with Zircon will exceed the same
with Linux. The big question is single-core, where you do not have that
benefit. But I do think Google will do their own CPU optimized for Zircon. I
do hope they use RISC-V for the ISA. They did with the PVC.

~~~
spatulon
BTW, you appear to be shadowbanned. I vouched for your post to un-kill it.

~~~
glenrivard
Thanks! Have no idea why banned and was never given any explanation why. But
it is what it is. Guess I could just make a new account.

~~~
nkurz
I don't know if the explanation is true, but from the outside it looks like
you were given a very clear explanation:

    
    
      sctb 67 days ago [-]
      We've banned this latest shill account. Please do not create   
      any more to abuse Hacker News with a Google agenda.
    

[https://news.ycombinator.com/item?id=18179838](https://news.ycombinator.com/item?id=18179838).

