
I’m Not Dead yet; The Role of the Operating System in a Kernel-Bypass Era [pdf] - jsnell
http://irenezhang.net//papers/demikernel-hotos19.pdf
======
mbjorling
It is worth mentioning that the Linux kernel has a new kernel API (io_uring)
that changes the whole argument around using libos designs. With the new
io_uring interface (available since Linux kernel 5.1), peak throughput per core
is 1.7M IOPS, which beats or comes close to SPDK performance[0]. Later updates
to the patches improve the throughput even more.

Jens (the author) has done a great write-up [1].

[0] https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/
[1] http://kernel.dk/io_uring.pdf

~~~
bryanwb
can you elaborate on how io_uring bypasses the kernel?

~~~
ncmncm
io_uring dumps data directly into a ring buffer mapped into the user-level
address space. User code is notified by (at least) an updated atomic counter.
The user process must be finished with the data before the kernel comes around
again to overwrite it. Often that demands that the user process or thread be
bound to a core on which the OS has been forbidden to run anything else, and
that the thread do a carefully circumscribed amount of work, rarely including
memory allocation, I/O, or even system calls, any of which may cause it to be
"lapped" by subsequent writes.
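
To make "lapped" concrete, here is a toy, single-threaded model of that kind of ring. It is only a sketch: the names (OverwritingRing, Reader, poll) are made up for illustration, a plain integer stands in for the atomic counter, and it is not the io_uring API or its real memory layout.

```python
class OverwritingRing:
    """Toy single-producer ring: the writer never waits for the reader."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0  # total items ever written; stands in for the atomic counter

    def produce(self, item):
        # Overwrites whatever was in the slot, read or not.
        self.slots[self.head % self.size] = item
        self.head += 1


class Reader:
    """Consumer that detects when the writer has lapped it."""

    def __init__(self, ring):
        self.ring = ring
        self.tail = 0  # total items consumed (or skipped) so far
        self.lost = 0  # items overwritten before we got to them

    def poll(self):
        """Return the next item, or None if nothing new has arrived."""
        head = self.ring.head
        if head - self.tail > self.ring.size:
            # Lapped: skip ahead to the oldest item that still exists.
            skipped = head - self.ring.size - self.tail
            self.lost += skipped
            self.tail += skipped
        if self.tail == head:
            return None
        item = self.ring.slots[self.tail % self.ring.size]
        self.tail += 1
        return item
```

In a truly concurrent version the reader would also have to re-check the counter after copying a slot out, since the writer may overwrite it mid-read. The toy skips that because nothing here runs in parallel.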

The idea is that the average time to process a packet absolutely must not
exceed the average time between arrivals, and any spikes in arrival rate must
average out, over the size of the ring buffer, to less than the processing
rate.
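
A back-of-the-envelope way to check that condition, as a sketch: model time in ticks, assume a fixed number of packets processed per tick, and track the peak backlog. The function names and the tick model are my own assumptions, not anything from the paper or from io_uring.

```python
def max_backlog(arrivals, service_per_tick):
    """Peak number of unprocessed packets, given arrivals per tick."""
    backlog = 0
    peak = 0
    for arrived in arrivals:
        backlog += arrived
        peak = max(peak, backlog)
        # The consumer drains up to service_per_tick packets each tick.
        backlog = max(0, backlog - service_per_tick)
    return peak


def survives(arrivals, service_per_tick, ring_size):
    """True if the backlog never exceeds what the ring can hold."""
    return max_backlog(arrivals, service_per_tick) <= ring_size


# Average arrival rate is 26 / 8 = 3.25 per tick, below the service rate
# of 4, yet the two-tick burst still needs a ring of at least 16 slots.
burst = [2, 2, 10, 10, 0, 0, 0, 2]
print(max_backlog(burst, 4))   # 16
print(survives(burst, 4, 16))  # True
print(survives(burst, 4, 8))   # False
```

So keeping up on average is necessary but not sufficient: the ring has to be big enough to absorb the worst burst.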

The hamster process pulling from the ring may just be load balancing to a herd
of other threads operating under less stringent conditions, so they might be
permitted i/o.
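
Roughly what that fan-out looks like, as a sketch only. run_dispatcher and handle are hypothetical names, and Python's queue.Queue takes locks, which a real hot path would replace with per-worker lock-free rings.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker there is no more work


def run_dispatcher(packets, n_workers, handle):
    """Fan packets out round-robin to workers; return results per worker."""
    queues = [queue.Queue() for _ in range(n_workers)]
    results = [[] for _ in range(n_workers)]

    def worker(idx):
        while True:
            pkt = queues[idx].get()
            if pkt is _STOP:
                break
            # Workers are off the hot path, so handle() may block or do I/O.
            results[idx].append(handle(pkt))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()

    # The "hamster": does nothing except hand each packet to the next queue.
    for i, pkt in enumerate(packets):
        queues[i % n_workers].put(pkt)
    for q in queues:
        q.put(_STOP)
    for t in threads:
        t.join()
    return results
```

For example, run_dispatcher(range(6), 2, lambda p: p * 10) hands the even-numbered packets to worker 0 and the odd-numbered ones to worker 1.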

~~~
bryanwb
Thanks for the great explanation!

------
ncmncm
Every time somebody comes in with another abstraction scheme, my first thought
is, "great, how do I bypass it?". Give me onload, I use ef_vi. Give me
exasock, I use exanic.

Operating at the lower level never turns out to take more than a hundred or
two lines of code (much less for exanic), and it always eliminates latency
that comes from doing crap I would just need to undo, or redo differently.

AF_XDP and eBPF suggest the promise of making that code portable, yet running
it in the NIC itself, possibly even eliminating a polling thread that uses up
a whole core on the host, but it seems to need more support in the library
available to the eBPF code. Specifically, it needs better access to (pre-
permission-checked and mapped) DMA to host RAM, and precise, accurate
timestamps. They don't need to be pretty, but they need to exist.

~~~
lukego
Amen.

------
bryanwb
Interesting paper, but where is the code? I am not familiar with DPDK, but how
would the control path of Demikernel configure a DPDK device so that it can
service multiple applications? Perhaps this isn't necessary for DPDK?

It is hard to take this paper seriously if it is just a thought experiment and
they haven't actually implemented Demikernel. In figure 3 there is a list of
the syscalls but that isn't enough to convince me that they have actually
implemented Demikernel.

