The Case for a High-Level Kernel-Bypass I/O Abstraction (irenezhang.net)
67 points by matt_d 33 days ago | 26 comments

What do they even mean by "high level"?

Once an API is high-level enough, it becomes unusable for the major users of kernel bypass: high-end networking, GPU and graphics libraries, and low-latency audio.

Nobody else truly needs to bypass the kernel. Even low-latency audio can work with good real-time task handling, which leaves exactly two use cases, both of which already have special DMA handling in hardware. If the argument is for introducing special-case kernel bypasses for high-scale computing, that's already done, and the low-level APIs just get wrapped.

And the Achilles' heel is security.

If it's arguing for making all hardware a kernel-free fabric, it's essentially a move of everything into firmware. Worst case, we get zero memory protection and unfixable bugs.

You would likely be mistaken. Real-time kernels (like Red Hat's MRG product, or the -rt patchset from Thomas Gleixner) have less jitter, i.e. more predictable timing. However, in virtually every average case they have slightly higher latency than the stock Linux kernel.

The reality of 100G+ networking or low-latency networking is that the kernel can't keep up with the interrupts from the hardware, so you turn off per-packet interrupts (adaptive coalescing, ethtool -C for Ethernet), and user-space TCP/IP stacks such as Intel / Linux Foundation's DPDK [1], Solarflare's OpenOnload [2], Mellanox's VMA [3], and Chelsio's Wire Direct [4] exist to fill this need. Heck, even the BBC wrote their own kernel-bypass networking layer [5]! Note that Solarflare, Mellanox, and Chelsio are all heavily used in high-performance computing supercomputers, as well as in finance for electronic trading. If there were no need, there wouldn't be so many options; the market clearly wants them.

[1] https://www.dpdk.org/

[2] https://www.openonload.org/

[3] http://www.mellanox.com/page/software_vma

[4] https://www.chelsio.com/nic/wire-direct/

[5] https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-...

Source: have worked in electronic trading as a Linux engineer for 11-12ish years.

Here is another good overview of this: https://blog.cloudflare.com/kernel-bypass/

High-level means not exposing hardware limitations to the application. The primary target applications are datacenter services, which spend much of their time processing network I/O. As network latencies drop to a few microseconds, datacenter applications like Redis will need kernel-bypass, because the kernel will become too expensive for them. In our experiments with a 25Gb network, the Linux kernel and POSIX interface cost Redis 60% of its latency.

Network I/O is a major bottleneck for Redis, but they leave a lot on the table by being single-threaded. I can speak from experience because I maintain a multithreaded fork: https://github.com/JohnSully/KeyDB

KeyDB can easily get 2-3x the QPS with half the latency.

This is, IMHO, a wrong analysis. Redis can scale while staying single-threaded by running multiple processes: if you then remove the overhead of the network stack, each process can deliver more QPS, not just better latency. By using threads (which Redis now also does in part, getting a 2x performance gain by threading just 0.01% of the code, i.e., a single function) you continue to incur the I/O penalty, just amortized across more threads, and it remains a waste. Also, the latency you measure as reduced with threads is an illusion: it shows up only during benchmarks, because the instance saturates more quickly when running on a single thread. If you measure single-request latencies, they are dominated by the network stack latency.

The lower latency is not an illusion; it is indeed lower latency for servers under high load. If you don't have high load, then I agree the need for threads is eliminated -- but people using Redis for real work have traffic where this becomes an issue. Multiple processes require clustering or sharding, each with its own set of overheads (in both CPU and human terms).

You and I disagree vehemently on this (hence the fork), but I really think you're optimizing for your own simplicity, not the user's. It should be the opposite, since the developer has the most insight into the software.

I don't think you understood my comment. What I mean is that, regardless of what you think of Redis and threads, the fact that doing I/O is so wasteful, adding both latency and CPU time, remains a constant.

How do those processes communicate?

> High-level means not exposing hardware limitations to the application.

This seems counter-intuitive.

Hardware limitations imply a different kind of abstraction than OS-level APIs, which don't expose them to applications anyway.

Even POSIX does not expose hardware limitations.

Rather, "high-level" in the paper means something more like an interface suitable for a wide range of applications, i.e., high-level in that it's meant to be used directly by applications as a portable interface.

I consider POSIX to be high-level. The RDMA and DPDK interfaces are not.

So what's the API look like?

You said you don't want to make users deal with flow control and hardware details. Does that imply a userspace bypass library which does that stuff for us? Does it look POSIX-y?

Solarflare's OpenOnload and Mellanox's VMA both ship as LD_PRELOAD libraries that intercept traditional socket calls, unless you want to code your apps against their APIs directly.

It looks POSIX-like but uses high-level queues and fixes some issues with epoll. The lack of an atomic data unit and the overhead of the poor epoll interface cost too much to retain for kernel-bypass. Take a look at the paper for more details.

Where’s the paper? After looking at your site, it’s not obvious to me what paper to look at.

Well, that is the whole point of unikernels.

Memory-safe languages with rich runtimes need only a mini kernel to run on bare metal.

Windows has been pushing for user space drivers for a while now, including GPUs.

Android is following the same path with Project Treble, and who knows what will happen with Fuchsia.

Likewise on many high integrity OSes for embedded deployment.

Does such a runtime and ecosystem exist for go, and if not, (even if so) what other languages?

Java has bare-metal deployments in the embedded space; PTC and Aicas are the two best-known vendors.

.NET has Netduino and, eventually, Meadow, although it is a subset.

Erlang has GRiSP.

OCaml has MirageOS.

To come back to your question, Go has TinyGo, gVisor, Emgo, and Biscuit.

And you can have a look at this as well, https://nanovms.com/dev/tutorials/running-go-unikernels

And here http://unikernel.org/projects/

> we found that 30% of the cost of the Linux kernel comes from its interface. This overhead is just too much to carry around while using kernel-bypass devices.

One third of the cost is indeed expensive!

Also, the ScyllaDB NoSQL database (a C++ clone of Cassandra) uses the Seastar framework to achieve high I/O throughput.


I've updated the blog post with our experimental results from the Redis benchmark. Here is a link to the graph: http://irenezhang.net/img/demikernel-redis-exp.jpg

BTW, the Demikernel will be open-sourced shortly, as soon as I return from giving a talk at KubeCon Europe.

I am surprised and disappointed that the original paper and the blog post have zero references to unikernel research, despite the fact that the unikernel is pretty much the all-encompassing idea here.

I am wondering whether this is a missing reference or a different understanding of the concept.

Edit: Sorry, I did not really get the difference between a library OS and a unikernel.

It's still a missing reference, though, considering how connected they are.

The Demikernel is not a unikernel. It is a library OS compiled as a series of shared libraries. It is not compiled together with the application and doesn’t take into account what features the application uses. It is designed to work with kernel-bypass hardware, like DPDK.

What about UIO?

It depends on the interface for the drivers to the application. However, UIO doesn't seem to support DMA, which is a non-starter.

RDMA and DPDK both use user-space drivers, which is necessary for kernel-bypass. I'm not advocating for a particular kernel-bypass solution. I'm arguing that if we use kernel-bypass for I/O, we should have a common, efficient, high-level interface for it.
