Linux: IO without entering the kernel (kernel.dk)
210 points by pplonski86 35 days ago | 75 comments

This is promising, but for dedicated high-performance I/O I am much more interested in how SPDK is progressing. Avoiding syscalls is great, but without any benchmarks I wonder about the benefits over mmap and vectorizing. You still have to contend with locking, interrupts, copying, and ring 0 abstraction impedance.

[1] https://spdk.io

SPDK follows the same pattern as DPDK (for networking) but it has much less of a practical use case in my experience. You can easily outperform mmap() etc using io_submit() interface that already exists, saturating the capabilities of fast storage hardware, and the overhead of io_submit() is nominal if you are using it well. SPDK won't meaningfully increase your I/O throughput and the API is significantly more difficult to use. I've played around with SPDK in database storage engines and I find it to be inferior for the purpose compared to io_submit() in practical implementations. The number of IOPS required to make SPDK worth considering would be indicative of a more fundamental design flaw in your storage architecture, at least for the foreseeable future.

DPDK, by contrast, can dramatically improve networking throughput and performance. While also not the friendliest API to use, it undeniably improves CPU efficiency for packet processing, with no obviously better alternative.

Do you have a high performing io_submit() example?

A high-performance io_submit()/O_DIRECT implementation looks like a more explicit reimplementation of mmap() and friends, so the performance does not come from the API per se but from what it enables.

The primary difference is that manipulating/measuring the fine-grained runtime internal state of mmap() is difficult/expensive, whereas with io_submit() you have almost perfect visibility into and control of the entire internal state of your ersatz mmap(). With io_submit() you never block on page faults or write backs, since your scheduler explicitly controls when they happen and knows when they complete. Admittedly there is a large implementation gap between the io_submit() APIs and a high-performing mmap() replacement.

Throughput in database-y software is largely driven by the effectiveness of the scheduling -- doing exactly the right thing at exactly the right time. The fine-grained awareness and control of your instantaneous disk I/O and cache state makes it possible to build efficient schedules with io_submit().

With mmap() you are never entirely sure what is going under the hood and Linux will often decide to do suboptimal things at suboptimal times, or choose to ignore you -- most control of mmap() behavior is advisory only.

A bit of a misnomer. It should be mentioned that the I/O and driver interaction are still done in the kernel; only the intermediary buffers are mmap()ed into both kernel and userspace.

That's a step forward from syscalls and excessively frequent context switching, but it is not a real case of userspace driving hardware directly. The latter still holds a lot of promise, but it will require hardware to become smarter on its own and able to support concurrent access.

Something like that is already being done by hardware supporting virtualisable I/O.

This is what DSSD did from ~2014: driving NVMe directly to the D5 from the user app, never incurring a context switch (in fact the D5 was mapped into memory on all the PCIe-connected clients). Latency was astonishingly low, dominated by PCIe switches for writes and NAND reads for, eh, reads.

So you need to dedicate two cores, making them unavailable for general-purpose tasks in order to reduce the syscall overhead for a single application?

If you are coding your application against libaio chances are the entire point of the system is to run that application.

High-performance software generally does divide the system up so that certain cores serve certain roles, e.g. IRQ handling. This is often done to exploit NUMA architecture, ensuring that memory accesses only happen against the local memory controllers and locally attached PCIe where possible.

If you are going to these lengths, you are rarely, if ever, concerned with sharing the system with anyone else.

It's not at all unusual for our HFT and other realtime customers to dedicate cores to particular tasks. They often boot with isolcpus to reduce the number of CPUs visible to the scheduler, then dedicate those CPUs to servicing interrupts or running critical programs. They also turn off all power management so the CPUs are burning at 100% all the time.

Standardizing the kernel-bypass of all the different network cards for absolute minimal latency would make life much easier in HFT.

> They also turn off all power management so the CPUs are burning at 100% all the time.

That seems quite silly to me. That wastes a bunch of TDP and likely loses some Turbo.

You do this to minimize latency. To ensure things happen in microseconds instead of milliseconds, this is what you have to do. It's a deliberate tradeoff: it is better to waste electricity and ensure your deadlines are met than for some to be late and some to be early.

Changing the speed of a CPU is very expensive in time, and can take a couple of milliseconds to complete (of course, most of this is caused by how it is mechanized in the OS).

Transitions from C0 to C1 and back are very fast [0], and basically any application has at least a core and probably several doing kernel things like servicing network IRQs.

I agree that HFT applications should not be using deep C states.

[0] Unless you have a misguided feature like C1E auto-promotion on. Fortunately, recent Linux kernels will override the firmware setting and turn it off. On Sandy Bridge, and possibly newer CPUs, with C1E auto-promotion on, resuming from C1 can take many, many milliseconds.

True. Also overclocking and watercooling are common.

Dedicating cores to specific server tasks is normal in modern high-performance software architectures. It can very substantially improve throughput of the system. For server code generally, and data intensive software specifically, there is a reasonable and valuable presumption that your hardware resources are dedicated to your task. This assumption greatly simplifies the design of high-performance software and tends to match the deployment environment of performance-sensitive software in any case.

Many high-performance software applications are constrained by network/disk bandwidth, so you would have cores doing no productive work regardless. This allows you to re-deploy those unused cores in other creative and useful ways. Specialization of cores has the added benefit of minimizing lock contention and making it simpler to reason about complex interactions between threads.

I don't know if it is the case, but I would happily trade off two cores to get better I/O in most cases on our nodes servicing I/O-bound jobs. Otherwise, those CPU cores are just idling anyway because of I/O.

With current CPU growth leaning towards more and more cores with ever smaller improvements in single-core performance, dedicating spare cores for such performance improvements is not at all a bad idea.

The only thing this really improves is latency, because you can always batch more I/O into a single syscall.

You can batch 1000 writes to the same socket into a single syscall, but you can't batch writing to 1000 different sockets into one syscall. (And it is actually a common occurrence to write the same message to 100 sockets at the same time - e.g. in an IRC server with 100 people in a room.)

There's io_submit which lets you batch IOs on any number of fds. It appears to lack socket support, but I don't see any reason why that couldn't be added.

There is a fundamental difference between disk I/O and network I/O that causes them to be treated differently.

For disk I/O, you have absolute control over the number and type of events that may be pending, allowing you to strictly bound resource consumption and defer handling of those events indefinitely with no consequences. Similarly, you can opportunistically use unused disk I/O capacity to bring future I/O events forward with little cost e.g. pre-fetching. The worst case scenario you have to deal with in terms of disk I/O is self-inflicted, which makes for relatively simple engineering.

Network I/O events, by contrast, are exogenous and unpredictable. You often have little control over the timing or the quantity of these events, and the only limit on potential resource consumption is the bandwidth of your network hardware. Not only do you have to handle the worst case scenario in these cases, you also have little control over what a worst case scenario can look like. This leads to very different design decisions and interactions with the I/O subsystem versus disk.

io_submit() is part of the Linux AIO subsystem. This patch is literally an extension to that subsystem to avoid doing a kernel call to submit AIO requests and collect results.

Sure, but this patch is only useful for single-purpose, HFT-like workloads. Batched io_submit on sockets would be useful for any application sending many small packets to lots of clients, such as game servers.

For network I/O, a userspace-accelerated network stack is probably better, though.

This is basically how the entirety of DPDK using a poll-mode driver works. Each worker thread is doing busy polling on a shared memory segment, ensuring minimal packet processing latency.

This is a bit different though. DPDK completely bypasses the kernel and userspace directly talks with the network adapter.

I think this is mostly for disk I/O and the userspace component still communicates with the kernel that controls the hardware and performs permission checks.

So it’s like netmap for disk.

Yes. But actually you don't only dedicate the cores to the application, but also the I/O device being acted upon (e.g. the drive or network adapter). This makes sense for some very specific high-performance tasks, but not for a general-purpose system.

If the AMD leaks are to be believed, we will have cheap (less than about $400) 16 and 12 core consumer processors next year. I’d happily trade two of those cores off against syscall overhead.

I wonder if this kind of approach could be generalized to all syscalls - pushing them asynchronously to shared memory, then receiving output in arbitrary order once kernel core takes care of it. Is this feasible? Does it make sense to expand this beyond IO? My reasoning is that we already usually have more cores than we need, why not dedicate some specifically for kernel stuff?

I have seen this technique used to implement syscalls for processes running on Intel SGX [0]. In that case it makes even more sense, because exiting SGX, making a syscall, and then reentering SGX is a lot of overhead for a single syscall, while writing to shared memory is doable without exiting the SGX environment.

[0]: https://www.usenix.org/system/files/conference/osdi16/osdi16...

Talking with GPUs works a lot like this using modern APIs. You fill a command buffer with stuff and then asynchronously get results back "later", using synchronization primitives and memory mapping. There are syscalls to kick off the command buffers, but those can be (and often are) implemented in userspace by the driver.

I believe that describes the L4 messages. So, yes, it can be generalized, and there is a widely used kernel family out there that only uses it.

But you will lose some performance on some of the syscalls. While an interruption is expensive, memory access is expensive too. When you have only one of the options, you will have stuff you can't optimize. The thing with IO is that the slower part of the memory access can be done by a co-processor, so you get the entire CPU available for more important work.

Modern L4 use synchronous messaging because asynchronous messages leave you vulnerable to DoS.

I had this exact idea long ago. The syscall-interaction language would have to be different due to the mismatch: a graph language of some sort to describe dependent calls and batches, and a language that makes async less painful. The non-obvious benefit is that a compiler could optimize across protection domains.

Do you think that model can usefully be extended to distributed services to reduce round trips?

There is mmap for read/write (some of the most used syscalls).

There is also the vdso page [1] for things like gettimeofday etc.

[1] http://man7.org/linux/man-pages/man7/vdso.7.html

There's a concept called exception-less syscalls, which might be similar to what you're thinking.

My idea is to avoid context switching like in VDSO, but with a generalized solution that would allow this sort of behavior for all system calls. Could you point me to any write-ups about exception-less syscalls?

See the FlexSC paper at the bottom here: https://www.usenix.org/conference/osdi10/flexsc-flexible-sys... or more recently SCONE in the context of Intel SGX: https://www.usenix.org/conference/osdi16/technical-sessions/...

That's almost literally IOCPs.

Interesting. Where could I read up on that?

I'm relatively sure IOCP still requires two syscalls per IO.

Yes, that's why I wrote almost. IOCPs do the other half (receiving results in arbitrary order in a thread pool managed by the kernel). I don't see any particular reason why a batched syscall wouldn't work for it. (Note that Windows has ReadFileScatter/WriteFileGather, which do multiple buffer transfers on the same handle [as opposed to multiple reads from different handles].)

IOCP is the opposite of polling. I really don't see the resemblance here.

Would someone care to summarise the possible benefits of being able to do polled IO, without entering the kernel?

In traditional I/O, a hardware interrupt is triggered whenever data arrives at the hardware boundary, and the interrupt can get serviced by any core that is available to the scheduler. One can imagine how much overhead is involved in context switching whatever that core was doing before, setting up the registers, moving data, and then relinquishing the core back to the OS. In this model, by contrast, dedicated cores serve I/O in a memory-mapped, ring-buffer-like data structure sized to your application's needs. There is no allocation/deallocation overhead, no management beyond moving a pointer, and no context switching. If you can spare the cores, this can significantly improve performance.

In one use case, I was able to quadruple performance on a 32-core Xeon by installing four 10 Gbps Ethernet cards and dedicating the first eight cores to I/O (two per interface). This is all about latency, but with proper care it also improves throughput.

Do you have to write your own software to do this, or can it be accomplished through OS configuration?

The motivating benefit is performance, but a side benefit the author mentioned on Twitter https://twitter.com/axboe/status/1073320502532263936 is sidestepping Meltdown and similar vulnerabilities that stem from having the kernel and the application in the same address space (even though they're separated by a privilege boundary). In a scheme like this, you can theoretically dedicate one core to the application and a separate one to the kernel, and minimize speculation, cache sharing, etc. between the two. The application and the kernel share a portion of memory, so the kernel doesn't ever run on the application's CPU.

This is questionably practical for a general-purpose machine, but for a server system used entirely as a hypervisor, or web server, or file server, or something, it might fit really well.

But wouldn't having the kernel pinned to a different core hurt performance due to NUMA, or through having to do lots of cross-calls?

Depends on the use case; keep in mind that syscalls are slow, too. If you have an application that does significant computation on lots of data (think a scientific calculation/simulation), having another core on the same socket read ahead from disk to RAM might be much more efficient than pausing computation to read synchronously. Or if you're a file server that is just passing things back to the kernel's network layer, you might not even need to see the contents of RAM yourself.

Someone more familiar with kernel workings than me should clarify, but my understanding is that IO generally happens via a syscall which requires the thread/process in question to context switch between userspace and kernel space, which can be very expensive. By enabling IO polling in userspace, you get to avoid that context switching.

Yes. Instead of using a syscall to get/issue events you use a mapped ring buffer. See https://lwn.net/Articles/743714/

Performance, performance and performance.

Performance, latency variability and sanity. It’s somewhat easier to write applications without having to make a syscall, which may have unknown latency (even if it’s a non blocking poll).

Should be about 50x faster, based on my rough estimates.

The idea of avoiding context switches by using memory regions to communicate with the kernel reminds me of FlexSC[0], which was a way to have "async" system calls and "flexible scheduling" of them.

[0]: https://www.usenix.org/legacy/events/osdi10/tech/full_papers...

I've never totally followed why AIO is its own set of interfaces instead of a generic async system call mechanism based on something like this. Write a bunch of requests to a page in the form of the registers that would make up the syscall, then wait for one to be completed using a futex or something (where futex(2) remains an actual system call).

You just reinvented async-await/continuations. It's a great idea.

Whatever happened to Singularity, where everything ran in the same address space?

Morphed into Midori, which never made it past being a research project. http://joeduffyblog.com/2015/11/03/blogging-about-midori/ has some posts about other things they explored in Midori.

Does spectre completely kill the hope of single address space OSes ever seeing the light of day?

JX OS took that approach for most of its protection with the code being open source:


The following may also be of interest to readers.


Is it dangerous to perform I/O bypassing the kernel? I suppose not in the case of this commit, but isn't the kernel there to check the application's calls?

The kernel still handles the I/O. You just don't need to enter the kernel (make a syscall) to issue it; it is done via a kernel<->user memory-mapped ring buffer. Roughly, at least; I don't really know the details well. See https://lwn.net/Articles/743714/

So you don't "ask" the kernel, but pump your data into a special memory area where the kernel will perform I/O with it?

I think the kernel provides results in a special memory area where you can read them without using system calls. Performing operations in the kernel on a memory area writable by user space sounds really dangerous.

It is possible. As you say, not completely problem free and indeed has raised some discussions regarding security implications. AFAICT documentation is scarce (non-existent). Here is a previous patch prepping for this one, with sample code to both submit/reap events: https://patchwork.kernel.org/patch/10712819/

So this is basically packet MMAP but for disks?

Is there something similar available on FreeBSD?

I believe this is similar to what we will see with Zircon as a default.

This is why I believe multi-core performance with Zircon will exceed that of Linux. The big question is single-core.

There you do not have the same benefit. But I do think Google will do their own CPU optimized for Zircon. I hope they use RISC-V for the ISA; they did with the PVC.

BTW, you appear to be shadowbanned. I vouched for your post to un-kill it.

Thanks! Have no idea why banned and was never given any explanation why. But it is what it is. Guess I could just make a new account.

I don't know if the explanation is true, but from the outside it looks like you were given a very clear explanation:

  sctb 67 days ago [-]
  We've banned this latest shill account. Please do not create   
  any more to abuse Hacker News with a Google agenda.
