DPDK, by contrast, can dramatically improve networking throughput and performance. While it is also not the friendliest API to use, it undeniably improves CPU efficiency for packet processing, and there is no obviously better alternative.
The primary difference is that manipulating/measuring the fine-grained runtime internal state of mmap() is difficult/expensive, whereas with io_submit() you have almost perfect visibility into and control of the entire internal state of your ersatz mmap(). With io_submit() you never block on page faults or write backs, since your scheduler explicitly controls when they happen and knows when they complete. Admittedly there is a large implementation gap between the io_submit() APIs and a high-performing mmap() replacement.
Throughput in database-y software is largely driven by the effectiveness of the scheduling -- doing exactly the right thing at exactly the right time. The fine-grained awareness and control of your instantaneous disk I/O and cache state make it possible to build efficient schedules with io_submit().
With mmap() you are never entirely sure what is going on under the hood, and Linux will often decide to do suboptimal things at suboptimal times, or choose to ignore you -- most control of mmap() behavior is advisory only.
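For concreteness, here is a minimal sketch of the io_submit() pattern described above, assuming Linux libaio (link with -laio) and a hypothetical "datafile": the application decides exactly when the read starts and learns exactly when it completes, rather than faulting on an mmap()ed page at an unpredictable moment.

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int fd = open("datafile", O_RDONLY | O_DIRECT);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        void *buf = NULL;
        if (posix_memalign(&buf, 4096, 4096)) return 1;   /* O_DIRECT wants aligned buffers */

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);             /* read 4 KiB at offset 0 */

        if (io_submit(ctx, 1, cbs) != 1) {                /* we choose when the I/O starts */
            fprintf(stderr, "io_submit failed\n"); return 1;
        }

        /* ... schedule other useful work here instead of blocking ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);               /* and we know exactly when it finished */
        printf("read completed: %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        return 0;
    }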
That's a step forward from syscalls and excessively frequent context switching, but it is not a real case of userspace driving hardware directly. The latter still holds a lot of promise, but it will require hardware to be smarter on its own and able to support concurrent access.
Something like that is already being done by hardware supporting virtualisable I/O.
High-performance software generally does divide the system up so that certain cores serve certain roles, e.g. IRQ handling. This is often done to exploit NUMA architecture, ensuring that memory accesses only hit the local memory controllers and locally attached PCIe where possible.
If you are going to this length, you are rarely, if ever, concerned with sharing the system with anyone else.
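A rough sketch of that core-dedication idea (Linux-specific, using pthread_setaffinity_np; the core numbers and loop bodies are just placeholders): pin an I/O thread to one core and a worker thread to another, so each role owns its core and, ideally, its NUMA-local memory.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *io_loop(void *arg)     { (void)arg; /* poll the NIC / disk here */ return NULL; }
    static void *worker_loop(void *arg) { (void)arg; /* application work here */    return NULL; }

    /* Create a thread and pin it to a single core. */
    static int spawn_pinned(pthread_t *t, void *(*fn)(void *), int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);                       /* placeholder core number */
        if (pthread_create(t, NULL, fn, NULL)) return -1;
        return pthread_setaffinity_np(*t, sizeof(set), &set);
    }

    int main(void) {
        pthread_t io_thr, work_thr;
        spawn_pinned(&io_thr, io_loop, 0);        /* e.g. core 0 handles I/O */
        spawn_pinned(&work_thr, worker_loop, 2);  /* e.g. core 2 does application work */
        pthread_join(io_thr, NULL);
        pthread_join(work_thr, NULL);
        return 0;
    }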
That seems quite silly to me: it wastes a bunch of TDP and likely loses some Turbo headroom.
Changing the speed of a CPU is very expensive in time and can take a couple of milliseconds to complete (of course, most of this is caused by how it is mechanized in the OS).
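One way to sidestep that transition latency (a sketch, assuming the standard sysfs cpufreq interface and root privileges) is to pin the governor to "performance" so the OS never ramps the clock down in the first place:

    #include <stdio.h>

    /* Sketch: force cpu0's frequency governor to "performance".
       Assumes the usual sysfs cpufreq path; repeat for each CPU you care about. */
    int main(void) {
        const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
        FILE *f = fopen(path, "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("performance\n", f);
        fclose(f);
        return 0;
    }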
I agree that HFT applications should not be using deep C states.
Unless you have a misguided feature like C1E auto-promotion turned on. Fortunately, recent Linux kernels will override the firmware and turn it off. On Sandy Bridge, and possibly newer CPUs, with C1E auto-promotion on, resuming from C1 can take many, many milliseconds.
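For completeness, a sketch of the usual userspace workaround (assuming the /dev/cpu_dma_latency PM QoS interface and sufficient privileges): write a latency target of 0 and hold the file open, and the CPUs stay out of deep C states for the lifetime of the process.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Request a 0 us wakeup latency; the kernel honours it only while the fd stays open. */
        int fd = open("/dev/cpu_dma_latency", O_WRONLY);
        if (fd < 0) { perror("open /dev/cpu_dma_latency"); return 1; }

        int32_t target_us = 0;
        if (write(fd, &target_us, sizeof(target_us)) != sizeof(target_us)) {
            perror("write");
            return 1;
        }

        pause();          /* keep the fd open; deep C states stay disabled until we exit */
        return 0;
    }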
Many high-performance software applications are constrained by network/disk bandwidth, so you would have cores doing no productive work regardless. This allows you to re-deploy those unused cores in other creative and useful ways. Specialization of cores has the added benefit of minimizing lock contention and making it simpler to reason about complex interactions between threads.
With current CPU growth leaning towards more and more cores and ever smaller improvements in single-core performance, dedicating spare cores to such performance improvements is not at all a bad idea.
For disk I/O, you have absolute control over the number and type of events that may be pending, allowing you to strictly bound resource consumption and defer handling of those events indefinitely with no consequences. Similarly, you can opportunistically use unused disk I/O capacity to bring future I/O events forward at little cost, e.g. pre-fetching. The worst case scenario you have to deal with in terms of disk I/O is self-inflicted, which makes for relatively simple engineering.
Network I/O events, by contrast, are exogenous and unpredictable. You often have little control over the timing or the quantity of these events, and the only limit on potential resource consumption is the bandwidth of your network hardware. Not only do you have to handle the worst case scenario in these cases, you also have little control over what a worst case scenario can look like. This leads to very different design decisions and interactions with the I/O subsystem versus disk.
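On the disk side, a minimal illustration of that opportunistic pre-fetching (hedged: posix_fadvise() is only a hint the kernel may ignore, and "datafile" is a placeholder): tell the kernel which range you will need soon, so idle disk capacity can pull it in before you actually block on it.

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("datafile", O_RDONLY);      /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        /* Hint that the first 1 MiB will be needed soon; the kernel can use idle
           disk bandwidth to bring it into the page cache ahead of time. */
        if (posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED) != 0)
            fprintf(stderr, "posix_fadvise failed\n");

        /* ... later reads of that range should hit the page cache ... */
        return 0;
    }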
I think this is mostly for disk I/O, and the userspace component still communicates with the kernel, which controls the hardware and performs permission checks.
But you will lose some performance on some of the syscalls. While an interrupt is expensive, memory access is expensive too. When you have only one of the options, you will have things you can't optimize. The thing with I/O is that the slower part of the memory access can be done by a co-processor, so the entire CPU is available for more important work.
There is also the vDSO page for things like gettimeofday() etc.
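As a small illustration (a sketch; the point is only that these calls usually never enter the kernel), clock_gettime() and gettimeofday() are typically dispatched through the vDSO, so timestamping a hot loop costs a function call rather than a syscall:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);   /* usually serviced by the vDSO, no kernel entry */
        for (volatile int i = 0; i < 1000000; i++)
            ;                                 /* stand-in for real work */
        clock_gettime(CLOCK_MONOTONIC, &b);

        long ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
        printf("elapsed: %ld ns\n", ns);
        return 0;
    }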
In one use case, I was able to quadruple performance on a 32-core Xeon by installing four 10 Gbps Ethernet cards and dedicating the first eight cores to I/O (two per interface). This is all about latency, but with proper care it also improves throughput.
This is of questionable practicality for a general-purpose machine, but for a server system used entirely as a hypervisor, or web server, or file server, or something, it might fit really well.
This is why I believe multi-core performance with Zircon will exceed that of Linux. The big question is single-core performance.
You do not get that benefit. But I do think Google will build their own CPU optimized for Zircon. I do hope they use RISC-V for the ISA; they did with the PVC.
sctb: We've banned this latest shill account. Please do not create any more to abuse Hacker News with a Google agenda.