Remember those huge page benchmarks? All that perf gain came from reduced TLB misses, something you don't have to worry about if you're turning paging off.
So, IMHO it's not just the couple percent from TLB misses that slows things down, it's the combination: a couple percent from TLB misses, a couple percent from the IOMMU, a couple percent from cgroups (yeah, I've seen double-digit losses from cgroups in some benchmarks), and a couple percent from other random kernel features, all combining into a significant loss.
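For context on the huge-page point, here's a minimal sketch of how such a benchmark might request transparent huge pages on Linux; the buffer size and layout are my own illustration, not from any particular benchmark:

```c
/* Ask for transparent huge pages on a large anonymous buffer.
 * With 2 MiB pages, one TLB entry covers 512x the memory of a
 * 4 KiB page, so the TLB miss rate on big working sets drops. */
#include <stdio.h>
#include <sys/mman.h>

#define LEN (1UL << 30) /* 1 GiB, purely illustrative */

int main(void) {
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(buf, LEN, MADV_HUGEPAGE) != 0) /* a hint, not a guarantee */
        perror("madvise");
    /* ... benchmark workload touches buf here ... */
    munmap(buf, LEN);
    return 0;
}
```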
IMHO, a large reason that things like ztpf/zseries mainframes still exist is that the code written ~50 years ago is effectively bare-metal transaction processing running on what works out to be basically a modern midrange server's L1 cache. When you add all the overhead of rewriting it "properly" in Java with a framework, on a heavyweight virtualized OS, you add a couple orders of magnitude of perf loss. So instead of buying a couple-million-dollar machine from IBM, you're buying a few million dollars in cloud services. It's no wonder IBM's remaining customers have such a hard time getting off the platform.
For a more relatable example, boot Windows 3.1 in a VM on your laptop, install a few apps, and give it a spin. The entire OS probably fits in L3 cache, and that's reflected in everything being basically instant.
And you can still SSH into your machine, run top, gdb, etc. :-)
You get all the benefits of a Linux server as far as orchestration and your control layer are concerned, but also most of the benefits of running it bare metal, like a unikernel.
That's just not true. That's why, for example, hard real-time work using Linux universally runs the real-time code in a higher-privilege context than the kernel, not a lower one. Going back to the original topic, TPF-like unikernels have tighter real-time constraints than Linux can satisfy, even with all the preemption games you can play. Just like standard real-time code that is, say, managing the speed of turbines in a jet engine can't run in a regular Linux process. If you really want Linux in either of these environments, you run something like l4linux or rtlinux, which creates a higher-privilege, VMM-like component that can give you real-time guarantees.
That's because even when you're not using the kernel for data-plane ops via syscalls, that core is still subject to kernel bookkeeping preemption events like TLB-shootdown IPIs and pretty much anything in a for_each_cpu block in the kernel.
But as for kernel bookkeeping tasks still causing preemptions, I'm talking about using isolcpus and nohz_full. Upon digging further, there's the proposed task-isolation-mode patch to eliminate literally all possible preemptions, but even without that, in practice the common wisdom of using isolcpus and nohz_full seems more or less sufficient. Here's a blog entry from May that tests this exact thing on a modern kernel without the proposed patch.
Key point: "I’ve successfully used this technique to verify that I can run a bash busy-loop for an hour without entering the kernel."
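To make that concrete, here's a minimal sketch of the userspace side, assuming the box was booted with something like isolcpus=3 nohz_full=3 rcu_nocbs=3 (CPU 3 is just an example):

```c
#define _GNU_SOURCE
/* Pin this process to the isolated CPU and spin. With nohz_full, the
 * timer tick stops on that CPU once a single runnable task owns it,
 * so the loop runs without entering the kernel. */
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set); /* the isolated core from the boot cmdline */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    for (;;)
        ; /* busy loop, analogous to the bash loop in the blog post */
    return 0;
}
```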
There are certainly things you could do that would cause a preemption, but to use your example, you're not going to get an IPI to flush a TLB entry if your program doesn't have shared pages mapped in whose mappings change. Writing to an mmapped file will cause the TLB to be modified once the dirty pages are flushed to disk, and if you only allow that to happen via kthreads on a different processor, then yeah, obviously it needs an IPI. But you have the option of being as restrictive as you need to be for your particular use case. If you don't care about the minor latency increase of an occasional page fault, you don't have to go all in on a full RTOS.
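As a sketch of what "being as restrictive as you need to be" can look like: pre-fault and lock everything up front, and stick to private anonymous memory, so the hot loop never takes a page fault and no writeback kthread has to shoot down your TLB entries (sizes are illustrative):

```c
/* Pre-fault and pin all memory before the latency-sensitive phase.
 * Private anonymous memory has no dirty-page writeback, so no kthread
 * on another core will write-protect our PTEs and IPI us. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (64UL * 1024 * 1024) /* 64 MiB, illustrative */

int main(void) {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) { /* no demand paging later */
        perror("mlockall");
        return 1;
    }
    char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0, BUF_SIZE); /* touch every page while faults are still cheap */
    /* ... latency-sensitive loop works on buf here ... */
    munmap(buf, BUF_SIZE);
    return 0;
}
```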
The paper largely does two things:
1. Lays out the reasoning for programming physical memory directly. The primary argument is that the accumulated complexity of virtual memory, in both the chip and the OS, is too large, and that it limits the design space and retards advancement (huge pages are used as an example). They also list 5 major changes needed in the OS and in application programs to use physical memory directly, which appear fairly straightforward (no doubt more complexity is hidden).
2. Explores the performance penalty of supporting a large contiguous memory abstraction in the new paradigm. They propose 2 mechanisms needed to support the abstraction: a dynamically allocated stack and a tree-based array. Experiments run modified programs that address these 2 cases (a rough sketch of the tree-based-array idea follows below).
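The paper doesn't give code for this (or at least I haven't checked), but here's my guess at what a tree-based array means in practice: a logically contiguous array spread across fixed-size blocks, with the index split done in software instead of by the page-table walker. All names and sizes here are mine:

```c
/* A logically contiguous array stored across fixed-size blocks,
 * indexed through one level of software indirection instead of
 * hardware address translation. */
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096 /* stand-in for the OS's fixed-size block grant */
#define ELEMS_PER_BLOCK (BLOCK_SIZE / sizeof(uint64_t))

typedef struct {
    uint64_t **blocks; /* table of pointers to fixed-size blocks */
    size_t nblocks;
} tree_array;

static tree_array *ta_new(size_t nelems) {
    tree_array *t = malloc(sizeof(*t));
    t->nblocks = (nelems + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK;
    t->blocks = malloc(t->nblocks * sizeof(uint64_t *));
    for (size_t i = 0; i < t->nblocks; i++)
        t->blocks[i] = calloc(1, BLOCK_SIZE); /* one fixed-size block each */
    return t;
}

/* The index split replaces the page-table walk: block, then offset. */
static inline uint64_t *ta_at(tree_array *t, size_t i) {
    return &t->blocks[i / ELEMS_PER_BLOCK][i % ELEMS_PER_BLOCK];
}

int main(void) {
    tree_array *t = ta_new(1 << 20);    /* "array" of 1M uint64_t */
    *ta_at(t, 123456) = 42;             /* reads like a flat array */
    return (int)*ta_at(t, 123456) - 42; /* exits 0 */
}
```

The extra indirection on every access is presumably exactly the overhead their experiments are quantifying.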
One thing I could not easily see is how their experiment rules out the impact of the existing virtual memory system on their modified programs, which are supposed not to need virtual memory at all.
In this paper, we envision a _physically addressed_ software architecture, without address translation, in which the OS hands out fixed-size blocks of memory, and applications allocate data structures across those blocks.
But IMO maybe paying that 10% is worth it in a lot of cases for the savings in design/development cost.
(arxiv.org also blocks any request without a user-agent)
Never had a chance to do a Unix port to the ST, but it would have been fun to use that hardware.
In general, you could enforce the use of any memory-safe language.
In this paper they use/reference something called CARAT (Compiler and Runtime Address Translation), which works at the level of LLVM IR. It seems interesting, but I haven't read through the CARAT papers yet.