The Cost of Software-Based Memory Management Without Virtual Memory (arxiv.org)
72 points by ingve 38 days ago | 21 comments



So much of this is application/usage pattern dependent.

Remember those huge page benchmarks? All that perf gain was due to reduced TLB misses, something you don't have to worry about if you're turning paging off.

So, IMHO it's not just the couple percent from TLB misses that slows things down; it's the combination of a couple percent from TLB misses, a couple percent from the IOMMU, a couple percent from cgroups (yeah, I've seen double-digit losses in some benchmarks from cgroups), and other random kernel features, all combining into a significant loss.

IMHO, a large reason that things like z/TPF and zSeries mainframes still exist is that the code written ~50 years ago is effectively bare-metal transaction processing, running on what works out to be basically a modern midrange server's L1 cache. When you add all the overhead of rewriting it "properly" in Java with a framework, on a heavyweight virtualized OS, you lose a couple orders of magnitude of performance. So instead of buying a couple-million-dollar machine from IBM, you're buying a few million in cloud services. It's no wonder IBM's remaining customers have such a hard time getting off the platform.

For a more relatable example, boot Windows 3.1 in a VM on your laptop. Install a few apps and give it a spin. The entire OS probably fits in the L3 cache, and that is reflected in everything being basically instant.


I didn't think you could even run code not in an LPAR on a zSeries machine anymore. At which point you still have all the MMUs (and their associated TLBs) turned on. It's just like you're running a unikernel rather than something like Linux, so you should be able to replicate that on commodity hardware for far cheaper.


You can run Linux like a unikernel by limiting it to just one core at launch, and giving the other cores 100% to your app. No TLB thrashing, context switches, nothing. Just your code running without any interference whatsoever by the kernel or other processes...

And you can still SSH into your machine, run top, gdb, etc. :-)
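
A minimal sketch of that setup in C, assuming the target core (core 3 here, purely illustrative) was already reserved at boot with something like isolcpus:

    /* Pin this process to an isolated core so it runs with essentially no
     * kernel interference. Core 3 is an illustrative choice. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        for (;;) {
            /* application busy loop: no syscalls on the hot path */
        }
    }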


That's not enough last time I checked. Hence kernels like Kitten that are Linux syscall compatible, but are designed for far less jitter.

https://github.com/HobbesOSR/kitten


The idea isn't to use the Linux kernel at all for the application's data path. You can have a Linux distro start your application and then, once it's running, you don't necessarily need to make any syscalls at all. You can pass devices through to userspace directly using tools like DPDK, SPDK, or NVMeDirect, so that you could e.g. let Linux handle the gigabit ethernet management port and let DPDK control the 100G Mellanox card. You could boot the server off of some SATA SSDs and use SPDK to give your application direct access to the fancy Optane 905P. Rather than using OS threads and having the kernel handle scheduling, you can use green threads and still get concurrency even on a single core, and still be able to write your code such that you can still meet any latency or throughput requirements you have.

You get all the benefits of a Linux server as far as orchestration and your control layer are concerned but also most of the benefits of running it bare metal like a unikernel.
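
As a rough illustration of the data path described above, here's what the receive side might look like with DPDK, assuming the EAL, a mempool, and port 0 were already initialized elsewhere (the setup and handle_packet are placeholders, not a complete program):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static void poll_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Busy-poll the NIC queue directly from userspace: no syscalls,
             * no interrupts, no kernel involvement on this core. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* handle_packet(bufs[i]);  -- application logic goes here */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }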


> and still be able to write your code such that you can still meet any latency or throughput requirements you have.

That's just not true. That's why, for example, hard real-time work using Linux universally runs the real-time code in a higher-privileged context than the kernel, not a lower one. Going back to the original topic, TPF-like unikernels have tighter real-time constraints than can be satisfied by Linux, even with all the preemption games you can play. Likewise, standard real-time code that's, say, managing the speed of turbines in a jet engine can't run in a regular Linux process. If you really want Linux in either of these environments, you run something like L4Linux or RTLinux, which creates a higher-privileged, VMM-like component that can give you real-time guarantees.

That's because even when you're not using the kernel for data-plane ops or syscalls, that core is still subject to kernel bookkeeping preemption events like TLB shootdown IPIs, and pretty much anything in a for_each_cpu block in the kernel.


I'll admit I shouldn't have said "any" latency requirements but I was speaking specifically about optional cooperative concurrency within an OS thread. I realize that this doesn't magically give you a platform on par with the kind of hard real time deterministic performance you get out of e.g. the new Cortex-R82.

But as for kernel bookkeeping tasks still causing preemptions, I'm talking about using isolcpus and nohz_full. Upon digging further, there's a proposed task isolation mode patch to eliminate literally all possible preemptions, but even without it, the common wisdom of using isolcpus and nohz_full seems to be more or less sufficient in practice. Here's a blog entry from May that tests this exact thing on a modern kernel without the proposed patch.

https://www.codeblueprint.co.uk/2020/05/03/reducing-jitter-o...

Key point: "I’ve successfully used this technique to verify that I can run a bash busy-loop for an hour without entering the kernel."

There are certainly things you could do that would cause a preemption, but to use your example, you're not going to get an IPI to flush some entry in the TLB if your program doesn't have shared pages mapped in whose mapping changes. Writing to an mmapped file is going to cause the TLB to be modified once those dirty pages are flushed to disk, and if you only allow that to happen via kthreads on a different processor, then yes, obviously it needs an IPI to do that. But you have the option of being as restrictive as you need to be for your particular use case. If you don't care about the minor latency increase of an occasional page fault, you don't have to go all in on a full RTOS.
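
For the page-fault side of this, the usual trick (sketched below, assuming a sufficient RLIMIT_MEMLOCK and an illustrative working-set size) is to lock and pre-fault everything up front so the mappings touched by the isolated core simply never change:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    #define SCRATCH_SIZE (64 * 1024 * 1024)   /* illustrative working set */

    static char scratch[SCRATCH_SIZE];

    int main(void)
    {
        /* Lock current and future mappings into RAM so they are never
         * paged out and never need to be faulted back in. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* Touch the working set once so no page faults happen later. */
        memset(scratch, 0, sizeof(scratch));

        /* ... enter the isolated busy loop ... */
        return 0;
    }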


A well motivated exploration.

The paper largely:

1. Lays out the reasoning for programs that address physical memory directly. The primary argument is that the accumulated complexity in the chip and the OS to support virtual memory is too large, and that it limits the design space and retards advancement (huge pages are used as an example). Meanwhile, the 5 major changes needed in the OS and in application programs to use physical memory directly are listed, and they appear fairly straightforward (no doubt more complexity is hidden).

2. Explores the performance penalty of supporting a large contiguous memory abstraction in such a new programming paradigm. They propose 2 cases needed to support the abstraction: dynamically allocated stacks and tree-based arrays. Experiments are carried out with modified programs that address these 2 cases.
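
My reading of the tree-based array case, as a rough sketch (block size and names are mine, not the paper's implementation): a large logical array is built from fixed-size physical blocks reached through an index table, trading an extra dereference for not needing one contiguous mapping.

    #include <stdlib.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096                              /* bytes per block */
    #define ELEMS_PER_BLOCK (BLOCK_SIZE / sizeof(uint64_t))

    typedef struct {
        uint64_t **blocks;    /* top-level index: one pointer per block */
        size_t     nblocks;
    } tree_array;

    static int ta_init(tree_array *ta, size_t nelems)
    {
        ta->nblocks = (nelems + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK;
        ta->blocks = calloc(ta->nblocks, sizeof(*ta->blocks));
        if (!ta->blocks)
            return -1;
        for (size_t i = 0; i < ta->nblocks; i++) {
            ta->blocks[i] = malloc(BLOCK_SIZE);          /* one fixed-size block */
            if (!ta->blocks[i])
                return -1;
        }
        return 0;
    }

    /* Indexing costs an extra pointer chase instead of relying on the MMU
     * to stitch scattered physical blocks into one flat address range. */
    static inline uint64_t *ta_at(tree_array *ta, size_t i)
    {
        return &ta->blocks[i / ELEMS_PER_BLOCK][i % ELEMS_PER_BLOCK];
    }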

One thing I could not easily see is how their experiments rule out the existing virtual memory's impact on their modified programs, which are supposed to not need virtual memory.


The problems center on memory fragmentation and dynamic allocation. Systems like the Macintosh (before OS X) and Windows 1–3 handled this with an additional level of indirection (handles) and relocation/compaction. You could also do it with smarter code that explicitly tagged pointers so they could be updated when the data moved.
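
The handle trick is just a double indirection; a minimal sketch (names are illustrative, not any particular OS API):

    #include <stdlib.h>
    #include <string.h>

    typedef void **handle_t;           /* pointer to the master pointer */

    static handle_t new_handle(size_t size)
    {
        handle_t h = malloc(sizeof(void *));
        if (h)
            *h = malloc(size);
        return h;
    }

    /* The compactor can move the block behind the program's back; the
     * handle the program holds stays valid because only the master
     * pointer is updated. */
    static void relocate(handle_t h, size_t size)
    {
        void *moved = malloc(size);
        if (!moved)
            return;
        memcpy(moved, *h, size);
        free(*h);
        *h = moved;
    }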


What an odd coincidence. Just yesterday, I was benchmarking the overhead of turning on the MMU on an embedded system. It was about a 10% hit in performance, and this is without paging or even changing the mapping at all.

    In this paper, we envision a _physically addressed_ software architecture, without address translation, in which the OS hands out fixed-size blocks of memory, and applications allocate data structures across those blocks.
This is essentially nearly every embedded system I've ever worked on. The dumbest possible memory allocator starts at a base and increments, never freeing memory except possibly on a major frame.
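
Something like this, as a sketch (arena size and alignment are arbitrary choices here):

    #include <stddef.h>
    #include <stdint.h>

    #define ARENA_SIZE (1 << 20)

    static uint8_t arena[ARENA_SIZE];
    static size_t  arena_used;

    static void *bump_alloc(size_t size)
    {
        size = (size + 7u) & ~(size_t)7u;      /* keep 8-byte alignment */
        if (arena_used + size > ARENA_SIZE)
            return NULL;                       /* out of arena */
        void *p = &arena[arena_used];
        arena_used += size;
        return p;
    }

    /* "Free" everything at once, e.g. at the major frame. */
    static void bump_reset(void)
    {
        arena_used = 0;
    }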


Doug Lea's malloc is not too difficult to port to embedded platforms and will probably perform better than more naive implementations. You probably need an sbrk or similar; not sure what else.
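
For what it's worth, the sbrk-style hook can just carve increments out of a static arena; a hedged sketch (the heap size is arbitrary, and wiring it up via MORECORE is how I remember dlmalloc's configuration working, so check the port):

    #include <stddef.h>
    #include <stdint.h>

    #define HEAP_SIZE (256 * 1024)

    static uint8_t heap[HEAP_SIZE];
    static size_t  heap_brk;

    void *my_sbrk(ptrdiff_t increment)
    {
        /* Shrinking is ignored for simplicity. */
        if (increment < 0 || heap_brk + (size_t)increment > HEAP_SIZE)
            return (void *)-1;                 /* sbrk's failure convention */
        void *prev = &heap[heap_brk];
        heap_brk += (size_t)increment;
        return prev;
    }
    /* dlmalloc would then be built with something like
     *   #define MORECORE my_sbrk
     * (configuration detail from memory; verify against the port) */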

But IMO maybe paying that 10% is worth it in a lot of cases for the savings in design/development cost.


Text-only, non-PDF:

https://www.arxiv-vanity.com/abs/2009.06789

(arxiv.org also blocks any request without a user-agent)


It's actually pretty cool of arxiv to provide a gzip of the source files for the pdf, here TeX:

https://arxiv.org/format/2009.06789


How would you do things like ASLR and prevent people from guessing other software's memory allocation? Resetting a process's base pointer to 0 hides information about other processes in the system that, in a multi-tenant system, you probably don't want leaking.


It’s still possible to have memory protection without having to do address translation. In a more extreme example, imagine every program on a system sharing the same 128-bit address space. There’s plenty of room for everyone, TEXT, stacks, heaps, etc, and if you try to step on another program’s memory you can still get a segfault. The Mill architecture sort of works this way, except it goes even further by leveraging this in a clever way to perform fast IPC.

https://m.youtube.com/watch?v=7KQnrOEoWEY


In the 1980s I helped design a "one gate delay MMU" for the Atari ST. It gave you relocation and bounds checking in two directions ("text" going up from zero, stack going down from infinity) in power-of-two-size increments, and we fit it into the existing DRAM controller path and timing.

Never had a chance to do a Unix port to the ST, but it would have been fun to use that hardware.


Right, I can imagine that if you wrap memory allocation correctly you can do that. But machines have finite memory, so a program can guess where it sits in real physical memory and from that derive a potential list of programs that already exist. You could probably even run profiling and figure out _what_ other software is running, by their allocation totals.


I'd be interested in how feasible this would be. Even a tiny address space like 256 bytes has tons of potential ways to be filled (the factors of 256): a process could have 1, 2, 4, 8, 16, 32, 64, 128, or even 256 bytes. How do you identify which programs are running?


The security talk is probably a better example for your point.

https://www.youtube.com/watch?v=5osiYZV8n3U


ASLR doesn't prevent you from guessing, it merely makes it more difficult to guess.

In general you could force the use of a memory-safe language. In this paper they use/reference something called CARAT, a compiler- and runtime-based address translation scheme that works at the level of LLVM IR. It seems interesting, but I haven't read through the CARAT papers yet.


Anybody remember Mac OS handles? One solution to lacking VM was to always use pointers to pointers. That way the underlying allocations could be moved around by the OS.
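
Roughly like this, from memory of the classic Toolbox API (treat the details as approximate):

    #include <MacMemory.h>

    void example(void)
    {
        Handle h = NewHandle(1024);    /* relocatable block */
        if (h == NULL)
            return;

        HLock(h);                      /* pin it while we hold a raw pointer */
        (*h)[0] = 42;                  /* dereference the master pointer */
        HUnlock(h);                    /* let the Memory Manager move it again */

        DisposeHandle(h);
    }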




