
The Cost of Software-Based Memory Management Without Virtual Memory - ingve
https://arxiv.org/abs/2009.06789
======
StillBored
So much of this is application/usage pattern dependent.

Remember those huge page benchmarks? All that perf gain was due to reduced TLB
misses. Something you don't have to worry about if you're turning paging off.

So, IMHO it's not just the couple percent from TLB misses that slows things
down, it's the combination: a couple percent from TLB misses, a couple
percent from the IOMMU, a couple percent from cgroups (yeah, I've seen double-
digit losses in some benchmarks from cgroups), plus other random kernel
features, all combining into a significant loss.

IMHO, a large reason that things like z/TPF on zSeries mainframes still exist
is that the code written ~50 years ago is effectively bare-metal transaction
processing running in what works out to be basically a modern midrange
server's L1 cache. When you add all the overhead of rewriting it "properly"
in Java with a framework, on a heavyweight virtualized OS, you add a couple
orders of magnitude of perf loss. So instead of buying a couple-million-dollar
machine from IBM, you're buying a few million in cloud services. It's no
wonder IBM's remaining customers have such a hard time getting off the
platform.

For a more hands-on example, boot Windows 3.1 in a VM on your laptop. Install
a few apps and give it a spin. The entire OS probably fits in L3 cache, and
that's reflected in everything being basically instant.

~~~
monocasa
I didn't think you could even run code not in an LPAR on a zSeries machine
anymore. At which point you still have all the MMUs (and their associated
TLBs) turned on. It's just like you're running a unikernel rather than
something like Linux, so you should be able to replicate that on commodity
hardware for far cheaper.

~~~
erichocean
You can run Linux like a unikernel by limiting it to just one core at launch,
and giving the other cores 100% to your app. No TLB thrashing, context
switches, nothing. Just your code running without any interference whatsoever
by the kernel or other processes...

And you can still SSH into your machine, run top, gdb, etc. :-)

~~~
monocasa
That's not enough last time I checked. Hence kernels like Kitten that are
Linux syscall compatible, but are designed for far less jitter.

[https://github.com/HobbesOSR/kitten](https://github.com/HobbesOSR/kitten)

~~~
MertsA
The idea isn't to use the Linux kernel for the application's data path at all.
You can have a Linux distro start your application, and once it's running it
doesn't necessarily need to make any syscalls at all. You can pass devices
through to userspace directly using tools like DPDK, SPDK, or NVMeDirect, so
that, e.g., Linux handles the gigabit Ethernet management port while DPDK
controls the 100G Mellanox card. You could boot the server off some SATA SSDs
and use SPDK to give your application direct access to the fancy Optane 905P.
Rather than using OS threads and having the kernel handle scheduling, you can
use green threads and still get concurrency even on a single core, and still
write your code such that you meet any latency or throughput requirements you
have.

You get all the benefits of a Linux server as far as orchestration and your
control layer are concerned but also most of the benefits of running it bare
metal like a unikernel.

~~~
monocasa
> and still be able to write your code such that you can still meet any
> latency or throughput requirements you have.

That's just not true. That's why, for example, hard real-time work using Linux
universally runs the real-time code in a higher-privilege context than the
kernel, not a lower one. Going back to the original topic, TPF-like unikernels
have tighter real-time constraints than Linux can satisfy, even with all the
preemption games you can play. Just like standard real-time code that's, say,
managing the speed of turbines in a jet engine can't run in a regular Linux
process. If you really want Linux in either of these environments, you run
something like L4Linux or RTLinux, which adds a higher-privilege VMM-like
component that can give you real-time guarantees.

That's because even when you're not using the kernel for data-plane ops via
syscalls, that core is still subject to kernel bookkeeping preemption events
like TLB-shootdown IPIs, and pretty much anything in a for_each_cpu block in
the kernel.

~~~
MertsA
I'll admit I shouldn't have said "any" latency requirements but I was speaking
specifically about optional cooperative concurrency within an OS thread. I
realize that this doesn't magically give you a platform on par with the kind
of hard real time deterministic performance you get out of e.g. the new
Cortex-R82.

But as for kernel bookkeeping tasks still causing preemptions, I'm talking
about using isolcpus and nohz_full. Upon digging further, there's the proposed
task isolation mode patch to eliminate literally all possible preemptions, but
even without it, in practice the common wisdom of using isolcpus and nohz_full
seems to be more or less sufficient. Here's a blog entry from May that tests
this exact thing on a modern kernel without the proposed patch.

[https://www.codeblueprint.co.uk/2020/05/03/reducing-jitter-on-linux-with-task-isolation.html](https://www.codeblueprint.co.uk/2020/05/03/reducing-jitter-on-linux-with-task-isolation.html)

Key point: "I’ve successfully used this technique to verify that I can run a
bash busy-loop for an hour without entering the kernel."
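The setup being discussed is usually a boot-parameter-plus-pinning recipe along these lines. The core number `3` and the binary name `./my_app` are placeholders, and exact flags vary by kernel version; treat this as a sketch, not a recipe.

```shell
# Kernel command line (e.g. appended to GRUB_CMDLINE_LINUX):
# fence CPU 3 off from the general scheduler, stop its periodic tick when
# only one task is runnable, and move its RCU callbacks elsewhere.
isolcpus=3 nohz_full=3 rcu_nocbs=3

# After reboot, pin the data-plane process onto the isolated core:
taskset -c 3 ./my_app

# Watch the CPU3 column of /proc/interrupts to confirm the core stays quiet.
cat /proc/interrupts
```

With this in place, the isolated core should see essentially no timer ticks or scheduler activity as long as the pinned task avoids syscalls and shared-mapping changes.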

There are certainly things you could do that would cause a preemption, but to
use your example, you're not going to get an IPI to flush some entry in the
TLB if your program doesn't have shared pages mapped in whose mapping changes.
Writing to an mmapped file will cause the TLB to be modified once those dirty
pages are flushed to disk, and if you're only allowing that to happen via
kthreads on a different processor, then yeah, obviously it needs an IPI. But
you have the option of being as restrictive as you need to be for your
particular use case. If you don't care about the minor latency increase of an
occasional page fault, you don't have to go all-in on a full RTOS.

------
justicezyx
A well motivated exploration.

The paper largely:

1\. Lays out the reasoning for directly physically addressed programs. The
primary argument is that the accumulated complexity in the chip and OS for
virtual memory is too large, and that it limits the design scope and retards
advancement (huge pages are used as an example). Meanwhile, 5 major changes
needed in the OS and application programs to use physical memory directly are
listed, which appear fairly straightforward (no doubt more complexity is
hidden).

2\. The paper then explores the performance penalty of supporting the large
contiguous memory abstraction in such a new programming paradigm. It proposes
2 cases needed to support the abstraction: a dynamically allocated stack and a
tree-based array. Experiments are carried out on programs modified to address
these 2 cases.

One thing I could not easily see is how their experiment rules out the
existing virtual memory's impact on their modified programs, which are
supposed not to need virtual memory.

~~~
projektfu
The problems line up around memory fragmentation and dynamic allocation.
Systems like the classic Macintosh (before OS X) and Windows 1-3 handled this
with an additional level of indirection (handles) and relocation/compaction.
You could also do it with smarter code that explicitly tags pointers so they
can be updated when the data moves.

------
jnwatson
What an odd coincidence. Just yesterday, I was benchmarking the overhead of
turning on the MMU on an embedded system. It was about a 10% hit in
performance, and this is without paging or even changing the mapping at all.

    
    
        In this paper, we envision a _physically addressed_ software
        architecture, without address translation, in which the OS hands out
        fixed-size blocks of memory, and applications allocate data structures
        across those blocks.
    

This is essentially every embedded system I've ever worked on. The dumbest
possible memory allocator starts at a base and increments, never freeing
memory except possibly on a major frame.

~~~
wyldfire
Doug Lea's malloc is not too difficult to port to embedded platforms and will
probably perform better than more naive implementations. You probably need an
sbrk or similar; I'm not sure what else.

But IMO maybe paying that 10% is worth it in a lot of cases for the savings in
design/development cost.

------
1vuio0pswjnm7
Text-only, non-PDF:

[https://www.arxiv-vanity.com/abs/2009.06789](https://www.arxiv-vanity.com/abs/2009.06789)

(arxiv.org also blocks any request without a user-agent)

~~~
eternalban
It's actually pretty cool of arxiv to provide a gzip of the source files for
the pdf, here TeX:

[https://arxiv.org/format/2009.06789](https://arxiv.org/format/2009.06789)

------
bfuclusion
How would you do things like ASLR and prevent people from guessing other
software's memory allocation? Resetting a process's base pointer to 0 hides
information about other processes on the system that, in a multi-tenant
setting, you probably don't want leaking.

~~~
jibcage
It’s still possible to have memory _protection_ without having to do address
translation. In a more extreme example, imagine every program on a system
sharing the same 128-bit address space. There’s plenty of room for everyone,
TEXT, stacks, heaps, etc, and if you try to step on another program’s memory
you can still get a segfault. The Mill architecture sort of works this way,
except it goes even further by leveraging this in a clever way to perform fast
IPC.

[https://m.youtube.com/watch?v=7KQnrOEoWEY](https://m.youtube.com/watch?v=7KQnrOEoWEY)

~~~
bfuclusion
Right, I can imagine that if you wrap memory allocation correctly you can do
that. But CPUs have finite memory, so a program can guess where it sits in
real physical memory, and from that derive a potential list of programs that
already exist. You could probably even run profiling and figure out _what_
other software is running by its allocation totals.

~~~
jhardy54
I'd be interested in how feasible this would be. Even a tiny address space
like 256 bytes has tons of potential ways to be filled: a process could have
1, 2, 4, 8, 16, 32, 64, 128, or even 256 bytes. How do you identify which are
running?

------
nielsbot
Anybody remember classic Mac OS handles? One solution to the lack of VM was to
always use pointers to pointers, so that the underlying allocations could be
moved around by the OS.

