
Virtual Memory Tricks - signa11
http://ourmachinery.com/post/virtual-memory-tricks/
======
krylon
It might be worth pointing out that the OpenBSD developers replaced their
malloc(3) implementation with one based on mmap(2) ... quite a while ago.

IIRC, it attempts to use end-of-page allocation by default for blocks of a
certain size. It is also possible to enable guard pages on a per process basis
or globally.

Thanks to this, memory corruption is much easier to detect and debug on
OpenBSD.

~~~
umanwizard
Isn’t malloc based on mmap (or some similar equivalent like mach_vm_allocate)
in all systems?

~~~
rurban
Yes, we know that OpenBSD is famous for ignoring performance. A syscall for
every tiny malloc, whereas in normal systems the mmap threshold is 128K.

~~~
krylon
Not sure why you are getting downvoted.

IIRC, OpenBSD is slightly more sophisticated than issuing one syscall per call
to malloc(3) and friends. Just slightly, but still. ;-)

Still, I do not like expressing disagreement by downvoting. Sorry about that.

~~~
rurban
Yeah, I was a bit wrong. It does a syscall for every 4K. Normal systems use a
slow syscall only for mallocs > 128K. I see no advantage other than extremism
in security, while not giving a damn about performance.

------
dragontamer
Shared virtual memory: when a CPU allocates a region of memory, and then the
GPU synchronizes to the same region of memory.

Any reads / writes to that virtual memory address work in both CPU and GPU
code. So the CPU can do highly sequential tasks like balancing a binary tree
(with all of the pointers involved), while the GPU can do highly parallel
tasks like traversing the tree exhaustively. Pointers inside of this shared
region remain legitimate, because each address is correctly identified between
the CPU and GPU.

See this for example: [https://software.intel.com/en-us/articles/opencl-20-shared-virtual-memory-overview](https://software.intel.com/en-us/articles/opencl-20-shared-virtual-memory-overview)

I'm describing "Fine Grained Shared Virtual Memory" btw, in case anyone wants
to know what to search for.

In effect, it works because the CPU does an mmap, the GPU does an mmap, and
the PCIe drivers keep the mmap'd region in sync. With modern acquire / release
memory fence semantics and properly defined atomics, it's even possible to
write "heterogeneous" code where the CPU and GPU work simultaneously on the
same memory space.

In theory. I haven't tested these features out yet, and it's quite possible
that current implementations are buggy. But the features exist and it's
totally possible (to attempt) to do this sort of stuff on hardware today.

~~~
chubot
Hm I hadn't heard of this, that's interesting. That seems fragile but I'm not
an expert in this area.

I found this recent CppCon video pretty interesting. It's about running
vanilla C++ on (nvidia) GPUs and what the implications are for the C++ memory
model:

[https://www.youtube.com/watch?v=86seb-iZCnI](https://www.youtube.com/watch?v=86seb-iZCnI)

------
FRex
I like the article and the tricks. I recall seeing a presentation by a game
developer who mentioned something similar, something about intermixing actual
data pages with something else (read only ones or even unusable ones?) to
trigger an error in case of an out of bounds read/write that was far enough
(which apparently happened often enough for this to be helpful).

The address space is not actually 64 bit and it's not just the OS limiting it.
The CPU/MMU/IDK (I'm not low level enough to get all this stuff) does too in
some ways. I assume these values mean that I can have 2^39 bytes of actual RAM
and 2^48 bytes of virtual address space:

>[frex@localhost ~]$ cat /proc/cpuinfo | grep address

>address sizes : 39 bits physical, 48 bits virtual

(the same line is repeated once per core)

This also came up back when I was looking through PUC Lua 5.1 code for a paper
and (out of curiosity) looking into code of PUC 5.2 and 5.3 to see the changes
and into code of LuaJIT for comparison.

LuaJIT uses top bits of a pointer to store type info in case of light userdata
pointers: [https://www.freelists.org/post/luajit/Few-questions-about-
Lu...](https://www.freelists.org/post/luajit/Few-questions-about-LuaJIT-
internals-on-64-bit,1)

This reminds me I should probably finally write up, in English, how the PUC
Lua 5.1 VM works and implements its features. If anyone is interested, I'd
appreciate a word letting me know there's a market for it, so I don't fear
writing down stuff I already grok for no one to read and benefit from.

~~~
colejohnson66
> I assume these values mean that I can have 2^39 bytes of actual RAM and 2^48
> bytes of virtual address space:

Yes. There’s no point having 64 physical address lines if 25 of them aren’t
going to be used by anything.

~~~
IcePic
Amiga programming manuals were very clear about not (ab)using the top 8
address bits, even though the A1000/500/2000 had only 24 physical address pins
(limiting them to 16M). Of course, Microsoft BASIC for Amiga used the upper
bits anyway, and when the 68020+ cards arrived it broke, since the upper 8
bits now actually pointed somewhere.

What is inconceivable today will be reality at some point.

~~~
dfox
On amd64 (and IIRC also Alpha), the hardware checks that the unused bits of an
address are all equal to the highest used bit (or IIRC all zero on Alpha) to
discourage this kind of trick.

And I vaguely remember that ESA/390 is actually a 32-bit address architecture
in hardware, and its 31-bitness comes from using the highest bit of an address
as a flag that such tricks are taking place.

------
tom_mellior
> Also, it is possible that another set of pages could be allocated directly
> after ours in memory, in which case we wouldn’t detect overwrites beyond the
> last page. [...] For the second issue, we could reserve an extra page after
> our pages, but not commit that page. No other object could then claim those
> addresses and writing to them would still cause an access violation. (Note:
> This only works on Windows, where reserve and commit are separate
> operations.)

On Linux (and POSIX in general), you can do the same thing by allocating the
guard page as described, then setting its protection to PROT_NONE using
mprotect(2). Any access to the page will cause an error.

~~~
j_s
[http://www.triplefault.io/2017/08/detecting-debuggers-by-abusing-bad.html](http://www.triplefault.io/2017/08/detecting-debuggers-by-abusing-bad.html)

> _Since Windows will incorrectly assume that our int 3 exception was
> generated from the single-byte variant, it is possible to confuse the
> debugger into reading "extra" memory. We leverage this inconsistency to trip
> a "guard page" of sorts._

------
nayuki
My favorite virtual memory trick, in the context of operating system kernel
development, is to make one of the top-level page table entries point to the
physical location of the page table itself. This makes it easy to manipulate
the page table in the virtual address space, without turning off paging. Some
articles on the topic:

* [http://wiki.osdev.org/Page_Tables#Recursive_mapping](http://wiki.osdev.org/Page_Tables#Recursive_mapping)

* [https://medium.com/@connorstack/recursive-page-tables-ad1e03b20a85](https://medium.com/@connorstack/recursive-page-tables-ad1e03b20a85)

------
magnat
Wouldn't using those tricks cause significant performance loss due to kernel
syscall and page table updates needed for each virtual_alloc call?

~~~
ZeppelinPL
Yeah, using syscall to get uniqueId is kind of controversial...

~~~
gizmo
You don't do a syscall for _every_ uniqueId, only one per PAGE_SIZE worth of
ids. It's not that expensive.

Not every object needs a unique identifier. Objects that need them tend to be
larger and have indeterminate lifecycles. Otherwise the overhead of storing a
unique id would by itself be egregious, regardless of the process of
generating the id. If you can afford to store the id then you can afford the
overhead of an occasional syscall.

~~~
amscanne
There are secondary effects that also make this not a great idea.

You may wind up slowly leaking VMAs (the kernel needs to track what’s been
mapped, even if it doesn’t allocate physical pages). This makes future mmap
calls and page faults slower.

Generating a uniqueID is trivial and super fast (one atomic operation). What’s
the benefit here?

------
blibble
The author's virtual_alloc() function is mostly redundant, as most malloc()
and calloc() implementations will use mmap() internally if the allocation is
large (the result being that physical pages won't be allocated until you touch
the memory).

Also, I think that double-mapped ring buffer implementation is bogus without a
volatile keyword in there somewhere.

~~~
DSMan195276
> also I think that double mapped ring buffer implementation is bogus without
> a volatile keyword in there somewhere

I agree the ring buffer is problematic. The `read` function for the
`ring_buffer_t` is a bad design - it returns a pointer into the actual
ring_buffer_t buffer, but also increments the read pointer. There's no
guarantee for how long the returned pointer remains valid - eventually writers
will overwrite it and you will have no idea. I suppose if you're only looking
at a single-threaded environment it's not as big of a deal, but for a normal
implementation `read` should just do a `memcpy` like `write` does.

As for `volatile`, making the `data` pointer `volatile` should make this valid
- though you can't use `memcpy` after that, so you have to do the loop
yourself, unfortunately.

That said, I think you could "cheat" (i.e. don't do this) and ensure the
correct behavior by changing `ring_buffer_t` to have two `data` pointers to
the buffer instead of one - one for reading, and one for writing. The compiler
would have to assume the two pointers can alias each other at any point within
the buffer, which should force it to produce correct code even with multiple
reads and writes in the same function (because when you do a write, it has to
assume data it previously read may have changed). There's no guarantee,
though - a sufficiently smart compiler could make inferences that would break
this. In particular, the C standard says the memory layout is completely
linear, so the compiler could make an assumption like "if address 1 aliases
with the first byte of the other buffer, then address 4097 doesn't", which
would obviously not be true here. Unless you hard-coded the addresses, I don't
think any current compiler would ever make such an optimization - but again,
nobody should do this; I don't know for sure. It's more of a curiosity than a
good idea.

------
faragon
realloc() is very efficient on most POSIX systems (Linux included) because the
C library (e.g. glibc) uses mmap techniques. By contrast, on Windows the
realloc() provided by the Microsoft C library is slow on large arrays because
it does not use mmap techniques (as far as I know, the only option is to use
the VirtualAlloc() function and tune things by hand), so every resize involves
a memory copy if the allocator has no adjacent memory reserved. So this post
is interesting for Windows users working with large arrays and realloc()
calls. (The typical way of reducing the impact of a non-optimal realloc() is
geometric space allocation: when growing, instead of adding space for one
element, grow the array by e.g. 25%, reducing the number of calls to the
re-allocator.)

------
mcculley
On Linux, one used to be able to execute mmap() on /proc/self/mem. I used this
to implement bank switching for an emulator. Despite Torvalds' recent rant
about preserving backward compatibility for userspace, this feature was
removed at some point.

~~~
koverstreet
Did you ever complain on the list, or look into why it was removed?

I'd be really curious why it was removed myself.

~~~
mcculley
I didn't complain. At some point I upgraded my kernel and the code stopped
working. This was over 10 years ago. I seem to remember going to the list and
seeing that it had been removed due to a security problem. I also seem to
remember a Torvalds line about only crazy people wanting to use such a
feature. I had lost interest in maintaining the code, so I didn't pursue it.

